Search and replace text in a csv file according to a text file
Linux - NewbieThis Linux forum is for members that are new to Linux.
Just starting out and have a question?
If it is not in the man pages or the how-to's this is the place!
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
Search and replace text in a csv file according to a text file
Hi there,
I have a csv file with list of gene symbol. I have another text file with all the gene symbol with corresponding KEGG IDs/Title. I want to replace all the gene symbol in the csv file with corresponding KEGG titles. How can I do this? any suggestion?
Hi there,
I have a csv file with list of gene symbol. I have another text file with all the gene symbol with corresponding KEGG IDs/Title. I want to replace all the gene symbol in the csv file with corresponding KEGG titles. How can I do this? any suggestion?
You've shown us the two files, but haven't shown us what you want to replace/search for, or how you want the output to look. Most importantly, you haven't shown what you have done/tried to to do this. There are many ways to do this, but since we don't know what language(s) you know/want, we can't really offer much, aside from "you can write a script to do this".
This could be a bash script, perl, ruby, or python, and all could easily do this. Post what you've done, and tell us where you're stuck.
Thank you very much for your reply. I want a output like this:
Code:
Cparvum Bmicroti Tparva Pberghei
OG0000000 -
OG0000001 -
OG0000002 -
OG0000003 -
OG0000004 -
OG0000005 -
OG0000006 -
OG0000007 Protein processing in endoplasmic reticulum
OG0000008 Protein processing in endoplasmic reticulum
OG0000009 -
OG0000010 Ribosome biogenesis in eukaryotes
OG0000011 -
OG0000012 -
OG0000013 -
OG0000014 -
OG0000015 -
OG0000016 -
OG0000017 -
OG0000018 -
OG0000019 -
OG0000020 -
OG0000021 -
OG0000022 -
OG0000023 -
OG0000024 Protein processing in endoplasmic reticulum
OG0000025 Protein processing in endoplasmic reticulum
OG0000026 Ribosome biogenesis in eukaryotes
OG0000027 -
-
Then I want to convert it into a presence/absence matrix:
Code:
Cparvum Bmicroti Tparva Pberghei Pchabaudi Pcynomolgi Pfalciparum
Protein processing in endoplasmic reticulum 1 0 0 1 0 1 0
Ribosome biogenesis in eukaryotes 0 0 1 1 1 1
Ok...so again; can you post what you have done/tried on your own and what you've written to accomplish this? Show us where you're stuck?? We are happy to help you, but we aren't going to write your scripts for you.
And thank you for providing sample output...but it just isn't making much sense; what pattern in the text file are you searching for in the CSV, because I can't see what you're trying to search/replace. Instead of MANY lines of samples, post a few of the csv file, and highlight what you're searching for, and post a few of the text file with the relevant data in it.
Thank you very much for your quick reply and for details clarification. In my Orthogroups.txt file I have a list of gene names (The first row are organism names, 1st column is orthogroups name). Each cell contain gene symbol. In the mypathway.txt file I have all the same gene symbols in the 1st column and in the 2nd column corresponding pathway titles. I want to replace all the gene symbol in the Orthogroups.txt file with corresponding 2nd column of the 2nd file. Is it possible? Thanks again.
Which columns are supposed to be the same in each file? Can you show two or three lines that match from each file?
perl has the module Text::CSV which parses CSV properly. From there it is easy to merge two tables. If the two files are tab-delimited text then you might even just use join instead.
Thank you very much for your quick reply and for details clarification. In my Orthogroups.txt file I have a list of gene names (The first row are organism names, 1st column is orthogroups name). Each cell contain gene symbol. In the mypathway.txt file I have all the same gene symbols in the 1st column and in the 2nd column corresponding pathway titles. I want to replace all the gene symbol in the Orthogroups.txt file with corresponding 2nd column of the 2nd file. Is it possible? Thanks again.
First, yes, it's very possible.
Secondly, as you've been asked now several times:
POST WHAT YOU HAVE WRITTEN/DONE/TRIED of your own and tell us where you're stuck. We WILL NOT write your scripts for you, but will be happy to assist you if you're stuck
Can you, as you've been asked before, post one CLEAR example of what you want to search/replace, and what you want to see when you're done??? What you've posted so far makes no sense.
As Turbocapitalist said, perl would make short work of this, but you can also use any number of other languages as well. Since you haven't posted what you've done, what lanuage you're working in, or told us where you're stuck, we can't offer much in the way of advice on how to fix whatever problem you have.
It's not an inconvenience but it does need to be spelled out easily enough to retain any interest among forum members. You can put your data between [code] [/code] tags in the body of your post so they are accessible.
The data we are interested in would be a) two lines from mypathway.txt that match b) a line or more in Orthogroups.txt and c) which fields in a and b that should be examined for a match. All in all, it would be nice thus to see three groups of data between [code] [/code] tags. In that way you'll help us help you since we don't have familiarity with your data and we're volunteering our time for interesting questions like this one.
Thank you very much for helping me all the way. Here is mypathway.txt:
Code:
PVX_088085 Protein processing in endoplasmic reticulum
PVX_114095 Protein processing in endoplasmic reticulum
PVX_123055 Ribosome biogenesis in eukaryotes
PYYM_1032000 -
PYYM_1120600 -
The 1st column contains gene symbols which are extracted from orthogroups.csv file. mypathway.txt has only two columns. 2nd columns may contain multiple values (Text).
Here is my orthogroups.csv file:
Orthogroups.csv contains multiple column. The 1st row is column name and 1st column is the row name. Each cell may contain single/multiple gene symbols or maybe blank (these gene symbols are the 1st column of mypathway.txt) .
I want to replace the gene symbol in the csv file with corresponding 2nd column from my pathway.txt without changing the format of the csv file.
I tried the following perl script:
Code:
# This script was excerpted from http://stackoverflow.com/questions/11678939/replace-text-based-on-a-dictionary
use strict;
use warnings;
open my $fh, '<', 'bioDBnet_db2db_KEGG_Title_final.txt' or die $!;
my %dict = map { chomp; split ' ', $_, 2 } <$fh>;
my $re = join '|', keys %dict;
open $fh, '<', 'Orthogroups_3.csv' or die $!;
while (<$fh>) {
s/($re)/$dict{$1}/g;
print;
}
It gave me the expected output but changed the format of the csv file. The original csv contains 13 columns. But replaced csv showed only 1 column when I load it in R. I want to load the replaced csv in R then convert it into data matrix or binary matrix. The output I got:
I want the 1st row with column names will be same. I want text from each cells to be the row names and make a abundance or binary matrix. Is it possible?
Can you enlighten me a little more or give me some example.
I have converted the text in the csv using this perl script:http://stackoverflow.com/questions/1...n-a-dictionary
But it is giving me more columns than original and I can't load it the output in R.
Best Regards
Zillur
The first two lines just pull in modules to set the rules that the program must adhere to as far as syntax.
Code:
use strict;
use warnings;
The next three lines read in data from a text file and loads it into a hash (lookup table) called %dict. The script expects two columns here. The split() function there expects a single space to separate the columns and stops after the second column is found. You may wish to change the separator to a tab or even a span of white space. What is that file really using for a separator?
Code:
open my $fh, '<', 'mypathway.txt' or die $!;
my %dict = map { chomp; split ' ', $_, 2 } <$fh>;
my $re = join '|', keys %dict;
The next five lines open the data file and go through it line by line replacing any of the keys of the lookup table with the matching data from the lookup table. The first part of the substitution s/// is the other place where you should be paying attention if it is producing extra columns.
Edit: it is using the magic variable $_ which often does not need to be named explicitly.
Code:
open $fh, '<', 'orthogroups.csv' or die $!;
while (<$fh>) {
s/($re)/$dict{$1}/g;
print;
}
The code has been written for stylishness / trendiness not so much clarity. Though I must admit I am not with the "in" crowd as far as the current perl styles go.
Last edited by Turbocapitalist; 03-07-2017 at 12:49 PM.
use strict;
use warnings;
open my $fh, '<', 'k_all_1.txt' or die $!;
my %dict = map { chomp; split '\t', $_, 2 } <$fh>;
my $re = join '|', keys %dict;
open $fh, '<', 'Orthogroups_3.csv' or die $!;
while (<$fh>) {
s/($re)/$dict{$1}/g;
print;
}
It gave me output what I wanted. But when I wanted to load it in R:
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.