LinuxQuestions.org
Old 10-31-2012, 08:03 AM   #1
Adzrules
LQ Newbie
 
Registered: Oct 2012
Posts: 14

Rep: Reputation: Disabled
Search text file for records in another text file and pull extra data over to new


Hey guys,

I'm a massive newbie to the whole Linux scene but am enjoying the flexibility offered so far.

However, I have searched for a while and cannot work out how to tackle this problem.

I basically have two files: file_1.txt and file_2.txt

file_1.txt contains a list of genes, with each line containing a new gene e.g.:
CDS.002
CDS.005
CDS.035
etc.

file_2.txt contains the original gene data in the following format:
>CDS.001
MSENGNKNIAIVEAFSETDKKTGEVVTLVPNTNNTVQPVALMRLGLFVPTLKSTSRGR
>CDS.002
MALTLAGLEIEKTSGYWRAKGFKQPGILERLE
>CDS.003
MKAIPFALLFLSSIVVADTTVYQCEMSVADVKNGALTDVIKAPYGAMVVDSGDQFYVVRDDRVLSSPYLTNRNGKLTGVGEDHFVYNKYKGFYGVHASQQSYLFDDCKEVG
etc.

Basically I want to search file_2.txt for all of the genes in file_1.txt. If found, I then want to put both the gene name (e.g. >CDS.002) AND the gene sequence (e.g. MALTLAGLEIEKTSGYWRAKGFKQPGILERLE) from file_2.txt into a new text file. I want all of the results in the final text file.

So, I'd end up with a text file similar to this:
>CDS.002
MSENGNKIAIVEAFSETDKKTGEVVTLVPNTNNTVQPVALMRLGLFVPTLKSTSRGR
>CDS.005
MALTLAGLEIEKTSGYWRAKGFKQPGILERLEYVVRDDRVLSSPYLTNRNGKLTGVGEDHFVYN
>CDS.035
MKAIPFALLFLSSIVVADTTVYQCEMSVADVKNGALTDVIKAPYGAMVVDSGDQFKYKGFYGVHASQQSYLFDDCKEVGFSETDKKTGEVVTLVPAGLEIEKTSGYWRAKGFKQPGILERLEYVVRDD
etc.

Now I have no idea how complicated this is for someone that knows their stuff! But for me, it's a little over my head at this stage. I'm very new to using the Terminal, and am only just starting out learning Perl.

Any assistance would be greeted with a thousand thanks

Thanks for any help!
Adzrules

Last edited by Adzrules; 10-31-2012 at 08:05 AM.
 
Old 10-31-2012, 08:23 AM   #2
schneidz
Senior Member
 
Registered: May 2005
Location: boston, usa
Distribution: fc-15/ fc-20-live-usb/ aix
Posts: 4,197

Rep: Reputation: 642
this probably can be done in perl but i dont know how. you can probably bash it out with a while loop and grep (check the man page for the -A option).
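
A rough sketch of what the while-loop-plus-grep idea could look like, not schneidz's actual code. It assumes each record in file_2.txt is exactly two lines (a ">" header and one sequence line), and the function name and the shortened sample data are made up for illustration:

```shell
# extract_with_loop is a hypothetical helper name, not from the thread.
# $1 = file of gene names (one per line), $2 = FASTA-style file.
extract_with_loop() {
    while IFS= read -r gene; do
        [ -n "$gene" ] || continue        # skip blank lines
        # -F: fixed string, so "." is literal; -A 1: also print the next line.
        grep -F -A 1 -- ">$gene" "$2"
    done < "$1"
}

# Made-up sample data modelled on the first post (sequences shortened).
tmp=$(mktemp -d)
printf '%s\n' 'CDS.002' 'CDS.003' > "$tmp/file_1.txt"
printf '%s\n' '>CDS.001' 'MSENGNKNIAIV' '>CDS.002' 'MALTLAGLEIEK' \
              '>CDS.003' 'MKAIPFALLFLS' > "$tmp/file_2.txt"

extract_with_loop "$tmp/file_1.txt" "$tmp/file_2.txt" > "$tmp/result.txt"
cat "$tmp/result.txt"
```

Note that this is a substring match, so a name such as CDS.01 would also pick up a header like >CDS.011 if one exists.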

Last edited by schneidz; 10-31-2012 at 08:26 AM. Reason: changed for loop to while loop
 
Old 10-31-2012, 08:29 AM   #3
millgates
Member
 
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 651

Rep: Reputation: 269
Quote:
Originally Posted by schneidz View Post
this probably can be done in perl but i dont know how. you can probably bash it out with a while loop and grep (check the man page for the -A option).
Nice! I didn't even know of the grep -A option...
To add my 2 cents, with the grep -f option, you don't even need the loop
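
For illustration, a minimal sketch of the loop-free variant, assuming two-line records and a gene list without blank lines or trailing spaces (an empty pattern line would match every line). The function name and sample data are invented here:

```shell
# extract_with_patterns is a hypothetical helper name, not from the thread.
# $1 = file of gene names (one per line), $2 = FASTA-style file.
extract_with_patterns() {
    # -f FILE: take the search strings from FILE (the gene list goes here,
    #          the file being searched comes last).
    # -F: fixed strings; -A 1: print the line after each match as well.
    # grep prints "--" between non-adjacent groups of context lines,
    # so the second grep drops those separator lines.
    grep -F -A 1 -f "$1" "$2" | grep -v -x -e '--'
}

# Made-up sample data (sequences shortened).
tmp=$(mktemp -d)
printf '%s\n' 'CDS.001' 'CDS.003' > "$tmp/file_1.txt"
printf '%s\n' '>CDS.001' 'MSENGNKNIAIV' '>CDS.002' 'MALTLAGLEIEK' \
              '>CDS.003' 'MKAIPFALLFLS' > "$tmp/file_2.txt"

extract_with_patterns "$tmp/file_1.txt" "$tmp/file_2.txt"
```

The order of the two file names matters: the file after -f supplies the patterns, and the last argument is the file that gets searched.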
 
Old 10-31-2012, 01:38 PM   #4
Adzrules
LQ Newbie
 
Registered: Oct 2012
Posts: 14

Original Poster
Rep: Reputation: Disabled
Thanks for the replies guys. No problem if you don't want to, but could you give me an example of how fgrep would work in this instance? Having trouble finding the relevant info online in a format that I can understand!
 
Old 10-31-2012, 01:43 PM   #5
schneidz
Senior Member
 
Registered: May 2005
Location: boston, usa
Distribution: fc-15/ fc-20-live-usb/ aix
Posts: 4,197

Rep: Reputation: 642
^ it is hard to correct your mistake since you dont post what you tried and what error you are getting.
 
Old 10-31-2012, 01:53 PM   #6
Adzrules
LQ Newbie
 
Registered: Oct 2012
Posts: 14

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by schneidz View Post
^ it is hard to correct your mistake since you dont post what you tried and what error you are getting.
My apologies. I'm using this website here: http://kb.iu.edu/data/afiy.html to help me get to grips with things.

So I've just used the following: for i in `cat file_1.txt`; do grep $i file_2.txt; done

Which shows the correct genes, but how can I modify that to make it pull the gene sequence that is present on the lines below, but not the next gene name???? Any help is MUCH appreciated!
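
The smallest change to that loop would be adding -A 1 to grep, which prints the one line after each match. If a sequence could ever wrap over several lines, a fixed line count is not enough, so here is a hedged awk sketch that prints everything from a wanted header up to (but not including) the next ">" line. The function name and sample data are made up, and it assumes the gene list is not empty:

```shell
# extract_records is a hypothetical helper name, not from the thread.
# $1 = file of gene names (one per line), $2 = FASTA-style file.
extract_records() {
    awk '
        # First file: remember each gene name, with ">" prepended.
        NR == FNR { if (NF) want[">" $1] = 1; next }
        # Second file: a header line switches printing on or off.
        /^>/      { keep = ($1 in want) }
        # Print the header and every following line until the next header.
        keep
    ' "$1" "$2"
}

# Made-up sample data; CDS.002 deliberately spans two sequence lines.
tmp=$(mktemp -d)
printf '%s\n' 'CDS.002' > "$tmp/file_1.txt"
printf '%s\n' '>CDS.001' 'MSENGNKNIAIV' \
              '>CDS.002' 'MALTLAGLEIEK' 'TSGYWRAKGFKQ' \
              '>CDS.003' 'MKAIPFALLFLS' > "$tmp/file_2.txt"

extract_records "$tmp/file_1.txt" "$tmp/file_2.txt"
```

Because awk compares the first field of each line, trailing spaces in the gene list do not get in the way, and the header has to match the name exactly rather than as a substring.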

Last edited by Adzrules; 10-31-2012 at 02:08 PM. Reason: Changed code used
 
Old 10-31-2012, 02:08 PM   #7
schneidz
Senior Member
 
Registered: May 2005
Location: boston, usa
Distribution: fc-15/ fc-20-live-usb/ aix
Posts: 4,197

Rep: Reputation: 642
Quote:
Originally Posted by Adzrules View Post
My apologies. I'm using this website here: http://kb.iu.edu/data/afiy.html to help me get to terms with things.

So far, I figure that if I use: fgrep -f file_2.txt file_1.txt
Then that should search file_1 for all occurrences found in file_2? No? I would then look at expanding that to pipe the results into some kind of function to grab the gene sequences and extra information from text_1.txt as well, and then > new_file.txt for the final results.
combining my suggestion and millgates's, this should work:
Code:
fgrep -A 1 -f file_2.txt file_1.txt > /whatever/floats/your/boat.txt

Last edited by schneidz; 10-31-2012 at 02:09 PM.
 
Old 10-31-2012, 02:21 PM   #8
Adzrules
LQ Newbie
 
Registered: Oct 2012
Posts: 14

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by schneidz View Post
combining mine and millgate's suggestions, this should work:
Code:
fgrep -A 1 -f file_2.txt file_1.txt > /whatever/floats/your/boat.txt
Thanks, but that just seems to write nothing to a text file. The file gets created and no errors are reported but the text file output is blank!
 
Old 10-31-2012, 02:24 PM   #9
schneidz
Senior Member
 
Registered: May 2005
Location: boston, usa
Distribution: fc-15/ fc-20-live-usb/ aix
Posts: 4,197

Rep: Reputation: 642
Quote:
Originally Posted by Adzrules View Post
Thanks, but that just seems to write nothing to a text file. The file gets created and no errors are reported but the text file output is blank!
can you post this diagnostic info:
Code:
head file_2.txt file_1.txt
grep -A 1 -f file_2.txt file_1.txt
 
Old 10-31-2012, 05:53 PM   #10
Adzrules
LQ Newbie
 
Registered: Oct 2012
Posts: 14

Original Poster
Rep: Reputation: Disabled
Sure, here's what I get:


Code:
dave@dave-VirtualBox ~/Downloads/test $ head file_2.txt file_1.txt
==> file_2.txt <==
>CDS.001
MSENGNKNIAIVEAFSETDKKTGEVVTLVPNTNNTVQPVALMRLGLFVPTLKSTSRGRKGQMVSMDASAELKQLSLAKAEGYEDIRISGLRLDMDNDFKTWVGIIHAFAKHKVVGDTVTLPFVEFVRLCGIPTARSSAKLRKRLDSSLSRIATNTISFRSKGSDEFYVTHLVQTAKYSVKHDTVELKADPKIFELYQFDKKVLLQLRAINELSRKESAQALYTFIESLPPDPAPISLARLRARLNLTSRTITQNATVRKAMEQLREIGYLDYTEVKRGNSVYFVIHYRRPKLRQAQISTKIDNDETEYSLPDENQDDIIDVVPDEKEGKMVMLSKEELALLEELRKAKTRK
>CDS.002
MALTLAGLEIEKTSGYWRAKGFKQPGILERLEREDGYIVHQRREWRMYNPETGKLTTKAGTLWGLLKKIH
>CDS.003
MKAIPFALLFLSSIVVADTTVYQCEMSVADVKNGALTDVIKAPYGAMVVDSGDQFYVVRDDRVLSSPYLTNRNGKLTGVGEDHFVYNKYKGFYGVHASQQSYLFDDCKEVG
>CDS.004
MKIFIEYLLLIVSIAFVIDCIFTGVIRKVFSPVHDVVINALAIVLVFNSAFDVIKEVAA
>CDS.005
MRVLVRIVTSTVYDVFPVFMVKADGLNDEETDALIQRILVEYTGHDADSVMVDDDGVCWHNGNCWYVEETQQISDEDAEHLERILSISTFE

==> file_1.txt <==
CDS.015  	HCM2.0015c  
CDS.117  	HCM2.0122c  
CDS.096  	HCM2.0104c  
CDS.060  	HCM2.0069c  
CDS.068  	HCM2.0078  
CDS.061  	HCM2.0070c  
CDS.027  	HCM2.0035c  
CDS.031  	HCM2.0041c  
CDS.030  	HCM2.0040c  
CDS.116  	HCM2.0121c  
dave@dave-VirtualBox ~/Downloads/test $ grep -A 1 -f file_2.txt file_1.txt
dave@dave-VirtualBox ~/Downloads/test $
 
Old 10-31-2012, 06:11 PM   #11
schneidz
Senior Member
 
Registered: May 2005
Location: boston, usa
Distribution: fc-15/ fc-20-live-usb/ aix
Posts: 4,197

Rep: Reputation: 642
in your first post you explain that file_1.txt contains a list of stuff like:
Quote:
Originally Posted by Adzrules View Post
CDS.002
CDS.005
CDS.035
etc.
but it really looks like:
Quote:
Originally Posted by Adzrules View Post
Code:
...==> file_1.txt <==
CDS.015  	HCM2.0015c  
CDS.117  	HCM2.0122c  
CDS.096  	HCM2.0104c  
CDS.060  	HCM2.0069c  
CDS.068  	HCM2.0078  
CDS.061  	HCM2.0070c  
CDS.027  	HCM2.0035c  
CDS.031  	HCM2.0041c  
CDS.030  	HCM2.0040c  
CDS.116  	HCM2.0121c
so according to the above the string "CDS.015 HCM2.0015c" doesnt exist in file_2.txt.
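
If the gene list really did have two columns like that, one possible way around it (a sketch, not something proposed in the thread) is to keep only the first column before handing it to grep. The helper name and the placeholder sequences are made up; the two mapping lines are copied from the listing above:

```shell
# first_column_ids is a hypothetical helper name, not from the thread.
# Prints only the first whitespace-separated field of each non-blank line.
first_column_ids() {
    awk 'NF { print $1 }' "$1"
}

tmp=$(mktemp -d)
# Two-column lines as in the listing above (tab plus trailing spaces).
printf 'CDS.015  \tHCM2.0015c  \nCDS.117  \tHCM2.0122c  \n' > "$tmp/mapping.txt"
# Placeholder records; the sequences are invented.
printf '%s\n' '>CDS.014' 'MSENGNKNIAIV' '>CDS.015' 'MALTLAGLEIEK' \
              '>CDS.016' 'MKAIPFALLFLS' > "$tmp/file_2.txt"

first_column_ids "$tmp/mapping.txt" > "$tmp/ids.txt"
grep -F -A 1 -f "$tmp/ids.txt" "$tmp/file_2.txt"
```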
 
Old 10-31-2012, 06:34 PM   #12
Adzrules
LQ Newbie
 
Registered: Oct 2012
Posts: 14

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by schneidz View Post
in your first post you explain that file_1.txt contains a list of stuff like: [...] but it really looks like: [...] so according to the above the string "CDS.015 HCM2.0015c" doesnt exist in file_2.txt.
Apologies, I selected the wrong file (I renamed them to file_1.txt etc. for the benefit of your viewing). This is what it really is:

Code:
dave@dave-VirtualBox ~/Downloads/test $ head file_2.txt file_1.txt
==> file_2.txt <==
>CDS.001
MSENGNKNIAIVEAFSETDKKTGEVVTLVPNTNNTVQPVALMRLGLFVPTLKSTSRGRKGQMVSMDASAELKQLSLAKAEGYEDIRISGLRLDMDNDFKTWVGIIHAFAKHKVVGDTVTLPFVEFVRLCGIPTARSSAKLRKRLDSSLSRIATNTISFRSKGSDEFYVTHLVQTAKYSVKHDTVELKADPKIFELYQFDKKVLLQLRAINELSRKESAQALYTFIESLPPDPAPISLARLRARLNLTSRTITQNATVRKAMEQLREIGYLDYTEVKRGNSVYFVIHYRRPKLRQAQISTKIDNDETEYSLPDENQDDIIDVVPDEKEGKMVMLSKEELALLEELRKAKTRK
>CDS.002
MALTLAGLEIEKTSGYWRAKGFKQPGILERLEREDGYIVHQRREWRMYNPETGKLTTKAGTLWGLLKKIH
>CDS.003
MKAIPFALLFLSSIVVADTTVYQCEMSVADVKNGALTDVIKAPYGAMVVDSGDQFYVVRDDRVLSSPYLTNRNGKLTGVGEDHFVYNKYKGFYGVHASQQSYLFDDCKEVG
>CDS.004
MKIFIEYLLLIVSIAFVIDCIFTGVIRKVFSPVHDVVINALAIVLVFNSAFDVIKEVAA
>CDS.005
MRVLVRIVTSTVYDVFPVFMVKADGLNDEETDALIQRILVEYTGHDADSVMVDDDGVCWHNGNCWYVEETQQISDEDAEHLERILSISTFE

==> file_1.txt <==
HCM2.0015c  
HCM2.0122c  
HCM2.0104c  
HCM2.0069c  
HCM2.0078  
HCM2.0070c  
HCM2.0035c  
HCM2.0041c  
HCM2.0040c  
HCM2.0121c  
dave@dave-VirtualBox ~/Downloads/test $ grep -A 1 -f file_2.txt file_1.txt
dave@dave-VirtualBox ~/Downloads/test $
 
Old 10-31-2012, 06:41 PM   #13
schneidz
Senior Member
 
Registered: May 2005
Location: boston, usa
Distribution: fc-15/ fc-20-live-usb/ aix
Posts: 4,197

Rep: Reputation: 642
^ is there still a typo ?

i dont see how for example the string "HCM2.0015c" exists in file_2.txt ?
 
Old 10-31-2012, 07:22 PM   #14
cbtshare
Member
 
Registered: Jul 2009
Posts: 569

Rep: Reputation: 42
I just created a mock similar test and this works:

Quote:
fgrep -f file2.txt file1.txt
 
Old 10-31-2012, 08:11 PM   #15
Adzrules
LQ Newbie
 
Registered: Oct 2012
Posts: 14

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by schneidz View Post
^ is there still a typo ?

i dont see how for example the string "HCM2.0015c" exists in file_2.txt ?
Oh for goodness sake! Again, I'm being retarded, wrong file AGAIN! Sorry.

Code:
dave@dave-VirtualBox ~/Downloads/test $ head file_2.txt file_1.txt
==> file_2.txt <==
>CDS.001
MSENGNKNIAIVEAFSETDKKTGEVVTLVPNTNNTVQPVALMRLGLFVPTLKSTSRGRKGQMVSMDASAELKQLSLAKAEGYEDIRISGLRLDMDNDFKTWVGIIHAFAKHKVVGDTVTLPFVEFVRLCGIPTARSSAKLRKRLDSSLSRIATNTISFRSKGSDEFYVTHLVQTAKYSVKHDTVELKADPKIFELYQFDKKVLLQLRAINELSRKESAQALYTFIESLPPDPAPISLARLRARLNLTSRTITQNATVRKAMEQLREIGYLDYTEVKRGNSVYFVIHYRRPKLRQAQISTKIDNDETEYSLPDENQDDIIDVVPDEKEGKMVMLSKEELALLEELRKAKTRK
>CDS.002
MALTLAGLEIEKTSGYWRAKGFKQPGILERLEREDGYIVHQRREWRMYNPETGKLTTKAGTLWGLLKKIH
>CDS.003
MKAIPFALLFLSSIVVADTTVYQCEMSVADVKNGALTDVIKAPYGAMVVDSGDQFYVVRDDRVLSSPYLTNRNGKLTGVGEDHFVYNKYKGFYGVHASQQSYLFDDCKEVG
>CDS.004
MKIFIEYLLLIVSIAFVIDCIFTGVIRKVFSPVHDVVINALAIVLVFNSAFDVIKEVAA
>CDS.005
MRVLVRIVTSTVYDVFPVFMVKADGLNDEETDALIQRILVEYTGHDADSVMVDDDGVCWHNGNCWYVEETQQISDEDAEHLERILSISTFE

==> file_1.txt <==
CDS.015  
CDS.117  
CDS.096  
CDS.060  
CDS.068  
CDS.061  
CDS.027  
CDS.031  
CDS.030  
CDS.116  
dave@dave-VirtualBox ~/Downloads/test $ 
dave@dave-VirtualBox ~/Downloads/test $ grep -A 1 -f file_2.txt file_1.txt
dave@dave-VirtualBox ~/Downloads/test $
That's what it actually looks like. Definitely this time!

Code:
fgrep -f file2.txt file1.txt > test1.txt
That doesn't seem to work for me. Again, it just creates a blank text file!
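
The thread stops here without a confirmed fix. Going only by the listings above, two things look like plausible causes, though neither is verified. First, the commands pass file_2.txt to -f, so grep takes the headers and sequences as patterns and searches the gene list for them, which is the reverse of what is wanted. Second, the file_1.txt lines appear to end in spaces, which would stop a fixed-string match even with the files in the right order (the spaces could also just be a paste artifact). A sketch that addresses both, using a made-up helper name and invented sample data:

```shell
# clean_ids is a hypothetical helper name, not from the thread.
# Strips carriage returns and trailing whitespace, drops blank lines,
# and prepends ">" so each ID matches a whole header line.
clean_ids() {
    tr -d '\r' < "$1" | sed -e 's/[[:space:]]*$//' -e '/^$/d' -e 's/^/>/'
}

# Made-up sample data; note the trailing spaces in the gene list.
tmp=$(mktemp -d)
printf '%s\n' 'CDS.002  ' 'CDS.004  ' > "$tmp/file_1.txt"
printf '%s\n' '>CDS.001' 'MSENGNKNIAIV' '>CDS.002' 'MALTLAGLEIEK' \
              '>CDS.003' 'MKAIPFALLFLS' '>CDS.004' 'MKIFIEYLLLIV' > "$tmp/file_2.txt"

# Make invisible characters visible: sed marks each line end with "$".
sed -n 'l' "$tmp/file_1.txt"

clean_ids "$tmp/file_1.txt" > "$tmp/ids.txt"
# Gene list after -f, FASTA file last; -x requires the whole line to match.
grep -F -x -A 1 -f "$tmp/ids.txt" "$tmp/file_2.txt" \
    | grep -v -x -e '--' > "$tmp/result.txt"
cat "$tmp/result.txt"
```

It is also worth checking the file names: the last command uses file2.txt and file1.txt, while the listings show file_2.txt and file_1.txt.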
 
All times are GMT -5.