LinuxQuestions.org


Adzrules 10-31-2012 07:03 AM

Search text file for records in another text file and pull extra data over to new
 
Hey guys,

I'm a massive newbie to the whole Linux scene but am enjoying the flexibility offered so far.

However, I have searched for a while and cannot work out how to tackle this problem.

I basically have two files: file_1.txt and file_2.txt

file_1.txt contains a list of genes, with each line containing a new gene e.g.:
CDS.002
CDS.005
CDS.035
etc.

file_2.txt contains the original gene data in the following format:
>CDS.001
MSENGNKNIAIVEAFSETDKKTGEVVTLVPNTNNTVQPVALMRLGLFVPTLKSTSRGR
>CDS.002
MALTLAGLEIEKTSGYWRAKGFKQPGILERLE
>CDS.003
MKAIPFALLFLSSIVVADTTVYQCEMSVADVKNGALTDVIKAPYGAMVVDSGDQFYVVRDDRVLSSPYLTNRNGKLTGVGEDHFVYNKYKGFYGVHASQQSYLFDDCKEVG
etc.

Basically I want to search file_2.txt for all of the genes in file_1.txt. If found, I then want to put both the gene name (e.g. >CDS.002) AND the gene sequence (e.g. MALTLAGLEIEKTSGYWRAKGFKQPGILERLE) from file_2.txt into a new text file. I want all of the results in the final text file.

So, I'd end up with a text file similar to this:
>CDS.002
MSENGNKIAIVEAFSETDKKTGEVVTLVPNTNNTVQPVALMRLGLFVPTLKSTSRGR
>CDS.005
MALTLAGLEIEKTSGYWRAKGFKQPGILERLEYVVRDDRVLSSPYLTNRNGKLTGVGEDHFVYN
>CDS.035
MKAIPFALLFLSSIVVADTTVYQCEMSVADVKNGALTDVIKAPYGAMVVDSGDQFKYKGFYGVHASQQSYLFDDCKEVGFSETDKKTGEVVTLVPAGLEIEKTSGYWRAKGFKQPGILERLEYVVRDD
etc.

Now I have no idea how complicated this is for someone that knows their stuff! But for me, it's a little over my head at this stage. I'm very new to using the Terminal, and am only just starting out learning Perl.

Any assistance would be greeted with a thousand thanks :)

Thanks for any help!
Adzrules

schneidz 10-31-2012 07:23 AM

this can probably be done in perl but i don't know how. you can probably bash it out with a while loop and grep (check the man page for the -A option).

millgates 10-31-2012 07:29 AM

Quote:

Originally Posted by schneidz (Post 4818880)
this can probably be done in perl but i don't know how. you can probably bash it out with a while loop and grep (check the man page for the -A option).

Nice! I didn't even know of the grep -A option...
To add my 2 cents, with the grep -f option, you don't even need the loop
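
A minimal sketch of the grep -f idea, not a tested answer for the real data. It assumes every sequence sits on exactly one line directly below its ">" header, and the tiny sample files and the output name matches.txt are made up for illustration. Note the order of the arguments: the file given after -f supplies the patterns (the gene list), and the last argument is the file that gets searched.

```shell
# Hypothetical sample data in a scratch directory (contents are illustrative only).
workdir=$(mktemp -d)
printf '%s\n' 'CDS.002' 'CDS.004' > "$workdir/file_1.txt"
printf '%s\n' '>CDS.001' 'MSENGNK' '>CDS.002' 'MALTLAG' \
              '>CDS.003' 'MKAIPFA' '>CDS.004' 'MKIFIEY' > "$workdir/file_2.txt"

# -F    treat each pattern as a fixed string, not a regular expression
# -f    read the patterns, one per line, from file_1.txt
# -A 1  print one line of trailing context after each match (the sequence)
grep -F -A 1 -f "$workdir/file_1.txt" "$workdir/file_2.txt" > "$workdir/matches.txt"

cat "$workdir/matches.txt"
```

When the matched records are not next to each other, grep puts a line containing only -- between the groups; piping the result through grep -v '^--$' drops those lines.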

Adzrules 10-31-2012 12:38 PM

Thanks for the replies guys. No problem if you don't want to, but could you give me an example of how fgrep would work in this instance? Having trouble finding the relevant info online in a format that I can understand!

schneidz 10-31-2012 12:43 PM

^ it is hard to correct your mistake since you don't post what you tried and what error you are getting.

Adzrules 10-31-2012 12:53 PM

Quote:

Originally Posted by schneidz (Post 4819090)
^ it is hard to correct your mistake since you don't post what you tried and what error you are getting.

My apologies. I'm using this website here: http://kb.iu.edu/data/afiy.html to help me get to grips with things.

So I've just used the following: for i in `cat file_1.txt`; do grep $i file_2.txt; done

That shows the correct gene names, but how can I modify it to also pull the gene sequence from the line below each name, without pulling in the next gene name? Any help is much appreciated!
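
One way the loop could be extended, sketched under the assumption that each sequence occupies exactly one line directly below its header. The sample data and the output name new_file.txt are hypothetical. The loop reads the list with while read instead of `cat`, quotes the variable, and asks grep for one line of trailing context.

```shell
# Hypothetical sample data in a scratch directory (contents are illustrative only).
workdir=$(mktemp -d)
printf '%s\n' 'CDS.002' 'CDS.004' > "$workdir/file_1.txt"
printf '%s\n' '>CDS.001' 'MSENGNK' '>CDS.002' 'MALTLAG' \
              '>CDS.003' 'MKAIPFA' '>CDS.004' 'MKIFIEY' > "$workdir/file_2.txt"

# Read one gene name per line; read -r also trims leading and trailing blanks.
while read -r gene; do
    [ -n "$gene" ] || continue      # skip empty lines in the list
    # -x -F  match the whole header line literally, so CDS.01 cannot hit CDS.015
    # -A 1   also print the line after the header, i.e. the sequence
    grep -x -F -A 1 -- ">$gene" "$workdir/file_2.txt"
done < "$workdir/file_1.txt" > "$workdir/new_file.txt"

cat "$workdir/new_file.txt"
```

Because -x demands an exact match of the whole line, this only works if the header lines in file_2.txt carry nothing after the gene name.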

schneidz 10-31-2012 01:08 PM

Quote:

Originally Posted by Adzrules (Post 4819100)
My apologies. I'm using this website here: http://kb.iu.edu/data/afiy.html to help me get to grips with things.

So far, I figure that if I use: fgrep -f file_2.txt file_1.txt
Then that should search file_1 for all occurrences found in file_2? No? I would then look at expanding that to pipe the results into some kind of function to grab the gene sequences and extra information from text_1.txt as well, and then > new_file.txt for the final results.

combining my suggestion and millgates', this should work:
Code:

fgrep -A 1 -f file_2.txt file_1.txt > /whatever/floats/your/boat.txt

Adzrules 10-31-2012 01:21 PM

Quote:

Originally Posted by schneidz (Post 4819109)
combining my suggestion and millgates', this should work:
Code:

fgrep -A 1 -f file_2.txt file_1.txt > /whatever/floats/your/boat.txt

Thanks, but that just seems to write nothing to a text file. The file gets created and no errors are reported but the text file output is blank!

schneidz 10-31-2012 01:24 PM

Quote:

Originally Posted by Adzrules (Post 4819115)
Thanks, but that just seems to write nothing to a text file. The file gets created and no errors are reported but the text file output is blank!

can you post this diagnostic info:
Code:

head file_2.txt file_1.txt
grep -A 1 -f file_2.txt file_1.txt


Adzrules 10-31-2012 04:53 PM

Sure, here's what I get:


Code:

dave@dave-VirtualBox ~/Downloads/test $ head file_2.txt file_1.txt
==> file_2.txt <==
>CDS.001
MSENGNKNIAIVEAFSETDKKTGEVVTLVPNTNNTVQPVALMRLGLFVPTLKSTSRGRKGQMVSMDASAELKQLSLAKAEGYEDIRISGLRLDMDNDFKTWVGIIHAFAKHKVVGDTVTLPFVEFVRLCGIPTARSSAKLRKRLDSSLSRIATNTISFRSKGSDEFYVTHLVQTAKYSVKHDTVELKADPKIFELYQFDKKVLLQLRAINELSRKESAQALYTFIESLPPDPAPISLARLRARLNLTSRTITQNATVRKAMEQLREIGYLDYTEVKRGNSVYFVIHYRRPKLRQAQISTKIDNDETEYSLPDENQDDIIDVVPDEKEGKMVMLSKEELALLEELRKAKTRK
>CDS.002
MALTLAGLEIEKTSGYWRAKGFKQPGILERLEREDGYIVHQRREWRMYNPETGKLTTKAGTLWGLLKKIH
>CDS.003
MKAIPFALLFLSSIVVADTTVYQCEMSVADVKNGALTDVIKAPYGAMVVDSGDQFYVVRDDRVLSSPYLTNRNGKLTGVGEDHFVYNKYKGFYGVHASQQSYLFDDCKEVG
>CDS.004
MKIFIEYLLLIVSIAFVIDCIFTGVIRKVFSPVHDVVINALAIVLVFNSAFDVIKEVAA
>CDS.005
MRVLVRIVTSTVYDVFPVFMVKADGLNDEETDALIQRILVEYTGHDADSVMVDDDGVCWHNGNCWYVEETQQISDEDAEHLERILSISTFE

==> file_1.txt <==
CDS.015          HCM2.0015c 
CDS.117          HCM2.0122c 
CDS.096          HCM2.0104c 
CDS.060          HCM2.0069c 
CDS.068          HCM2.0078 
CDS.061          HCM2.0070c 
CDS.027          HCM2.0035c 
CDS.031          HCM2.0041c 
CDS.030          HCM2.0040c 
CDS.116          HCM2.0121c 
dave@dave-VirtualBox ~/Downloads/test $ grep -A 1 -f file_2.txt file_1.txt
dave@dave-VirtualBox ~/Downloads/test $


schneidz 10-31-2012 05:11 PM

in your first post you explain that file_1.txt contains a list of stuff like:
Quote:

Originally Posted by Adzrules (Post 4818872)
CDS.002
CDS.005
CDS.035
etc.

but it really looks like:
Quote:

Originally Posted by Adzrules (Post 4819245)
Code:

...==> file_1.txt <==
CDS.015          HCM2.0015c 
CDS.117          HCM2.0122c 
CDS.096          HCM2.0104c 
CDS.060          HCM2.0069c 
CDS.068          HCM2.0078 
CDS.061          HCM2.0070c 
CDS.027          HCM2.0035c 
CDS.031          HCM2.0041c 
CDS.030          HCM2.0040c 
CDS.116          HCM2.0121c


so according to the above, the string "CDS.015 HCM2.0015c" doesn't exist in file_2.txt.
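
With grep -f every whole line of the pattern file is one pattern, so a second column has to go before the list can match anything. A small sketch, assuming the columns are separated by spaces or tabs; the file names two_columns.txt and ids.txt are made up for illustration.

```shell
# Hypothetical two-column list in a scratch directory (contents are illustrative only).
workdir=$(mktemp -d)
printf '%s\n' 'CDS.015          HCM2.0015c ' \
              'CDS.117          HCM2.0122c ' > "$workdir/two_columns.txt"

# awk splits each line on runs of whitespace; $1 is the first column.
# The NF condition skips lines that contain no fields at all.
awk 'NF { print $1 }' "$workdir/two_columns.txt" > "$workdir/ids.txt"

cat "$workdir/ids.txt"
```

The resulting ids.txt holds one bare gene name per line, with no trailing blanks, which is the shape grep -f expects.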

Adzrules 10-31-2012 05:34 PM

Quote:

Originally Posted by schneidz (Post 4819253)
in your first post you explain that file_1.txt contains a list of stuff like: but it really looks like: so according to the above, the string "CDS.015 HCM2.0015c" doesn't exist in file_2.txt.

Apologies, I selected the wrong file (I renamed them to file_1.txt etc. for the benefit of your viewing). This is what it really is:

Code:

dave@dave-VirtualBox ~/Downloads/test $ head file_2.txt file_1.txt
==> file_2.txt <==
>CDS.001
MSENGNKNIAIVEAFSETDKKTGEVVTLVPNTNNTVQPVALMRLGLFVPTLKSTSRGRKGQMVSMDASAELKQLSLAKAEGYEDIRISGLRLDMDNDFKTWVGIIHAFAKHKVVGDTVTLPFVEFVRLCGIPTARSSAKLRKRLDSSLSRIATNTISFRSKGSDEFYVTHLVQTAKYSVKHDTVELKADPKIFELYQFDKKVLLQLRAINELSRKESAQALYTFIESLPPDPAPISLARLRARLNLTSRTITQNATVRKAMEQLREIGYLDYTEVKRGNSVYFVIHYRRPKLRQAQISTKIDNDETEYSLPDENQDDIIDVVPDEKEGKMVMLSKEELALLEELRKAKTRK
>CDS.002
MALTLAGLEIEKTSGYWRAKGFKQPGILERLEREDGYIVHQRREWRMYNPETGKLTTKAGTLWGLLKKIH
>CDS.003
MKAIPFALLFLSSIVVADTTVYQCEMSVADVKNGALTDVIKAPYGAMVVDSGDQFYVVRDDRVLSSPYLTNRNGKLTGVGEDHFVYNKYKGFYGVHASQQSYLFDDCKEVG
>CDS.004
MKIFIEYLLLIVSIAFVIDCIFTGVIRKVFSPVHDVVINALAIVLVFNSAFDVIKEVAA
>CDS.005
MRVLVRIVTSTVYDVFPVFMVKADGLNDEETDALIQRILVEYTGHDADSVMVDDDGVCWHNGNCWYVEETQQISDEDAEHLERILSISTFE

==> file_1.txt <==
HCM2.0015c 
HCM2.0122c 
HCM2.0104c 
HCM2.0069c 
HCM2.0078 
HCM2.0070c 
HCM2.0035c 
HCM2.0041c 
HCM2.0040c 
HCM2.0121c 
dave@dave-VirtualBox ~/Downloads/test $ grep -A 1 -f file_2.txt file_1.txt
dave@dave-VirtualBox ~/Downloads/test $


schneidz 10-31-2012 05:41 PM

^ is there still a typo ?

i don't see how, for example, the string "HCM2.0015c" exists in file_2.txt?

cbtshare 10-31-2012 06:22 PM

I just created a mock similar test and this works:

Quote:

fgrep -f file2.txt file1.txt

Adzrules 10-31-2012 07:11 PM

Quote:

Originally Posted by schneidz (Post 4819263)
^ is there still a typo ?

i don't see how, for example, the string "HCM2.0015c" exists in file_2.txt?

Oh for goodness sake! Again, I'm being retarded, wrong file AGAIN! Sorry.

Code:

dave@dave-VirtualBox ~/Downloads/test $ head file_2.txt file_1.txt
==> file_2.txt <==
>CDS.001
MSENGNKNIAIVEAFSETDKKTGEVVTLVPNTNNTVQPVALMRLGLFVPTLKSTSRGRKGQMVSMDASAELKQLSLAKAEGYEDIRISGLRLDMDNDFKTWVGIIHAFAKHKVVGDTVTLPFVEFVRLCGIPTARSSAKLRKRLDSSLSRIATNTISFRSKGSDEFYVTHLVQTAKYSVKHDTVELKADPKIFELYQFDKKVLLQLRAINELSRKESAQALYTFIESLPPDPAPISLARLRARLNLTSRTITQNATVRKAMEQLREIGYLDYTEVKRGNSVYFVIHYRRPKLRQAQISTKIDNDETEYSLPDENQDDIIDVVPDEKEGKMVMLSKEELALLEELRKAKTRK
>CDS.002
MALTLAGLEIEKTSGYWRAKGFKQPGILERLEREDGYIVHQRREWRMYNPETGKLTTKAGTLWGLLKKIH
>CDS.003
MKAIPFALLFLSSIVVADTTVYQCEMSVADVKNGALTDVIKAPYGAMVVDSGDQFYVVRDDRVLSSPYLTNRNGKLTGVGEDHFVYNKYKGFYGVHASQQSYLFDDCKEVG
>CDS.004
MKIFIEYLLLIVSIAFVIDCIFTGVIRKVFSPVHDVVINALAIVLVFNSAFDVIKEVAA
>CDS.005
MRVLVRIVTSTVYDVFPVFMVKADGLNDEETDALIQRILVEYTGHDADSVMVDDDGVCWHNGNCWYVEETQQISDEDAEHLERILSISTFE

==> file_1.txt <==
CDS.015 
CDS.117 
CDS.096 
CDS.060 
CDS.068 
CDS.061 
CDS.027 
CDS.031 
CDS.030 
CDS.116 
dave@dave-VirtualBox ~/Downloads/test $
dave@dave-VirtualBox ~/Downloads/test $ grep -A 1 -f file_2.txt file_1.txt
dave@dave-VirtualBox ~/Downloads/test $

That's what it actually looks like. Definitely this time!

Code:

fgrep -f file2.txt file1.txt > test1.txt
That doesn't seem to work for me. Again, it just creates a blank text file!
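
Judging only from the listings above, two things might explain the empty output. First, the commands pass file_2.txt to -f, so the ">" headers and the long sequences become the patterns and the short gene list is the file being searched; the two file names probably need to swap places. Second, each name in the file_1.txt listing appears to be followed by a space, and a pattern such as "CDS.015 " with a trailing blank cannot match the line ">CDS.015". The sketch below addresses both, assuming one-line sequences; the sample data and the names patterns.txt and new_file.txt are hypothetical.

```shell
# Hypothetical sample data in a scratch directory; note the trailing spaces in the list.
workdir=$(mktemp -d)
printf '%s\n' 'CDS.002 ' 'CDS.004 ' > "$workdir/file_1.txt"
printf '%s\n' '>CDS.001' 'MSENGNK' '>CDS.002' 'MALTLAG' \
              '>CDS.003' 'MKAIPFA' '>CDS.004' 'MKIFIEY' > "$workdir/file_2.txt"

# Build a clean pattern file: strip trailing whitespace, drop empty lines,
# and prefix ">" so each pattern equals a complete header line.
sed -e 's/[[:space:]]*$//' -e '/^$/d' -e 's/^/>/' \
    "$workdir/file_1.txt" > "$workdir/patterns.txt"

# Patterns come from the gene list; file_2.txt is the file being searched.
# -x -F matches whole lines literally, -A 1 adds the sequence line,
# and the second grep removes the "--" separators between groups.
grep -x -F -A 1 -f "$workdir/patterns.txt" "$workdir/file_2.txt" \
    | grep -v '^--$' > "$workdir/new_file.txt"

cat "$workdir/new_file.txt"
```

If some sequences turn out to span several lines, -A 1 would cut them short and a different approach would be needed.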

