LinuxQuestions.org - [SOLVED] Search text file for records in another text file and pull extra data over to new

Page 1 of 2

- - Search text file for records in another text file and pull extra data over to new (https://www.linuxquestions.org/questions/linux-newbie-8/search-text-file-for-records-in-another-text-file-and-pull-extra-data-over-to-new-4175434885/)

Adzrules

10-31-2012 07:03 AM

Search text file for records in another text file and pull extra data over to new

Hey guys,

I'm a massive newbie to the whole Linux scene but am enjoying the flexibility offered so far.

However, I have searched for a while and cannot work out how to tackle this problem.

I basically have two files: file_1.txt and file_2.txt

file_1.txt contains a list of genes, with each line containing a new gene e.g.:
CDS.002
CDS.005
CDS.035
etc.

file_2.txt contains the original gene data in the following format:
>CDS.001
MSENGNKNIAIVEAFSETDKKTGEVVTLVPNTNNTVQPVALMRLGLFVPTLKSTSRGR
>CDS.002
MALTLAGLEIEKTSGYWRAKGFKQPGILERLE
>CDS.003
MKAIPFALLFLSSIVVADTTVYQCEMSVADVKNGALTDVIKAPYGAMVVDSGDQFYVVRDDRVLSSPYLTNRNGKLTGVGEDHFVYNKYKGFYGVHASQQSYLFDDCKEVG
etc.

Basically I want to search file_2.txt for all of the genes in file_1.txt. If found, I then want to put both the gene name (e.g. >CDS.002) AND the gene sequence (e.g. MALTLAGLEIEKTSGYWRAKGFKQPGILERLE) from file_2.txt into a new text file. I want all of the results in the final text file.

So, I'd end up with a text file similar to this:
>CDS.002
MALTLAGLEIEKTSGYWRAKGFKQPGILERLE
>CDS.005
MALTLAGLEIEKTSGYWRAKGFKQPGILERLEYVVRDDRVLSSPYLTNRNGKLTGVGEDHFVYN
>CDS.035
MKAIPFALLFLSSIVVADTTVYQCEMSVADVKNGALTDVIKAPYGAMVVDSGDQFKYKGFYGVHASQQSYLFDDCKEVGFSETDKKTGEVVTLVPAGLEIEKTSGYWRAKGFKQPGILERLEYVVRDD
etc.

Now I have no idea how complicated this is for someone that knows their stuff! But for me, it's a little over my head at this stage. I'm very new to using the Terminal, and am only just starting out learning Perl.

Any assistance would be greeted with a thousand thanks :)

Thanks for any help!
Adzrules
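
For reference, the task described above could be sketched in awk. This is only a sketch under assumptions: each record in file_2.txt is a ">" header line followed by one or more sequence lines, the header holds nothing but the gene name, and file_1.txt lists one gene name per line. The sample data and the output name wanted.txt are made up for illustration.

```shell
# Work in a scratch directory so the sample files cannot clobber real data.
dir=$(mktemp -d) && cd "$dir" || exit 1

# Made-up sample data in the layout described in the post.
printf '%s\n' 'CDS.002' 'CDS.003' > file_1.txt
printf '%s\n' '>CDS.001' 'MSENGNK' '>CDS.002' 'MALTLAG' '>CDS.003' 'MKAIPFA' > file_2.txt

# First file (NR == FNR): remember every gene name listed in file_1.txt.
# Second file: on each ">" header decide whether to keep the record,
# then print the header and any sequence lines that follow it.
awk '
    NR == FNR { wanted[$1]; next }
    /^>/      { keep = (substr($0, 2) in wanted) }
    keep
' file_1.txt file_2.txt > wanted.txt

cat wanted.txt
```

Because awk takes the name from the first whitespace-separated field, stray spaces after a gene name in file_1.txt would not break the lookup.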

schneidz

10-31-2012 07:23 AM

this can probably be done in Perl, but i don't know how. you can probably bash it out with a while loop and grep (check the man page for the -A option).

millgates

10-31-2012 07:29 AM

Quote:

Originally Posted by schneidz (Post 4818880)

this can probably be done in Perl, but i don't know how. you can probably bash it out with a while loop and grep (check the man page for the -A option).

Nice! I didn't even know of the grep -A option...
To add my 2 cents, with the grep -f option, you don't even need the loop
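
A sketch of how those two options might combine, assuming one-line sequences, a grep that supports -A (GNU and BSD grep do; it is not in POSIX), and gene names listed without the leading ">". The pattern list goes to -f and the data file is the one being searched. The sample data and the names ids.txt and out.txt are hypothetical.

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1

printf '%s\n' 'CDS.002' 'CDS.004' > file_1.txt
printf '%s\n' '>CDS.001' 'AAAA' '>CDS.002' 'CCCC' '>CDS.003' 'GGGG' '>CDS.004' 'TTTT' > file_2.txt

# Turn each gene name into the exact header line it should match (">CDS.002").
sed 's/^/>/' file_1.txt > ids.txt

# -F: fixed strings, -x: whole-line match, -f: read patterns from ids.txt,
# -A 1: also print the line after each match. grep prints "--" between
# non-adjacent groups of matches, so the second grep drops those lines.
grep -F -x -A 1 -f ids.txt file_2.txt | grep -v '^--$' > out.txt

cat out.txt
```

The -x flag is a design choice: without it, a name such as CDS.01 would also match the header >CDS.015.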

Adzrules

10-31-2012 12:38 PM

Thanks for the replies guys. No problem if you don't want to, but could you give me an example of how fgrep would work in this instance? Having trouble finding the relevant info online in a format that I can understand!

schneidz

10-31-2012 12:43 PM

^ it is hard to correct your mistake since you haven't posted what you tried and what error you are getting.

Adzrules

10-31-2012 12:53 PM

Quote:

Originally Posted by schneidz (Post 4819090)

^ it is hard to correct your mistake since you haven't posted what you tried and what error you are getting.

My apologies. I'm using this website: http://kb.iu.edu/data/afiy.html to help me get to grips with things.

So I've just used the following: for i in `cat file_1.txt`; do grep $i file_2.txt; done

Which shows the correct genes, but how can I modify that so it also pulls the gene sequence on the line(s) below each match, but not the next gene name? Any help is MUCH appreciated!
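
One way that loop might be adapted, assuming each sequence sits on the single line after its header: read the names with while read instead of backticks, anchor the match to the whole header line, and ask grep for one line of trailing context with -A 1 (a GNU/BSD extension). The sample data and new_file.txt are illustrative only.

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1

printf '%s\n' 'CDS.01' 'CDS.03' > file_1.txt
printf '%s\n' '>CDS.01' 'AAAA' '>CDS.015' 'CCCC' '>CDS.03' 'GGGG' > file_2.txt

# For each name, print the header line that is exactly ">NAME" plus the one
# line after it. "read -r id _" keeps only the first word, so trailing
# spaces are ignored; -x stops "CDS.01" from also matching ">CDS.015".
while read -r id _; do
    [ -n "$id" ] || continue
    grep -F -x -A 1 -e ">$id" file_2.txt
done < file_1.txt > new_file.txt

cat new_file.txt
```

This runs grep once per gene name, which is fine for short lists but slower than a single grep -f pass on large ones.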

schneidz

10-31-2012 01:08 PM

Quote:

Originally Posted by Adzrules (Post 4819100)

My apologies. I'm using this website: http://kb.iu.edu/data/afiy.html to help me get to grips with things.

So far, I figure that if I use: fgrep -f file_2.txt file_1.txt
Then that should search file_1 for all occurrences found in file_2? No? I would then look at expanding that to pipe the results into some kind of function to grab the gene sequences and extra information from text_1.txt as well, and then > new_file.txt for the final results.

combining mine and millgates' suggestions, this should work:

Code:

fgrep -A 1 -f file_2.txt file_1.txt > /whatever/floats/your/boat.txt

Adzrules

10-31-2012 01:21 PM

Quote:

Originally Posted by schneidz (Post 4819109)

combining mine and millgates' suggestions, this should work:

Code:

fgrep -A 1 -f file_2.txt file_1.txt > /whatever/floats/your/boat.txt

Thanks, but that just seems to write nothing to a text file. The file gets created and no errors are reported but the text file output is blank!
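
A possible explanation, assuming the files look as described in the first post: with -f file_2.txt the patterns are the ">" headers and the sequences themselves, and none of those strings occurs in the bare list of gene names, so grep matches nothing and exits with status 1. Swapping the two file names makes the gene names the patterns. The sample data and the names swapped.txt and fixed.txt are made up.

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1

printf '%s\n' 'CDS.002' > file_1.txt
printf '%s\n' '>CDS.001' 'AAAA' '>CDS.002' 'CCCC' > file_2.txt

# Patterns come from file_2.txt (">CDS.001", "AAAA", ...); none of them
# occurs inside the line "CDS.002", so nothing is written.
swapped_status=0
grep -F -A 1 -f file_2.txt file_1.txt > swapped.txt || swapped_status=$?

# Patterns come from file_1.txt ("CDS.002"); file_2.txt is the file searched.
fixed_status=0
grep -F -A 1 -f file_1.txt file_2.txt > fixed.txt || fixed_status=$?

echo "swapped: exit status $swapped_status"
echo "fixed:   exit status $fixed_status"
cat fixed.txt
```

An exit status of 1 with no error message is grep's way of reporting "no lines selected", which would fit an empty output file.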

schneidz

10-31-2012 01:24 PM

Quote:

Originally Posted by Adzrules (Post 4819115)

Thanks, but that just seems to write nothing to a text file. The file gets created and no errors are reported but the text file output is blank!

can you post this diagnostic info:

Code:

head file_2.txt file_1.txt

grep -A 1 -f file_2.txt file_1.txt

Adzrules

10-31-2012 04:53 PM

Sure, here's what I get:

Code:

dave@dave-VirtualBox ~/Downloads/test $ head file_2.txt file_1.txt
==> file_2.txt <==
>CDS.001
MSENGNKNIAIVEAFSETDKKTGEVVTLVPNTNNTVQPVALMRLGLFVPTLKSTSRGRKGQMVSMDASAELKQLSLAKAEGYEDIRISGLRLDMDNDFKTWVGIIHAFAKHKVVGDTVTLPFVEFVRLCGIPTARSSAKLRKRLDSSLSRIATNTISFRSKGSDEFYVTHLVQTAKYSVKHDTVELKADPKIFELYQFDKKVLLQLRAINELSRKESAQALYTFIESLPPDPAPISLARLRARLNLTSRTITQNATVRKAMEQLREIGYLDYTEVKRGNSVYFVIHYRRPKLRQAQISTKIDNDETEYSLPDENQDDIIDVVPDEKEGKMVMLSKEELALLEELRKAKTRK
>CDS.002
MALTLAGLEIEKTSGYWRAKGFKQPGILERLEREDGYIVHQRREWRMYNPETGKLTTKAGTLWGLLKKIH
>CDS.003
MKAIPFALLFLSSIVVADTTVYQCEMSVADVKNGALTDVIKAPYGAMVVDSGDQFYVVRDDRVLSSPYLTNRNGKLTGVGEDHFVYNKYKGFYGVHASQQSYLFDDCKEVG
>CDS.004
MKIFIEYLLLIVSIAFVIDCIFTGVIRKVFSPVHDVVINALAIVLVFNSAFDVIKEVAA
>CDS.005
MRVLVRIVTSTVYDVFPVFMVKADGLNDEETDALIQRILVEYTGHDADSVMVDDDGVCWHNGNCWYVEETQQISDEDAEHLERILSISTFE

==> file_1.txt <==
CDS.015          HCM2.0015c  
CDS.117          HCM2.0122c  
CDS.096          HCM2.0104c  
CDS.060          HCM2.0069c  
CDS.068          HCM2.0078  
CDS.061          HCM2.0070c  
CDS.027          HCM2.0035c  
CDS.031          HCM2.0041c  
CDS.030          HCM2.0040c  
CDS.116          HCM2.0121c  
dave@dave-VirtualBox ~/Downloads/test $ grep -A 1 -f file_2.txt file_1.txt
dave@dave-VirtualBox ~/Downloads/test $

schneidz

10-31-2012 05:11 PM

in your first post you explain that file_1.txt contains a list of stuff like:

Quote:

Originally Posted by Adzrules (Post 4818872)

CDS.002
CDS.005
CDS.035
etc.

but it really looks like:

Quote:

Originally Posted by Adzrules (Post 4819245)

Code:

...==> file_1.txt <==
CDS.015          HCM2.0015c  
CDS.117          HCM2.0122c  
CDS.096          HCM2.0104c  
CDS.060          HCM2.0069c  
CDS.068          HCM2.0078  
CDS.061          HCM2.0070c  
CDS.027          HCM2.0035c  
CDS.031          HCM2.0041c  
CDS.030          HCM2.0040c  
CDS.116          HCM2.0121c

so according to the above, the string "CDS.015 HCM2.0015c" doesn't exist in file_2.txt.
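
If the list really did carry a second column like that, one option would be to cut it down to the first field before handing it to grep. A small sketch with awk; list.txt and ids.txt are hypothetical names, and the two rows are copied from the listing above.

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1

# Two-column rows like the ones quoted above, trailing spaces included.
printf '%s\n' 'CDS.015          HCM2.0015c  ' 'CDS.117          HCM2.0122c  ' > list.txt

# awk splits on runs of whitespace, so $1 is the bare name with no padding.
# The NF guard skips blank lines.
awk 'NF { print $1 }' list.txt > ids.txt

cat ids.txt
```

The resulting ids.txt holds one bare name per line, which is the shape grep -f expects for a pattern file.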

Adzrules

10-31-2012 05:34 PM

Quote:

Originally Posted by schneidz (Post 4819253)

in your first post you explain that file_1.txt contains a list of stuff like: but it really looks like: so according to the above, the string "CDS.015 HCM2.0015c" doesn't exist in file_2.txt.

Apologies, I selected the wrong file (I renamed them to file_1.txt etc. for the benefit of your viewing). This is what it really is:

Code:

dave@dave-VirtualBox ~/Downloads/test $ head file_2.txt file_1.txt
==> file_2.txt <==
>CDS.001
MSENGNKNIAIVEAFSETDKKTGEVVTLVPNTNNTVQPVALMRLGLFVPTLKSTSRGRKGQMVSMDASAELKQLSLAKAEGYEDIRISGLRLDMDNDFKTWVGIIHAFAKHKVVGDTVTLPFVEFVRLCGIPTARSSAKLRKRLDSSLSRIATNTISFRSKGSDEFYVTHLVQTAKYSVKHDTVELKADPKIFELYQFDKKVLLQLRAINELSRKESAQALYTFIESLPPDPAPISLARLRARLNLTSRTITQNATVRKAMEQLREIGYLDYTEVKRGNSVYFVIHYRRPKLRQAQISTKIDNDETEYSLPDENQDDIIDVVPDEKEGKMVMLSKEELALLEELRKAKTRK
>CDS.002
MALTLAGLEIEKTSGYWRAKGFKQPGILERLEREDGYIVHQRREWRMYNPETGKLTTKAGTLWGLLKKIH
>CDS.003
MKAIPFALLFLSSIVVADTTVYQCEMSVADVKNGALTDVIKAPYGAMVVDSGDQFYVVRDDRVLSSPYLTNRNGKLTGVGEDHFVYNKYKGFYGVHASQQSYLFDDCKEVG
>CDS.004
MKIFIEYLLLIVSIAFVIDCIFTGVIRKVFSPVHDVVINALAIVLVFNSAFDVIKEVAA
>CDS.005
MRVLVRIVTSTVYDVFPVFMVKADGLNDEETDALIQRILVEYTGHDADSVMVDDDGVCWHNGNCWYVEETQQISDEDAEHLERILSISTFE

==> file_1.txt <==
HCM2.0015c  
HCM2.0122c  
HCM2.0104c  
HCM2.0069c  
HCM2.0078  
HCM2.0070c  
HCM2.0035c  
HCM2.0041c  
HCM2.0040c  
HCM2.0121c  
dave@dave-VirtualBox ~/Downloads/test $ grep -A 1 -f file_2.txt file_1.txt
dave@dave-VirtualBox ~/Downloads/test $

schneidz

10-31-2012 05:41 PM

^ is there still a typo?

i don't see how, for example, the string "HCM2.0015c" exists in file_2.txt?

cbtshare

10-31-2012 06:22 PM

I just created a similar mock test and this works:

Quote:

fgrep -f file2.txt file1.txt

Adzrules

10-31-2012 07:11 PM

Quote:

Originally Posted by schneidz (Post 4819263)

^ is there still a typo?

i don't see how, for example, the string "HCM2.0015c" exists in file_2.txt?

Oh for goodness sake! Again, I'm being retarded, wrong file AGAIN! Sorry.

Code:

dave@dave-VirtualBox ~/Downloads/test $ head file_2.txt file_1.txt
==> file_2.txt <==
>CDS.001
MSENGNKNIAIVEAFSETDKKTGEVVTLVPNTNNTVQPVALMRLGLFVPTLKSTSRGRKGQMVSMDASAELKQLSLAKAEGYEDIRISGLRLDMDNDFKTWVGIIHAFAKHKVVGDTVTLPFVEFVRLCGIPTARSSAKLRKRLDSSLSRIATNTISFRSKGSDEFYVTHLVQTAKYSVKHDTVELKADPKIFELYQFDKKVLLQLRAINELSRKESAQALYTFIESLPPDPAPISLARLRARLNLTSRTITQNATVRKAMEQLREIGYLDYTEVKRGNSVYFVIHYRRPKLRQAQISTKIDNDETEYSLPDENQDDIIDVVPDEKEGKMVMLSKEELALLEELRKAKTRK
>CDS.002
MALTLAGLEIEKTSGYWRAKGFKQPGILERLEREDGYIVHQRREWRMYNPETGKLTTKAGTLWGLLKKIH
>CDS.003
MKAIPFALLFLSSIVVADTTVYQCEMSVADVKNGALTDVIKAPYGAMVVDSGDQFYVVRDDRVLSSPYLTNRNGKLTGVGEDHFVYNKYKGFYGVHASQQSYLFDDCKEVG
>CDS.004
MKIFIEYLLLIVSIAFVIDCIFTGVIRKVFSPVHDVVINALAIVLVFNSAFDVIKEVAA
>CDS.005
MRVLVRIVTSTVYDVFPVFMVKADGLNDEETDALIQRILVEYTGHDADSVMVDDDGVCWHNGNCWYVEETQQISDEDAEHLERILSISTFE

==> file_1.txt <==
CDS.015  
CDS.117  
CDS.096  
CDS.060  
CDS.068  
CDS.061  
CDS.027  
CDS.031  
CDS.030  
CDS.116  
dave@dave-VirtualBox ~/Downloads/test $ 
dave@dave-VirtualBox ~/Downloads/test $ grep -A 1 -f file_2.txt file_1.txt
dave@dave-VirtualBox ~/Downloads/test $

That's what it actually looks like. Definitely this time!

Code:

fgrep -f file2.txt file1.txt > test1.txt

That doesn't seem to work for me. Again, it just creates a blank text file!

All times are GMT -5. The time now is 11:43 AM.
