Search text file for records in another text file and pull extra data over to new
Hey guys,
I'm a massive newbie to the whole Linux scene but am enjoying the flexibility offered so far. However, I have searched for a while and cannot work out how to tackle this problem.

I basically have two files: file_1.txt and file_2.txt

file_1.txt contains a list of genes, with each line containing a new gene e.g.:

CDS.002
CDS.005
CDS.035
etc.

file_2.txt contains the original gene data in the following format:

>CDS.001
MSENGNKNIAIVEAFSETDKKTGEVVTLVPNTNNTVQPVALMRLGLFVPTLKSTSRGR
>CDS.002
MALTLAGLEIEKTSGYWRAKGFKQPGILERLE
>CDS.003
MKAIPFALLFLSSIVVADTTVYQCEMSVADVKNGALTDVIKAPYGAMVVDSGDQFYVVRDDRVLSSPYLTNRNGKLTGVGEDHFVYNKYKGFYGVHASQQ
SYLFDDCKEVG
etc.

Basically I want to search file_2.txt for all of the genes in file_1.txt. If found, I then want to put both the gene name (e.g. >CDS.002) AND the gene sequence (e.g. MALTLAGLEIEKTSGYWRAKGFKQPGILERLE) from file_2.txt into a new text file. I want all of the results in the final text file. So, I'd end up with a text file similar to this:

>CDS.002
MSENGNKIAIVEAFSETDKKTGEVVTLVPNTNNTVQPVALMRLGLFVPTLKSTSRGR
>CDS.005
MALTLAGLEIEKTSGYWRAKGFKQPGILERLEYVVRDDRVLSSPYLTNRNGKLTGVGEDHFVYN
>CDS.035
MKAIPFALLFLSSIVVADTTVYQCEMSVADVKNGALTDVIKAPYGAMVVDSGDQFKYKGFYGVHASQQSYLFDDCKEVGFSETDKKTGEVVTLVPAGLEI
EKTSGYWRAKGFKQPGILERLEYVVRDD
etc.

Now I have no idea how complicated this is for someone that knows their stuff! But for me, it's a little over my head at this stage. I'm very new to using the Terminal, and am only just starting out learning Perl. Any assistance would be greeted with a thousand thanks :)

Thanks for any help!
Adzrules
|
This can probably be done in Perl, but I don't know how. You can probably bash it out with a while loop and grep (check the man page for the -A option).
|
Quote:
To add my 2 cents, with the grep -f option, you don't even need the loop |
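A minimal sketch of the loop-free idea, assuming the layout from the first post: file_1.txt holds one gene ID per line and file_2.txt holds the records. The sample data and the output name matches.txt are made up for illustration, and `grep -F` is used as the current spelling of `fgrep`.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data, shaped like the files in the first post.
printf '%s\n' 'CDS.002' 'CDS.003' > file_1.txt
printf '%s\n' '>CDS.001' 'MSENG' '>CDS.002' 'MALTL' '>CDS.003' 'MKAIP' > file_2.txt

# -F            treat every pattern as a fixed string, not a regex
# -f file_1.txt read the patterns, one per line, from file_1.txt
grep -F -f file_1.txt file_2.txt > matches.txt

cat matches.txt
```

On its own this only prints the matching header lines; the sequences still have to be pulled in separately.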
Thanks for the replies guys. No problem if you don't want to, but could you give me an example of how fgrep would work in this instance? Having trouble finding the relevant info online in a format that I can understand!
|
^ It is hard to correct your mistake since you haven't posted what you tried and what error you are getting.
|
Quote:
So I've just used the following:
Code:
for i in `cat file_1.txt`; do grep $i file_2.txt; done
Which shows the correct genes, but how can I modify that to make it pull the gene sequence that is present on the lines below, but not the next gene name?
Any help is MUCH appreciated! |
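One possible way to get "the lines below, but not the next gene name" is to switch from grep to awk. This is a different technique from the one discussed in the thread, sketched under the assumption that every record in file_2.txt starts with a `>` header and that file_1.txt is not empty. The sample data and the output name result.txt are hypothetical.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data; CDS.003 has a sequence wrapped over two lines.
printf '%s\n' 'CDS.002' 'CDS.003' > file_1.txt
printf '%s\n' '>CDS.001' 'MSENG' '>CDS.002' 'MALTL' \
              '>CDS.003' 'MKAIP' 'SYLFD' '>CDS.004' 'MQQQQ' > file_2.txt

# First file (NR == FNR): remember every ID listed in file_1.txt.
# Second file: on each ">" header, decide whether the record is wanted,
# then print the header and every following line until the next header.
awk '
    NR == FNR { wanted[$1] = 1; next }
    /^>/      { keep = (substr($1, 2) in wanted) }
    keep
' file_1.txt file_2.txt > result.txt

cat result.txt
```

Because the `keep` flag only changes on header lines, a sequence that wraps onto several lines is copied in full, and IDs are compared as whole strings rather than substrings.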
Quote:
Code:
fgrep -A 1 -f file_1.txt file_2.txt > /whatever/floats/your/boat.txt |
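A hedged sketch of the `-A 1` approach, assuming the layout from the first post (IDs in file_1.txt, records in file_2.txt), so the ID list goes to `-f` and the data file is the one searched. The sample data, the extra `-w` flag, and the output name genes.txt are additions for illustration, not part of the command posted above.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data; CDS.0021 is there to show why -w matters.
printf '%s\n' 'CDS.002' 'CDS.005' > file_1.txt
printf '%s\n' '>CDS.001' 'AAAA' '>CDS.002' 'MALTL' \
              '>CDS.0021' 'CCCC' '>CDS.005' 'MKAIP' > file_2.txt

# -F    fixed strings, so the "." in an ID is not a regex wildcard
# -w    whole-word match, so CDS.002 does not also hit CDS.0021
# -A 1  print one line of trailing context after each match
# The second grep drops the "--" lines printed between separate groups of matches.
grep -F -w -A 1 -f file_1.txt file_2.txt | grep -v -e '^--$' > genes.txt

cat genes.txt
```

`-A 1` copies exactly one line after each header, so a sequence that wraps onto more than one line would be cut short with this approach.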
Quote:
|
Quote:
Code:
head file_2.txt file_1.txt |
Sure, here's what I get:
Code:
dave@dave-VirtualBox ~/Downloads/test $ head file_2.txt file_1.txt |
In your first post you explain that file_1.txt contains a list of stuff like:
Quote:
Quote:
|
Quote:
Code:
dave@dave-VirtualBox ~/Downloads/test $ head file_2.txt file_1.txt |
^ Is there still a typo?
I don't see how, for example, the string "HCM2.0015c" exists in file_2.txt? |
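A quick way to check this kind of mismatch is to list the IDs from the list file that never appear as a header in the data file. The sketch below assumes the layout from the first post (IDs in file_1.txt, `>` headers in file_2.txt). The sample data and the file names ids_in_file_2.txt and missing.txt are hypothetical.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: one ID that exists, one that does not.
printf '%s\n' 'CDS.002' 'HCM2.0015c' > file_1.txt
printf '%s\n' '>CDS.001' 'MSENG' '>CDS.002' 'MALTL' > file_2.txt

# Strip the leading ">" from each header line to get the IDs present in file_2.txt.
sed -n 's/^>//p' file_2.txt > ids_in_file_2.txt

# -x  match whole lines only
# -v  invert the match: keep the IDs from file_1.txt that have no header in file_2.txt
grep -F -x -v -f ids_in_file_2.txt file_1.txt > missing.txt

cat missing.txt
```

An empty missing.txt would mean every listed ID has a matching header, while a file that repeats the whole list suggests the two files do not share an ID format at all.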
I just created a mock similar test and this works:
Quote:
|
Quote:
Code:
dave@dave-VirtualBox ~/Downloads/test $ head file_2.txt file_1.txt
Code:
fgrep -f file2.txt file1.txt > test1.txt |