LinuxQuestions.org
Old 07-10-2012, 02:09 PM   #1
ramma
LQ Newbie
 
Registered: Jul 2012
Posts: 8

Rep: Reputation: Disabled
Search two text documents, eliminate matches, print third document with unmatched+.


Simply put, I'm trying to get a script to take two documents and print a third document based on what does not match between the first two.

One document is just identifiers and some gibberish that should be ignored, like this.

Code:
IDENT001  gibberish gibberish
IDENT002  gibberish gibberish
IDENT003  gibberish gibberish
IDENT004  gibberish gibberish
The second document is a bit more complicated with an identifier, and up to a few thousand characters following, plus a lot of gibberish surrounding the identifier. Each line of interest with the identifier starts with a >. It looks something like this.

Code:
>somethingrandom_IDENT001 morerandom
ATCG up to a few thousand characters in length
>somethingrandom_IDENT002 morerandom
ATCG up to a few thousand characters in length
What I need to do then is match the identifiers from the first file to identifiers in the second file and then list all non-matched content into a new file, including everything between a set of >'s for non-matches.

I hope that makes sense. Thanks for any help.

Last edited by ramma; 07-12-2012 at 12:56 PM.
 
Old 07-10-2012, 02:21 PM   #2
Kustom42
Senior Member
 
Registered: Mar 2012
Distribution: Red Hat
Posts: 1,604

Rep: Reputation: 415
Have you looked at just using the "diff" command?

---------- Post added 07-10-12 at 12:21 PM ----------

ss64.com/bash/diff.html
 
Old 07-10-2012, 07:51 PM   #3
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
File1 contains...
Code:
IDENT001  gibberish gibberish
IDENT002  gibberish gibberish
IDENT003  gibberish gibberish
IDENT004  gibberish gibberish
Question: Do the identifiers always begin with IDENT, or could they be any character string?
Question: Are all identifiers of the same length?
Question: What is between the identifier and the gibberish? Tab character? Blank character? More than one of either?

File2 contains...
Code:
>somethingrandom_IDENT001 morerandom
ATCG up to a few thousand characters in length
>somethingrandom_IDENT002 morerandom
ATCG up to a few thousand characters in length
Question: Is the identifier always in the same character position? If so, what is that position?
Question: Does the identifier always lie between an underscore and a blank? If so, do you guarantee that "somethingrandom" does not contain an underscore?

Daniel B. Martin
 
Old 07-10-2012, 10:06 PM   #4
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192
Daniel has asked some good qualifying questions. I would add: what have you done in an attempt to solve this?

I would also add: have you searched these forums or Google at all? I ask because I have seen a number of solutions dealing with ATCG-type data (not sure of the field name), including lines starting with >. Try searching for questions by the user kmkocot.
 
Old 07-11-2012, 12:27 PM   #5
ramma
LQ Newbie
 
Registered: Jul 2012
Posts: 8

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by danielbmartin
File1 contains...
Code:
IDENT001  gibberish gibberish
IDENT002  gibberish gibberish
IDENT003  gibberish gibberish
IDENT004  gibberish gibberish
Question: Do the identifiers always begin with IDENT, or could they be any character string?
Question: Are all identifiers of the same length?
Question: What is between the identifier and the gibberish? Tab character? Blank character? More than one of either?

File2 contains...
Code:
>somethingrandom_IDENT001 morerandom
ATCG up to a few thousand characters in length
>somethingrandom_IDENT002 morerandom
ATCG up to a few thousand characters in length
Question: Is the identifier always in the same character position? If so, what is that position?
Question: Does the identifier always lie between an underscore and a blank? If so, do you guarantee that "somethingrandom" does not contain an underscore?

Daniel B. Martin
Question: Do the identifiers always begin with IDENT, or could they be any character string?
Answer: They always begin with the same identifier text; the only thing that changes is the number, which counts from 1~2700.
Question: Are all identifiers of the same length?
Answer: They are the same length.
Question: What is between the identifier and the gibberish? Tab character? Blank character? More than one of either?
Answer: It's tab delimited between the identifier and the gibberish.

Question: Is the identifier always in the same character position? If so, what is that position?
Answer: It is in the same position. The specific line of interest looks like this: ">canFam2_ensGene_ENSCAFT00000001581"
Question: Does the identifier always lie between an underscore and a blank? If so, do you guarantee that "somethingrandom" does not contain an underscore?
Answer: Yes, but there are underscores before the identifier, as shown above.
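
For readers following along, here is a minimal sketch of pulling the identifier out of such a header line, assuming the identifier is whatever sits between the last underscore and the first blank. The sample header and the function name extract_id are made up for illustration and are not from the poster's files.

```shell
# Hypothetical sample header in the layout described above.
header=">canFam2_ensGene_ENSCAFT00000000008 range=chr1:67099213-67141812 strand=+"

# Assumption: the identifier is the text after the last underscore
# and before the first blank on the header line.
extract_id() {
    printf '%s\n' "$1" |
        sed -e 's/^>//' \
            -e 's/[[:space:]].*$//' \
            -e 's/^.*_//'
}

extract_id "$header"
```

The three sed expressions run in order: drop the leading ">", drop everything from the first blank onward, then drop everything up to and including the last underscore.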
 
Old 07-11-2012, 12:30 PM   #6
ramma
LQ Newbie
 
Registered: Jul 2012
Posts: 8

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by grail
Daniel has asked some good qualifying questions. I would add: what have you done in an attempt to solve this?

I would also add: have you searched these forums or Google at all? I ask because I have seen a number of solutions dealing with ATCG-type data (not sure of the field name), including lines starting with >. Try searching for questions by the user kmkocot.
I've done a number of Google searches and searches of these forums. I'm not very familiar with text file manipulation, so it was hard for me to understand how to adapt much of what I found to this specific situation. Many people were interested in just viewing exact matches, but I couldn't find a solution that also keeps the block of text following a non-matching header in the output. Thanks for pointing me to that user, I'll have a look.

Update:
I've been working a little with awk to try and compare the files. I removed all unwanted data from the first file, so now it looks as follows. I may also remove the __sherwood_1u__1 part as it's no longer needed.
Code:
canFam2_ensGene_ENSCAFT00000000008__sherwood_1u__1
canFam2_ensGene_ENSCAFT00000000009__sherwood_1u__1
canFam2_ensGene_ENSCAFT00000000011__sherwood_1u__1
In trying to trim down the other file with awk I ran into a problem. I can't seem to get the sequence data following the header to be identified as one field by awk. Instead each line is a separate output, which won't quite work as there is a variable number of lines per entry, such as shown below.
Code:
>somethingrandom_IDENT001 morerandom
ATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCG
>somethingrandom_IDENT002 morerandom
ATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCG
Thanks again for any help.

Last edited by ramma; 07-11-2012 at 02:29 PM. Reason: Updates on what I've tried.
 
Old 07-11-2012, 02:46 PM   #7
ramma
LQ Newbie
 
Registered: Jul 2012
Posts: 8

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by ramma
In trying to trim down the other file with awk I ran into a problem. I can't seem to get the sequence data following the header to be identified as one field by awk. Instead each line is a separate output, which won't quite work as there is a variable number of lines per entry, such as shown below.
Code:
>somethingrandom_IDENT001 morerandom
ATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCG
>somethingrandom_IDENT002 morerandom
ATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCG
Thanks again for any help.
Managed to solve this. Setting FS = "none\n" and RS = ">" did it somehow.
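
A minimal sketch of why that works, for anyone landing here later. It assumes FS was meant to be a plain newline ("\n"); the sample file and the function name summarize are invented for the example. With RS set to ">", awk treats everything between two ">" characters as one record, and with FS set to a newline each line of that record becomes one field.

```shell
# Hypothetical sample in the same layout as the sequence file.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>somethingrandom_IDENT001 morerandom
ATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCG
>somethingrandom_IDENT002 morerandom
ATCGATCGATCG
EOF

# RS=">"  : one record per header-plus-sequence block.
# FS="\n" : one field per line, so $1 is the header line.
# NR > 1  : skip the empty record that sits before the first ">".
summarize() {
    awk 'BEGIN { RS = ">"; FS = "\n" }
         NR > 1 {
             seq = ""
             for (i = 2; i <= NF; i++) seq = seq $i   # join the sequence lines
             print $1 "\t" length(seq)
         }' "$1"
}

summarize "$fasta"
```

Each output line is a header followed by a tab and the total length of its sequence, which shows that the variable number of sequence lines has been gathered into a single record.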
 
Old 07-11-2012, 03:10 PM   #8
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
You are simplifying the input files, and that's good.
We are shooting at a moving target, and that's bad.
When you settle down with a stable pair of input files,
post what they look like.

In the sample files already shown the identifiers appear to be in sorted order. May we count on that?

Daniel B. Martin
 
Old 07-11-2012, 04:39 PM   #9
ramma
LQ Newbie
 
Registered: Jul 2012
Posts: 8

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by danielbmartin
You are simplifying the input files, and that's good.
We are shooting at a moving target, and that's bad.
When you settle down with a stable pair of input files,
post what they look like.

In the sample files already shown the identifiers appear to be in sorted order. May we count on that?

Daniel B. Martin
For the file containing just identifiers, they are in order. For the second file with sequences following the identifiers they are in a random order.
 
Old 07-11-2012, 05:07 PM   #10
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Quote:
Originally Posted by ramma
For the file containing just identifiers, they are in order. For the second file with sequences following the identifiers they are in a random order.
Thank you for that clarification.

Are there any characters which your input files are guaranteed to not have? Tilde (~) or Backtick (`) perhaps?

When you settle down with a stable pair of input files, post what they look like.

Daniel B. Martin
 
Old 07-11-2012, 05:21 PM   #11
ramma
LQ Newbie
 
Registered: Jul 2012
Posts: 8

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by danielbmartin
Thank you for that clarification.

Are there any characters which your input files are guaranteed to not have? Tilde (~) or Backtick (`) perhaps?

When you settle down with a stable pair of input files, post what they look like.

Daniel B. Martin
This is what they'll look like, copied from the files, trimmed down and numbers changed to work for the example. I can't find an easy way to sort out the excess content in the second file, so I'm just leaving it for now.

File 1
Code:
>canFam2_ensGene_ENSCAFT00000000008 range=chr1:67099213-67141812 5'pad=0 3'pad=0 strand=+ repeatMasking=none
ACGCAGGTACAACTTGGCCAAGTAGAAATCAAATGCCCTATCACAGAATG
TTTTGAATTCCTGGAAGAAAGAACGATTGTCTTTAACTTAACACATGAAG
ATTCCATCAAATATAAGTACTTCTTGGAACTTGGCCGTATCGATTCCAGC
>canFam2_ensGene_ENSCAFT000000000002 range=chr1:24823786-25526201 5'pad=0 3'pad=0 strand=- repeatMasking=none
TTTATTTCCCTGTGTTTCCTTGTTGCAGGTTTCCAGATTAAGCCTTTCAC
ATCACTGCACTTCCTGTCCGAGCCTTCTGATGCTGTCACCATGCGGGGAG
>canFam2_ensGene_ENSCAFT00000000001 range=chr1:75336645-75504057 5'pad=0 3'pad=0 strand=- repeatMasking=none
CGGTCTATCATGACTGTGTTCAGGCAGGAAAACGTGGATGATTACTACGA
CACCGGCGAGGAGCTCGGCAGTGGGCAGTTTGCAGTCGTGAAGAAATGCC
GGGAGAAGAGCACTGGTCTCCAGTATGCCGCCAAGTTTATCAAGAAGAGG
CGGACGAAATCCAGCCGGCGGGGCGTGAGCCGCGAGGACATCGAGCGGGA
>canFam2_ensGene_ENSCAFT00000000009 range=chr1:117371505-117483397 5'pad=0 3'pad=0 strand=- repeatMasking=none
GTGTCTGCGGGAGCCCTGACCCCACAGTCTGCTCGCAGGATGATGAAGTG
File 2
Code:
canFam2_ensGene_ENSCAFT00000000008__sherwood_1u__1
canFam2_ensGene_ENSCAFT00000000009__sherwood_1u__1
Ideal output - The two which matched were not copied to output, while the two unmatched were copied plus their sequence.
Code:
>canFam2_ensGene_ENSCAFT000000000002 range=chr1:24823786-25526201 5'pad=0 3'pad=0 strand=- repeatMasking=none
TTTATTTCCCTGTGTTTCCTTGTTGCAGGTTTCCAGATTAAGCCTTTCAC
ATCACTGCACTTCCTGTCCGAGCCTTCTGATGCTGTCACCATGCGGGGAG
>canFam2_ensGene_ENSCAFT00000000001 range=chr1:75336645-75504057 5'pad=0 3'pad=0 strand=- repeatMasking=none
CGGTCTATCATGACTGTGTTCAGGCAGGAAAACGTGGATGATTACTACGA
CACCGGCGAGGAGCTCGGCAGTGGGCAGTTTGCAGTCGTGAAGAAATGCC
GGGAGAAGAGCACTGGTCTCCAGTATGCCGCCAAGTTTATCAAGAAGAGG
CGGACGAAATCCAGCCGGCGGGGCGTGAGCCGCGAGGACATCGAGCGGGA
 
Old 07-11-2012, 05:26 PM   #12
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Now you have nailed down the content and format of two test files. Good.

The goal is to identify mismatches based on the Identifier.
There are two possible kinds of mismatches:
- present in File1 and absent in File2
- present in File2 and absent in File1

Are you interested in both kinds, or only one?

Daniel B. Martin
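
To make the two kinds of mismatch concrete, here is a small sketch using comm on two sorted ID lists. The IDs and variable names are made up for illustration; the real files would first need their identifiers extracted into one-per-line lists.

```shell
# Hypothetical ID lists, one identifier per line.
ids_a=$(mktemp)
ids_b=$(mktemp)
printf '%s\n' ID001 ID002 ID008 ID009 > "$ids_a"
printf '%s\n' ID008 ID009 ID010 > "$ids_b"

# comm expects sorted input, so sort both lists in place first.
sort -o "$ids_a" "$ids_a"
sort -o "$ids_b" "$ids_b"

# comm prints three columns: only in the first file, only in the
# second file, and in both. The -N options suppress columns.
only_in_a=$(comm -23 "$ids_a" "$ids_b")   # keep column 1 only
only_in_b=$(comm -13 "$ids_a" "$ids_b")   # keep column 2 only

printf 'only in first list:\n%s\n' "$only_in_a"
printf 'only in second list:\n%s\n' "$only_in_b"
```

If only one direction matters, only one of the two comm calls is needed.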
 
Old 07-11-2012, 05:31 PM   #13
ramma
LQ Newbie
 
Registered: Jul 2012
Posts: 8

Original Poster
Rep: Reputation: Disabled
Every line in file 2 should match a line in file 1. So I want the mismatches that are present in file 1 but absent in file 2, along with the sequence data that follows each one.
 
Old 07-11-2012, 08:47 PM   #14
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
A proposed solution...
Code:
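# $InFile1 is the ID list, $InFile2 the sequence file.
# Strip the "__..." suffix from each ID, anchor it at a ">" header,
# and let grep drop every line of $InFile2 that matches one of them.
sed -r 's/__/~/1' "$InFile1" \
|cut -d~ -f1                 \
|sed 's/^/^>/'               \
|grep -v -f - "$InFile2"     \
> "$OutFile"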
Results generated by this proposed solution...
Code:
>canFam2_ensGene_ENSCAFT000000000002 range=chr1:24823786-25526201 5'pad=0 3'pad=0 strand=- repeatMasking=noneTTTATTTCCCTGTGTTTCCTTGTTGCAGGTTTCCAGATTAAGCCTTTCACATCACTGCACTTCCTGTCCGAGCCTTCTGATGCTGTCACCATGCGGGGAG
>canFam2_ensGene_ENSCAFT00000000001 range=chr1:75336645-75504057 5'pad=0 3'pad=0 strand=- repeatMasking=noneCGGTCTATCATGACTGTGTTCAGGCAGGAAAACGTGGATGATTACTACGACACCGGCGAGGAGCTCGGCAGTGGGCAGTTTGCAGTCGTGAAGAAATGCCGGGAGAAGAGCACTGGTCTCCAGTATGCCGCCAAGTTTATCAAGAAGAGGCGGACGAAATCCAGCCGGCGGGGCGTGAGCCGCGAGGACATCGAGCGGGA
Daniel B. Martin

Last edited by danielbmartin; 07-11-2012 at 09:13 PM. Reason: Correct t7po
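
An alternative sketch in awk, offered under assumptions rather than as the thread's accepted answer. grep -v filters line by line, so the pipeline above only removes a whole record when header and sequence sit on the same line, and a pattern such as ^>...008 would also match a longer ID that merely starts with those characters. The version below keeps the multi-line layout and compares whole identifiers. The file names are hypothetical and the sample headers are trimmed from the ones posted earlier.

```shell
# Hypothetical file names; contents mimic the samples posted above.
ids=$(mktemp)
seqs=$(mktemp)
out=$(mktemp)
cat > "$ids" <<'EOF'
canFam2_ensGene_ENSCAFT00000000008__sherwood_1u__1
canFam2_ensGene_ENSCAFT00000000009__sherwood_1u__1
EOF
cat > "$seqs" <<'EOF'
>canFam2_ensGene_ENSCAFT00000000008 range=chr1:67099213-67141812 strand=+
ACGCAGGTACAACTTGGCCAAGTAGAAATCAAATGCCCTATCACAGAATG
TTTTGAATTCCTGGAAGAAAGAACGATTGTCTTTAACTTAACACATGAAG
>canFam2_ensGene_ENSCAFT00000000001 range=chr1:75336645-75504057 strand=-
CGGTCTATCATGACTGTGTTCAGGCAGGAAAACGTGGATGATTACTACGA
>canFam2_ensGene_ENSCAFT00000000009 range=chr1:117371505-117483397 strand=-
GTGTCTGCGGGAGCCCTGACCCCACAGTCTGCTCGCAGGATGATGAAGTG
EOF

awk '
    # First file (the ID list): store each ID without its "__..." suffix.
    # Assumes the ID list is not empty, otherwise NR == FNR misfires.
    NR == FNR { sub(/__.*/, ""); seen[$0] = 1; next }
    # Second file: each ">" line starts a record. $1 is the header up to
    # the first blank; substr($1, 2) drops the leading ">".
    /^>/      { keep = !(substr($1, 2) in seen) }
    # Print the header and every sequence line of records we keep.
    keep
' "$ids" "$seqs" > "$out"

cat "$out"
```

The keep flag is set once per header and then applies to every sequence line until the next header, so a record is either written out whole or skipped whole.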
 
Old 07-11-2012, 11:58 PM   #15
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192
I wonder if I could be a pain and ask for the 2 original files? Dodgy data is fine as long as the format is exactly the same.

The reason I ask is that your newly formed and trimmed data still requires multiple manipulations of fields to coerce it towards the output.
 
  

