[SOLVED] Search two text documents, eliminate matches, print third document with unmatched+.

ramma · 07-10-2012, 02:09 PM

Simply put, I'm trying to get a script to take two documents and print a third document based on what does not match between the first two.

One document is just identifiers and some gibberish that should be ignored, like this.

Code:

IDENT001  gibberish gibberish
IDENT002  gibberish gibberish
IDENT003  gibberish gibberish
IDENT004  gibberish gibberish

The second document is a bit more complicated with an identifier, and up to a few thousand characters following, plus a lot of gibberish surrounding the identifier. Each line of interest with the identifier starts with a >. It looks something like this.

Code:

>somethingrandom_IDENT001 morerandom
ATCG up to a few thousand characters in length
>somethingrandom_IDENT002 morerandom
ATCG up to a few thousand characters in length

What I need to do then is match the identifiers from the first file to identifiers in the second file and then list all non-matched content into a new file, including everything between a set of >'s for non-matches.

I hope that makes sense. Thanks for any help.

Kustom42 · 07-10-2012, 02:21 PM

Have you looked at just using the "diff" command?

ss64.com/bash/diff.html

danielbmartin · 07-10-2012, 07:51 PM

File1 contains...

Code:

IDENT001  gibberish gibberish
IDENT002  gibberish gibberish
IDENT003  gibberish gibberish
IDENT004  gibberish gibberish

Question: Do the identifiers always begin with IDENT, or could they be any character string?
Question: Are all identifiers of the same length?
Question: What is between the identifier and the gibberish? Tab character? Blank character? More than one of either?

File2 contains...

Code:

>somethingrandom_IDENT001 morerandom
ATCG up to a few thousand characters in length
>somethingrandom_IDENT002 morerandom
ATCG up to a few thousand characters in length

Question: Is the identifier always in the same character position? If so, what is that position?
Question: Does the identifier always lie between an underscore and a blank? If so, do you guarantee that "somethingrandom" does not contain an underscore?

Daniel B. Martin

grail · 07-10-2012, 10:06 PM

Daniel has asked some good qualifying questions. I would add: what have you done in an attempt to solve this?

I would also add: have you searched these forums or Google at all? I ask because I have seen a number of solutions dealing with ATCG-type data (not sure of the field name), including lines starting with >. Try searching for questions by the user kmkocot.

ramma · 07-11-2012, 12:27 PM

Question: Do the identifiers always begin with IDENT, or could they be any character string?
Answer: They always begin with the same identifier text; the only thing that changes is the number, which counts from 1~2700.
Question: Are all identifiers of the same length?
Answer: They are the same length.
Question: What is between the identifier and the gibberish? Tab character? Blank character? More than one of either?
Answer: It's tab delimited between the identifier and the gibberish.

Question: Is the identifier always in the same character position? If so, what is that position?
Answer: It is in the same position. The specific line of interest looks like this: ">canFam2_ensGene_ENSCAFT00000001581"
Question: Does the identifier always lie between an underscore and a blank? If so, do you guarantee that "somethingrandom" does not contain an underscore?
Answer: Yes, but there are also underscores before the identifier, as shown above.
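
Given those answers, the identifier can be pulled out of a header line by pattern rather than by character position. The following is only a minimal sketch: it assumes the identifier is the last underscore-separated piece of the first word on the header line, as in the example line above, and the sample header lines are made up for illustration.

```shell
# Sample header lines, modelled on the format described in the thread.
headers='>canFam2_ensGene_ENSCAFT00000001581 morerandom
>canFam2_ensGene_ENSCAFT00000000008 morerandom'

# Keep the text between the last underscore of the first word and the next blank.
# -n plus the p flag prints only lines where the substitution succeeded.
ids=$(printf '%s\n' "$headers" | sed -n 's/^>[^ ]*_\([^_ ]*\).*/\1/p')
printf '%s\n' "$ids"
```

Because the match is anchored on the last underscore before the first blank, extra underscores earlier in the header do not matter.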

ramma · 07-11-2012, 12:30 PM

Quote:

Originally Posted by grail

Daniel has asked some good qualifying questions. I would add: what have you done in an attempt to solve this?

I would also add: have you searched these forums or Google at all? I ask because I have seen a number of solutions dealing with ATCG-type data (not sure of the field name), including lines starting with >. Try searching for questions by the user kmkocot.

I've done a number of Google searches and searches of these forums. I'm not very familiar with text file manipulation, so it was hard for me to see how to adapt much of what I found to this specific situation. Many people only wanted to view exact matches, and I couldn't find a solution that also keeps the section of text following each non-matching line in the output. Thanks for pointing me to that user, I'll have a look.

Update:
I've been working a little with awk to try to compare the files. I removed all unwanted data from the first file, so now it looks as follows. I may also remove the __sherwood_1u__1 part as it's no longer needed.

Code:

canFam2_ensGene_ENSCAFT00000000008__sherwood_1u__1
canFam2_ensGene_ENSCAFT00000000009__sherwood_1u__1
canFam2_ensGene_ENSCAFT00000000011__sherwood_1u__1

In trying to trim down the other file with awk I ran into a problem. I can't seem to get the sequence data following the header to be identified as one field by awk. Instead each line is a separate output, which won't quite work as there is a variable number of lines per entry, such as shown below.

Code:

>somethingrandom_IDENT001 morerandom
ATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCG
>somethingrandom_IDENT002 morerandom
ATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCG

Thanks again for any help.

ramma · 07-11-2012, 02:46 PM

Quote:

Originally Posted by ramma

In trying to trim down the other file with awk I ran into a problem. I can't seem to get the sequence data following the header to be identified as one field by awk. Instead each line is a separate output, which won't quite work as there is a variable number of lines per entry, such as shown below.

Code:

>somethingrandom_IDENT001 morerandom
ATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCG
>somethingrandom_IDENT002 morerandom
ATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCG

Thanks again for any help.

Managed to solve this. Setting FS = "none\n" and RS = ">" did it somehow.
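
For readers wondering why that worked: with RS set to ">", awk treats everything between two ">" characters as one record, and with FS set to a newline the header becomes $1 and each sequence line becomes a further field. The sketch below assumes that the "none\n" above was meant as a plain newline and that ">" never occurs inside the sequence data; the sample file is made up.

```shell
# Small sample in the same shape as the second file shown in the thread.
tmp=$(mktemp -d)
cat > "$tmp/seqs.txt" <<'EOF'
>somethingrandom_IDENT001 morerandom
ATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCG
>somethingrandom_IDENT002 morerandom
ATCGATCGATCGATCGATCG
EOF

# RS=">"  : one record per entry.
# FS="\n" : $1 is the header, $2 onward are the sequence lines.
# The text before the first ">" is an empty record, so it is skipped.
summary=$(awk 'BEGIN { RS = ">"; FS = "\n" }
  $1 != "" {
    n = 0
    for (i = 2; i <= NF; i++) if ($i != "") n++   # count non-empty sequence lines
    print $1 ": " n " sequence line(s)"
  }' "$tmp/seqs.txt")
printf '%s\n' "$summary"
rm -r "$tmp"
```

Each entry is now handled as a single unit, no matter how many sequence lines it has.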

danielbmartin · 07-11-2012, 03:10 PM

You are simplifying the input files, and that's good.
We are shooting at a moving target, and that's bad.
When you settle down with a stable pair of input files,
post what they look like.

In the sample files already shown the identifiers appear to be in sorted order. May we count on that?

Daniel B. Martin

ramma · 07-11-2012, 04:39 PM

Quote:

Originally Posted by danielbmartin

You are simplifying the input files, and that's good.
We are shooting at a moving target, and that's bad.
When you settle down with a stable pair of input files,
post what they look like.

In the sample files already shown the identifiers appear to be in sorted order. May we count on that?

Daniel B. Martin

For the file containing just identifiers, they are in order. For the second file with sequences following the identifiers they are in a random order.

danielbmartin · 07-11-2012, 05:07 PM

Quote:

Originally Posted by ramma

For the file containing just identifiers, they are in order. For the second file with sequences following the identifiers they are in a random order.

Thank you for that clarification.

Are there any characters which your input files are guaranteed to not have? Tilde (~) or Backtick (`) perhaps?

When you settle down with a stable pair of input files, post what they look like.

Daniel B. Martin

ramma · 07-11-2012, 05:21 PM

Quote:

Originally Posted by danielbmartin

Thank you for that clarification.

Are there any characters which your input files are guaranteed to not have? Tilde (~) or Backtick (`) perhaps?

When you settle down with a stable pair of input files, post what they look like.

Daniel B. Martin

This is what they'll look like, copied from the files, trimmed down and numbers changed to work for the example. I can't find an easy way to sort out the excess content in the second file, so I'm just leaving it for now.

File 1

Code:

>canFam2_ensGene_ENSCAFT00000000008 range=chr1:67099213-67141812 5'pad=0 3'pad=0 strand=+ repeatMasking=none
ACGCAGGTACAACTTGGCCAAGTAGAAATCAAATGCCCTATCACAGAATG
TTTTGAATTCCTGGAAGAAAGAACGATTGTCTTTAACTTAACACATGAAG
ATTCCATCAAATATAAGTACTTCTTGGAACTTGGCCGTATCGATTCCAGC
>canFam2_ensGene_ENSCAFT000000000002 range=chr1:24823786-25526201 5'pad=0 3'pad=0 strand=- repeatMasking=none
TTTATTTCCCTGTGTTTCCTTGTTGCAGGTTTCCAGATTAAGCCTTTCAC
ATCACTGCACTTCCTGTCCGAGCCTTCTGATGCTGTCACCATGCGGGGAG
>canFam2_ensGene_ENSCAFT00000000001 range=chr1:75336645-75504057 5'pad=0 3'pad=0 strand=- repeatMasking=none
CGGTCTATCATGACTGTGTTCAGGCAGGAAAACGTGGATGATTACTACGA
CACCGGCGAGGAGCTCGGCAGTGGGCAGTTTGCAGTCGTGAAGAAATGCC
GGGAGAAGAGCACTGGTCTCCAGTATGCCGCCAAGTTTATCAAGAAGAGG
CGGACGAAATCCAGCCGGCGGGGCGTGAGCCGCGAGGACATCGAGCGGGA
>canFam2_ensGene_ENSCAFT00000000009 range=chr1:117371505-117483397 5'pad=0 3'pad=0 strand=- repeatMasking=none
GTGTCTGCGGGAGCCCTGACCCCACAGTCTGCTCGCAGGATGATGAAGTG

File 2

Code:

canFam2_ensGene_ENSCAFT00000000008__sherwood_1u__1
canFam2_ensGene_ENSCAFT00000000009__sherwood_1u__1

Ideal output - The two which matched were not copied to output, while the two unmatched were copied plus their sequence.

Code:

>canFam2_ensGene_ENSCAFT000000000002 range=chr1:24823786-25526201 5'pad=0 3'pad=0 strand=- repeatMasking=none
TTTATTTCCCTGTGTTTCCTTGTTGCAGGTTTCCAGATTAAGCCTTTCAC
ATCACTGCACTTCCTGTCCGAGCCTTCTGATGCTGTCACCATGCGGGGAG
>canFam2_ensGene_ENSCAFT00000000001 range=chr1:75336645-75504057 5'pad=0 3'pad=0 strand=- repeatMasking=none
CGGTCTATCATGACTGTGTTCAGGCAGGAAAACGTGGATGATTACTACGA
CACCGGCGAGGAGCTCGGCAGTGGGCAGTTTGCAGTCGTGAAGAAATGCC
GGGAGAAGAGCACTGGTCTCCAGTATGCCGCCAAGTTTATCAAGAAGAGG
CGGACGAAATCCAGCCGGCGGGGCGTGAGCCGCGAGGACATCGAGCGGGA

danielbmartin · 07-11-2012, 05:26 PM

Now you have nailed down the content and format of two test files. Good.

The goal is to identify mismatches based on the Identifier.
There are two possible kinds of mismatches:
- present in File1 and absent in File2
- present in File2 and absent in File1

Are you interested in both kinds, or only one?

Daniel B. Martin
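
The two kinds of mismatch listed above correspond to the two one-sided set differences. A sketch using comm, assuming the identifiers have already been extracted into two lists (the file names ids1.txt and ids2.txt and their contents are hypothetical) and sorted, since comm requires sorted input:

```shell
tmp=$(mktemp -d)
# Hypothetical, already-extracted identifier lists, one identifier per line.
printf '%s\n' ENSCAFT00000000001 ENSCAFT00000000008 ENSCAFT00000000009 | sort > "$tmp/ids1.txt"
printf '%s\n' ENSCAFT00000000008 ENSCAFT00000000009 ENSCAFT00000000011 | sort > "$tmp/ids2.txt"

# comm -23 keeps lines found only in the first file,
# comm -13 keeps lines found only in the second file.
only_in_1=$(comm -23 "$tmp/ids1.txt" "$tmp/ids2.txt")
only_in_2=$(comm -13 "$tmp/ids1.txt" "$tmp/ids2.txt")
echo "present in File1, absent in File2: $only_in_1"
echo "present in File2, absent in File1: $only_in_2"
rm -r "$tmp"
```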

ramma · 07-11-2012, 05:31 PM

Every line in file 2 should match a line in file 1. So I'm interested in the mismatches that are present in file 1 (along with the sequence data that follows each one) but absent from file 2.
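
With the requirement pinned down this way, one possible approach (not one proposed in the thread at this point) is a single awk pass: load the identifiers from file 2, then print every file 1 entry whose header identifier is not among them. This is only a sketch. It assumes the file names seqs.txt and ids.txt (hypothetical), that the identifier in file 2 is everything before the first "__", and that in file 1 it is the first word of the header without the leading ">". The sample data is abbreviated from the files posted above.

```shell
tmp=$(mktemp -d)
cat > "$tmp/seqs.txt" <<'EOF'
>canFam2_ensGene_ENSCAFT00000000008 range=chr1:67099213-67141812
ACGCAGGTACAACTTGGCCAAG
TTTTGAATTCCTGGAAGAAAGA
>canFam2_ensGene_ENSCAFT00000000001 range=chr1:75336645-75504057
CGGTCTATCATGACTGTGTTCA
CACCGGCGAGGAGCTCGGCAGT
>canFam2_ensGene_ENSCAFT00000000009 range=chr1:117371505-117483397
GTGTCTGCGGGAGCCCTGACCC
EOF
cat > "$tmp/ids.txt" <<'EOF'
canFam2_ensGene_ENSCAFT00000000008__sherwood_1u__1
canFam2_ensGene_ENSCAFT00000000009__sherwood_1u__1
EOF

awk '
  # First file (ids.txt): remember each identifier, dropping the "__..." suffix.
  NR == FNR { sub(/__.*/, ""); seen[$0] = 1; next }
  # Header line of seqs.txt: decide whether this whole entry is kept.
  /^>/      { keep = !(substr($1, 2) in seen) }
  # Print the header and every following sequence line of kept entries.
  keep
' "$tmp/ids.txt" "$tmp/seqs.txt" > "$tmp/unmatched.txt"
unmatched=$(cat "$tmp/unmatched.txt")
printf '%s\n' "$unmatched"
rm -r "$tmp"
```

The keep flag is set on each header and stays in force until the next header, so the sequence lines follow the fate of their header without any need to join them first.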

danielbmartin · 07-11-2012, 08:47 PM

A proposed solution...

Code:

# InFile1 = identifier list, InFile2 = sequence file (apparently one record per line)
sed -r 's/__/~/1' "$InFile1" \
|cut -d~ -f1                 \
|sed 's/^/^>/'               \
|grep -v -f - "$InFile2"     \
> "$OutFile"

Results generated by this proposed solution...

Code:

>canFam2_ensGene_ENSCAFT000000000002 range=chr1:24823786-25526201 5'pad=0 3'pad=0 strand=- repeatMasking=noneTTTATTTCCCTGTGTTTCCTTGTTGCAGGTTTCCAGATTAAGCCTTTCACATCACTGCACTTCCTGTCCGAGCCTTCTGATGCTGTCACCATGCGGGGAG
>canFam2_ensGene_ENSCAFT00000000001 range=chr1:75336645-75504057 5'pad=0 3'pad=0 strand=- repeatMasking=noneCGGTCTATCATGACTGTGTTCAGGCAGGAAAACGTGGATGATTACTACGACACCGGCGAGGAGCTCGGCAGTGGGCAGTTTGCAGTCGTGAAGAAATGCCGGGAGAAGAGCACTGGTCTCCAGTATGCCGCCAAGTTTATCAAGAAGAGGCGGACGAAATCCAGCCGGCGGGGCGTGAGCCGCGAGGACATCGAGCGGGA

Daniel B. Martin
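
The results above show each header and its sequence on a single line, which suggests the sequence file was joined to one record per line before grep -v was applied. On the multi-line file as posted, grep -v would drop only the matching header lines and leave their sequence lines behind. Below is a sketch of one way to do that joining and undo it afterwards. It assumes the tilde never occurs in the data (asked earlier in the thread but not confirmed), that each identifier in a header is followed by a blank, and it uses hypothetical file names with abbreviated sample data.

```shell
tmp=$(mktemp -d)
cat > "$tmp/seqs.txt" <<'EOF'
>canFam2_ensGene_ENSCAFT00000000008 range=chr1:67099213-67141812
ACGCAGGTACAACTTGGCCAAG
TTTTGAATTCCTGGAAGAAAGA
>canFam2_ensGene_ENSCAFT00000000001 range=chr1:75336645-75504057
CGGTCTATCATGACTGTGTTCA
EOF
printf '%s\n' 'canFam2_ensGene_ENSCAFT00000000008__sherwood_1u__1' > "$tmp/ids.txt"

# 1. Join each entry onto one line, marking the original line breaks with "~".
awk '/^>/ { if (NR > 1) print ""; printf "%s", $0; next }
          { printf "~%s", $0 }
     END  { if (NR > 0) print "" }' "$tmp/seqs.txt" > "$tmp/joined.txt"

# 2. Build one "^>ID " pattern per identifier (the trailing blank avoids prefix matches).
sed 's/__.*//; s/^/^>/; s/$/ /' "$tmp/ids.txt" > "$tmp/patterns.txt"

# 3. Drop the matching one-line records, then turn "~" back into line breaks.
restored=$(grep -v -f "$tmp/patterns.txt" "$tmp/joined.txt" | tr '~' '\n')
printf '%s\n' "$restored"
rm -r "$tmp"
```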

grail · 07-11-2012, 11:58 PM

I wonder if I could be a pain and ask for the 2 original files. Dodgy data is fine as long as the format is exactly the same.

The reason I ask is that your newly formed and trimmed data still requires multiple manipulations of fields to coerce it towards the output.