[SOLVED] Search two text documents, eliminate matches, print third document with unmatched+.
The second document is a bit more complicated with an identifier, and up to a few thousand characters following, plus a lot of gibberish surrounding the identifier. Each line of interest with the identifier starts with a >. It looks something like this.
Code:
>somethingrandom_IDENT001 morerandom
ATCG up to a few thousand characters in length
>somethingrandom_IDENT002 morerandom
ATCG up to a few thousand characters in length
What I need to do then is match the identifiers from the first file to identifiers in the second file and then list all non-matched content into a new file, including everything between a set of >'s for non-matches.
Question: Do the identifiers always begin with IDENT, or could they be any character string?
Question: Are all identifiers of the same length?
Question: What is between the identifier and the gibberish? Tab character? Blank character? More than one of either?
File2 contains...
Code:
>somethingrandom_IDENT001 morerandom
ATCG up to a few thousand characters in length
>somethingrandom_IDENT002 morerandom
ATCG up to a few thousand characters in length
Question: Is the identifier always in the same character position? If so, what is that position?
Question: Does the identifier always lie between an underscore and a blank? If so, do you guarantee that "somethingrandom" does not contain an underscore?
Daniel B. Martin
Daniel has asked some good qualifying questions. I would add: what have you done in an attempt to solve this?
I would also add: have you searched these forums or Google at all? I ask because I have seen a number of solutions dealing with ATCG-type data (not sure of the field name), including lines starting with >. Try searching for questions by the user kmkocot.
Question: Do the identifiers always begin with IDENT, or could they be any character string? They always begin with the same prefix; the only thing that changes is the number, which counts from 1~2700.
Question: Are all identifiers of the same length? They are the same length.
Question: What is between the identifier and the gibberish? Tab character? Blank character? More than one of either? It's tab delimited between the identifier and the gibberish.
Question: Is the identifier always in the same character position? If so, what is that position? It is in the same position; the specific line of interest looks like this: ">canFam2_ensGene_ENSCAFT00000001581"
Question: Does the identifier always lie between an underscore and a blank? If so, do you guarantee that "somethingrandom" does not contain an underscore? Yes, but before the identifier there are underscores, as shown above.
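To make the header format from these answers concrete, here is a minimal Python sketch that pulls the identifier out of a header line. It assumes the identifier is whatever follows the last underscore and ends at the first tab or blank. The function name header_identifier and the trailing text "morerandom" are made up for the example; only the header prefix comes from the thread.

```python
def header_identifier(line):
    """Return the identifier from a '>' header line, or None for other lines."""
    if not line.startswith(">"):
        return None
    # Drop '>' and everything after the first run of whitespace
    # (the thread says a tab separates the identifier from the rest).
    fields = line[1:].split(None, 1)
    if not fields:
        return None
    # Assumption: the identifier is the text after the last underscore.
    return fields[0].rsplit("_", 1)[-1]


# Header prefix taken from the thread; the text after the tab is invented.
print(header_identifier(">canFam2_ensGene_ENSCAFT00000001581\tmorerandom"))
```

Splitting on the last underscore rather than the first is what copes with the extra underscores that come before the identifier.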
I've done a number of Google searches and searched these forums. I'm not very familiar with text file manipulation, so I found it hard to understand how to adapt much of what I found to this specific situation. Many people were only interested in viewing exact matches, and I couldn't find a solution that also kept the section of text following each non-matching identifier in the output. Thanks for pointing me to that user, I'll have a look.
Update:
I've been working a little with awk to try and compare the files. I removed all unwanted data from the first file, so now it looks as follows. I may also remove the sherwood_1u_1 part as it's no longer needed.
In trying to trim down the other file with awk I ran into a problem. I can't seem to get the sequence data following the header to be identified as one field by awk. Instead each line is a separate output, which won't quite work as there is a variable number of lines per entry, such as shown below.
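One way around the one-line-per-field problem, sketched here in Python rather than awk, is to treat every line starting with > as the start of a new record and join all following lines until the next >. This is only a sketch under that assumption; the name read_records and the sample lines are invented for illustration and are not the poster's data.

```python
def read_records(lines):
    """Yield (header, sequence) pairs, joining each record's sequence lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None and line.strip():
            # Any non-empty line after a header belongs to that record.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


# Invented sample: the first record spans two sequence lines, the second one.
sample = [
    ">somethingrandom_IDENT001\tmorerandom\n",
    "ATCGATCG\n",
    "GGCC\n",
    ">somethingrandom_IDENT002\tmorerandom\n",
    "TTAA\n",
]
records = list(read_records(sample))
for header, sequence in records:
    print(header, len(sequence))
```

Because records are delimited by the > lines rather than by line count, the variable number of sequence lines per entry no longer matters.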
You are simplifying the input files, and that's good.
We are shooting at a moving target, and that's bad.
When you settle down with a stable pair of input files,
post what they look like.
In the sample files already shown the identifiers appear to be in sorted order. May we count on that?
Daniel B. Martin
For the file containing just identifiers, they are in order. For the second file with sequences following the identifiers they are in a random order.
Are there any characters which your input files are guaranteed to not have? Tilde (~) or Backtick (`) perhaps?
When you settle down with a stable pair of input files, post what they look like.
Daniel B. Martin
This is what they'll look like, copied from the files, trimmed down and numbers changed to work for the example. I can't find an easy way to sort out the excess content in the second file, so I'm just leaving it for now.
Now you have nailed down the content and format of two test files. Good.
The goal is to identify mismatches based on the Identifier.
There are two possible kinds of mismatches:
- present in File1 and absent in File2
- present in File2 and absent in File1
Every line in file 2 should match a line in file 1. So I want the mismatches that are present in file 1, along with the sequence data that follows, but absent from file 2.
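The requirement as stated could be sketched in Python along the following lines. This is not the solution posted in the thread, just an illustration under several assumptions: the identifier list holds one bare identifier per line, the identifier in a header is the text after the last underscore and before the first tab or blank, and the sample data is invented. The names unmatched_records, id_lines and record_lines are hypothetical, and which input counts as "file 1" is left to the caller.

```python
import io


def unmatched_records(id_lines, record_lines):
    """Yield the lines of every record whose identifier is not in id_lines."""
    # A set lookup means neither input has to be sorted.
    known = {line.strip() for line in id_lines if line.strip()}
    keep = False
    for line in record_lines:
        if line.startswith(">"):
            # Assumption: identifier = text after the last underscore,
            # up to the first tab or blank.
            fields = line[1:].split(None, 1)
            ident = fields[0].rsplit("_", 1)[-1] if fields else ""
            keep = ident not in known
        if keep:
            # Header and all following sequence lines of a non-match.
            yield line


# Invented sample data standing in for the two input files.
ids = io.StringIO("IDENT001\nIDENT003\n")
records = io.StringIO(
    ">somethingrandom_IDENT001\tmorerandom\nATCG\n"
    ">somethingrandom_IDENT002\tmorerandom\nGGCC\nTTAA\n"
    ">somethingrandom_IDENT003\tmorerandom\nCCGG\n"
)
result = "".join(unmatched_records(ids, records))
print(result, end="")
```

With real files, the two StringIO objects would be replaced by open file handles and the yielded lines written to the third file.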