eliminate data with similar tags

fs11 · 04-25-2008, 03:17 PM

Hello All,

I have a small problem that I would like to solve with C/C++ code. Currently I am doing it with bash scripting, but that is too slow for the application and does not always give the desired results.

The problem is:

file 1:

Code:

>seq 1
agagugaiuididhiuhfdiuhfiu
>seq 2
sfdesdfdsfsdfdsfsdfd
>seq 3
hiodfhiodhfiodoijfdoj
>seq 4
jfiodjiosfdiojfiodjojfsjdo

file 2:

Code:

>seq1
agagugai-u---ididhiu--hfdi--uhfiu

>seq 2
sf-desdfdsf-s-dfd-sfs-dfd

Result after running the program:

Code:

>seq 3
hiodfhiodhfiodoijfdoj
>seq 4
jfiodjiosfdiojfiodjojfsjdo

i.e. The "tags" that are common are eliminated with their corresponding data and is stored in another file...

Any help would be highly appreciated.
Thanks in advance.

fs11 · 04-25-2008, 05:43 PM

Any help?

graemef · 04-25-2008, 06:30 PM

I think that you need to explain your needs a little better. Any advice about what you would require from the description given would just be a guess. Try and explain what you mean by a "tag" and outline the algorithm that you are after.

fs11 · 04-25-2008, 06:46 PM

By tag I mean the ">name-of-data" line at the start of each block of data.
The objective is to read 2 files and eliminate the tags that the 2 files have in common, along with their data. The new result is written to a new file.

For example:
In the example above, file 1 contains seq {1, 2, 3, 4} and file 2 contains seq {1, 2}. Therefore the 2 tags common to both files are seq {1, 2}, even though the data is a bit different. These 2 tags, along with their data, are eliminated and written to a new file.

fs11 · 04-26-2008, 01:56 PM

Any help?

osor · 04-26-2008, 08:12 PM

Quote:

Originally Posted by fs11

Any help?

What kind of help are you looking for? We are not a program-writing service. There are, however, a few ways to do this conceptually. In terms of implementation, it takes around 5 lines in awk or perl, but it will take a few more in C++ and quite a bit more in C.

One way to do it is this:

1. Read each record into memory.
2. Separate each record into a “tag” and a “data” portion.
3. In an associative array variable, if the tag used as a key already has an entry, assign a null value to that entry (but don’t delete it completely). Otherwise, create a new entry with the tag as the key and the data (or pointer to data) as the value.
4. After you are done reading records, iterate through the associative array and print the tag and data only if the data is non-null.

In many scripting and “glue” languages (awk, perl, python, etc.) associative arrays are built into the language. In C++, you can use std::map. In C, you do this manually with a hash function. The std::map version will probably not be dramatically faster than the scripted versions (if done correctly).
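The steps above can be sketched in C++ with std::map roughly as follows. This is only an illustration under some assumptions: the names `Entry` and `unique_records` are made up here, a record is taken to be a line starting with '>' followed by its data lines, and both input files are assumed to have been fed into one stream. Tags are compared byte for byte, so ">seq1" and ">seq 1" would count as different tags.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical value type: the data for a tag plus a "seen more than once" flag
// (the flag plays the role of the null value in the description above).
struct Entry {
    std::string data;
    bool duplicated = false;
};

// Hypothetical helper: returns only the records whose tag occurred exactly once.
std::string unique_records(std::istream& in) {
    std::map<std::string, Entry> table;   // tag line -> entry
    Entry* current = nullptr;             // entry of the record being read
    std::string line;

    while (std::getline(in, line)) {
        if (line.empty()) continue;       // ignore blank separator lines
        if (line[0] == '>') {
            auto result = table.emplace(line, Entry{});
            current = &result.first->second;
            if (!result.second) {         // tag already present: mark, do not erase
                current->duplicated = true;
                current->data.clear();
            }
        } else if (current && !current->duplicated) {
            current->data += line;
            current->data += '\n';
        }
    }

    std::ostringstream out;
    for (const auto& kv : table)
        if (!kv.second.duplicated)
            out << kv.first << '\n' << kv.second.data;
    return out.str();
}
```

Note that std::map iterates in sorted key order, so the surviving records come out ordered by tag rather than in input order.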

fs11 · 04-27-2008, 04:01 AM

Thanks for the pointers ...

osor · 04-27-2008, 12:23 PM

Quote:

Originally Posted by osor

Otherwise, create a new entry with the tag as the key and the data (or pointer to data) as the value.

I neglected to mention a two-pass variation on the above. In the first pass, the entry value is either set to true or false depending on whether it is the first or subsequent sighting of the tag. Then, on the second pass, the truth of the entry associated with a tag is consulted before deciding whether to print the tag and data.

The two-pass version will use less allocated memory, and will thus be faster in some cases (e.g., if you have very large files). On the other hand, you cannot use a two-pass version when you don’t have a seekable file (e.g., you are getting input from a pipe or something).

makyo · 04-27-2008, 01:00 PM

Hi.

My reading of the problem is that an entry will always be printed: if a singleton, then written to (say) STDOUT, if a multiple, then to a different file. This is based on the OP phrase:

Quote:

Originally Posted by fs11

The "tags" that are common are eliminated with their corresponding data and is stored in another file...

I have not thought through the idea of a second pass, but if the input is coming from STDIN, then one could test for that case and copy the input to a temporary file to prepare for the second pass ... cheers, makyo

fs11 · 04-27-2008, 06:22 PM

Hello again,

I have started writing the code. The idea is to build a (key, data) mapping for the two files, and then iterate over one of them to do the further processing.

However, I am having a problem that I have not been able to solve despite much effort.

Code:

#include <iostream>
#include <map>
#include <string>
#include <iomanip>
#include <fstream>
#include <vector>



using namespace std;

typedef std::map<std::string, std::string> TStrStrMap;
typedef std::pair<std::string, std::string> TStrStrPair;

int main(int argc, char *argv[])
{
        TStrStrMap tMap;

        vector<string> stringVector;
        string x,id,mystring,check,mystring1="";
        int count=-1,count1=0;
        ifstream inFile;

        inFile.open("testfile");
        if (!inFile) {
            cout << "Unable to open file";
            exit(1); // terminate with error
            }

        while (inFile >> x) {
            count++;
            stringVector.push_back(x);           //putting kmer value in vector
            //cout<< x << endl;

            mystring=stringVector[count];

            check=mystring.substr(0, 1);



           if (check ==">" ){         //if first char is the > then it is the name of the sequence
              id=mystring;
              cout << mystring << endl;
              //mystring1="";
              tMap.insert(TStrStrPair(id, mystring1));
              mystring1="";
              }


           else{
                mystring1=mystring1+x;
                cout << mystring1 << endl;

                }

####code truncated

The problem is that the mapping is not quite accurate.

For example, if the file has this data:

Code:

>SEQ2
AVVHGGFDGGGASEPAVEDQQYSAA
dfdfdsafdfadfsafdsaf
dfadsfsafdafdsafafdfsfs
dfadfsafdsfdsafasffdfdfdsafd
>SEQ3
RVVHGLLDGGGANEPAVEKQRYSPN
dfdafdafdfdfdfdafdfdfd
fdfdfdafdafdafafdfdfdsfd
>SEQ5
RVVSGVLDGPAANEPIIEMEKYSPN

>SEQ6
RVVSGVPEGSAAHAPIVEKDQTTPN

>SEQ10
KVVSGLLDGPSAHEPINPLETYESS

>SEQ12
RVVNGLLDGGTANQPIIERQKYSPH

>SEQ13
RVGSALLDGGTVDQPIIERQKYSPH

>SEQ15
RVGSALLNGGHVSQPIIAKQRYTPE

>SEQ17
RLVHSLLPGGTGNQPEIVRQQYAPH

The (key, data) mapping comes out wrong: for example, if >SEQ15 is looked up, the data for >SEQ13 (i.e. RVGSALLDGGTVDQPIIERQKYSPH) is reported.

Any pointers would be highly appreciated.
Thanks

osor · 04-27-2008, 09:10 PM

Quote:

Originally Posted by fs11

Any pointers would be highly appreciated.

I can’t seem to find the exact error, but you are making it too complicated (and verbose). Here is what your loop might look like.

Code:

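	std::string key, data;
	size_t pos;
	// Split the input on '>' so that each chunk is one record: the tag line
	// (without its leading '>') followed by the data lines.
	while(std::getline(inFile, data, '>')) {
		//skip the trivial or malformed cases (no newline means no tag line)
		if((pos=data.find('\n')) == std::string::npos)
			continue;

		key = data.substr(0, pos);
		if(tMap.find(key) == tMap.end())
			tMap[key] = data;	// first sighting: keep the whole record
		else
			tMap[key] = "";		// repeated tag: mark the entry as eliminated
	}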

Then, to print out the data, you could do

Code:

	TStrStrMap::iterator tMapIt;
	for(tMapIt = tMap.begin(); tMapIt != tMap.end(); ++tMapIt) {
		const std::string &data = tMapIt->second;
		// getline() consumed the leading '>', so put it back when printing
		if(!data.empty())
			std::cout << '>' << data;
	}

fs11 · 04-29-2008, 03:46 PM

Thanks a lot. I have got the program running.

Thanks again.