Old 04-25-2008, 03:17 PM   #1
fs11
Member
 
Registered: Aug 2006
Posts: 79

Rep: Reputation: 15
eliminate data with similar tags


Hello All,

I have a small problem that I would like to solve with C/C++ code. Currently I am doing it with bash scripting, but that is too slow for the application and does not always give the desired results.

The problem is:

file 1:
Code:
>seq 1
agagugaiuididhiuhfdiuhfiu
>seq 2
sfdesdfdsfsdfdsfsdfd
>seq 3
hiodfhiodhfiodoijfdoj
>seq 4
jfiodjiosfdiojfiodjojfsjdo

file 2:
Code:
>seq 1
agagugai-u---ididhiu--hfdi--uhfiu

>seq 2
sf-desdfdsf-s-dfd-sfs-dfd

Result after running the program:
Code:
>seq 3
hiodfhiodhfiodoijfdoj
>seq 4
jfiodjiosfdiojfiodjojfsjdo

i.e. The "tags" that are common are eliminated with their corresponding data and is stored in another file...


Any help would be highly appreciated.
Thanks in advance.
 
Old 04-25-2008, 05:43 PM   #2
fs11
Member
 
Registered: Aug 2006
Posts: 79

Original Poster
Rep: Reputation: 15
Any help
 
Old 04-25-2008, 06:30 PM   #3
graemef
Senior Member
 
Registered: Nov 2005
Location: Hanoi
Distribution: Fedora 13, Ubuntu 10.04
Posts: 2,379

Rep: Reputation: 148
I think that you need to explain your needs a little better. Any advice about what you would require from the description given would just be a guess. Try and explain what you mean by a "tag" and outline the algorithm that you are after.
 
Old 04-25-2008, 06:46 PM   #4
fs11
Member
 
Registered: Aug 2006
Posts: 79

Original Poster
Rep: Reputation: 15
By "tag" I mean the ">name-of-data" line at the start of every record.
The objective is to read the 2 files and eliminate the tags that they have in common, along with the data belonging to those tags. The result is written to a new file.

For example:
In the example above, file 1 contains seq {1, 2, 3, 4} and file 2 contains seq {1, 2}. The tags common to both files are therefore seq {1, 2}, even though their data differs slightly. These 2 tags are eliminated along with their data, and the result is written to a new file.
 
Old 04-26-2008, 01:56 PM   #5
fs11
Member
 
Registered: Aug 2006
Posts: 79

Original Poster
Rep: Reputation: 15
Any Help
 
Old 04-26-2008, 08:12 PM   #6
osor
HCL Maintainer
 
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450

Rep: Reputation: 78
Quote:
Originally Posted by fs11
Any Help
What kind of help are you looking for? We are not a program-writing service. There are, however, a few ways to do this conceptually. In terms of implementation, it takes around 5 lines in awk or perl, but it will take a few more in C++ and quite a bit more in C.

One way to do it is this:
  • Read each record into memory.
  • Separate each record into a “tag” and a “data” portion.
  • In an associative array variable, if the tag used as a key already has an entry, assign a null value to that entry (but don’t delete it completely). Otherwise, create a new entry with the tag as the key and the data (or pointer to data) as the value.
  • After you are done reading records, iterate through the associative array and print the tag and data only if the data is non-null.
In many scripting and “glue” languages (awk, perl, python, etc.) associative arrays are built into the language. In C++, you can use std::map. In C, you have to build this manually (e.g., a hash table with your own hash function). The std::map version will probably not be much faster than the scripted versions (if done correctly).
 
Old 04-27-2008, 04:01 AM   #7
fs11
Member
 
Registered: Aug 2006
Posts: 79

Original Poster
Rep: Reputation: 15
Thanks for the pointers ...
 
Old 04-27-2008, 12:23 PM   #8
osor
HCL Maintainer
 
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450

Rep: Reputation: 78
Quote:
Originally Posted by osor
Otherwise, create a new entry with the tag as the key and the data (or pointer to data) as the value.
I neglected to mention a two-pass variation on the above. In the first pass, the entry value is either set to true or false depending on whether it is the first or subsequent sighting of the tag. Then, on the second pass, the truth of the entry associated with a tag is consulted before deciding whether to print the tag and data.

The two-pass version will use less allocated memory, and will thus be faster in some cases (e.g., if you have very large files). On the other hand, you cannot use a two-pass version when you don’t have a seekable file (e.g., you are getting input from a pipe or something).
 
Old 04-27-2008, 01:00 PM   #9
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735

Rep: Reputation: 76
Hi.

My reading of the problem is that an entry will always be printed: if a singleton, then written to (say) STDOUT, if a multiple, then to a different file. This is based on the OP phrase:
Quote:
Originally Posted by fs11
The "tags" that are common are eliminated with their corresponding data and is stored in another file...
I have not thought through the idea of a second pass, but if the input is coming from STDIN, then one could test for that case and copy the input to a temporary file to prepare for the second pass ... cheers, makyo
 
Old 04-27-2008, 06:22 PM   #10
fs11
Member
 
Registered: Aug 2006
Posts: 79

Original Poster
Rep: Reputation: 15
Hello again,

I have started writing the code. The idea is to build a (key, data) mapping for the two files; one of the mappings can then be iterated over for further processing.

However, I have a problem that I have not been able to solve despite much effort.


Code:
#include <iostream>
#include <map>
#include <string>
#include <iomanip>
#include <fstream>
#include <vector>



using namespace std;

typedef std::map<std::string, std::string> TStrStrMap;
typedef std::pair<std::string, std::string> TStrStrPair;

int main(int argc, char *argv[])
{
        TStrStrMap tMap;

        vector<string> stringVector;
        string x,id,mystring,check,mystring1="";
        int count=-1,count1=0;
        ifstream inFile;

        inFile.open("testfile");
        if (!inFile) {
            cout << "Unable to open file";
            exit(1); // terminate with error
            }

        while (inFile >> x) {
            count++;
            stringVector.push_back(x);           //putting kmer value in vector
            //cout<< x << endl;

            mystring=stringVector[count];

            check=mystring.substr(0, 1);



           if (check ==">" ){         //if first char is the > then it is the name of the sequence
              id=mystring;
              cout << mystring << endl;
              //mystring1="";
              tMap.insert(TStrStrPair(id, mystring1));
              mystring1="";
              }


           else{
                mystring1=mystring1+x;
                cout << mystring1 << endl;

                }

####code truncated

The problem is that the mapping is not accurate.

For example, if the file contains this data:

Code:
>SEQ2
AVVHGGFDGGGASEPAVEDQQYSAA
dfdfdsafdfadfsafdsaf
dfadsfsafdafdsafafdfsfs
dfadfsafdsfdsafasffdfdfdsafd
>SEQ3
RVVHGLLDGGGANEPAVEKQRYSPN
dfdafdafdfdfdfdafdfdfd
fdfdfdafdafdafafdfdfdsfd
>SEQ5
RVVSGVLDGPAANEPIIEMEKYSPN

>SEQ6
RVVSGVPEGSAAHAPIVEKDQTTPN

>SEQ10
KVVSGLLDGPSAHEPINPLETYESS

>SEQ12
RVVNGLLDGGTANQPIIERQKYSPH

>SEQ13
RVGSALLDGGTVDQPIIERQKYSPH

>SEQ15
RVGSALLNGGHVSQPIIAKQRYTPE

>SEQ17
RLVHSLLPGGTGNQPEIVRQQYAPH

The (key, data) mapping is off: for example, if >SEQ15 is looked up, the data for >SEQ13 (i.e., RVGSALLDGGTVDQPIIERQKYSPH) is reported.

Any pointers would be highly appreciable.
Thanks
 
Old 04-27-2008, 09:10 PM   #11
osor
HCL Maintainer
 
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450

Rep: Reputation: 78
Quote:
Originally Posted by fs11
Any pointers would be highly appreciable.
I can’t seem to find the exact error, but you are making it too complicated (and verbose). Here is what your loop might look like.
Code:
	std::string key, data;
	size_t pos;
	while(std::getline(inFile, data, '>')) {
		//skip the trivial or malformed cases
		if((pos=data.find('\n')) == std::string::npos)
			continue;

		key = data.substr(0, pos);
		if(tMap.find(key) == tMap.end())
			tMap[key] = data;
		else
			tMap[key] = "";
	}
Then, to print out the data, you could do
Code:
	TStrStrMap::iterator tMapIt;
	for(tMapIt = tMap.begin(); tMapIt != tMap.end(); ++tMapIt) {
		const std::string &data = tMapIt->second;
		//getline() consumed the leading '>', so put it back
		if(!data.empty())
			std::cout << '>' << data;
	}

Last edited by osor; 04-27-2008 at 09:11 PM.
 
Old 04-29-2008, 03:46 PM   #12
fs11
Member
 
Registered: Aug 2006
Posts: 79

Original Poster
Rep: Reputation: 15
Thanks a lot. I have got the program running.

Thanks again.
 
  


All times are GMT -5.