I have a small problem that I would like to solve with C/C++ code. Currently, I am doing it with bash scripting, but it is too slow for the application and does not always give the desired results.
I think that you need to explain your needs a little better. Any advice about what you would require from the description given would just be a guess. Try and explain what you mean by a "tag" and outline the algorithm that you are after.
By tag I mean the ">name-of-data" line at the start of every data record.
The objective is to read 2 files and eliminate the tags that the 2 files have in common, along with their data. The new result is written to newfile.
For example:
In the above example, file1 contains seq {1234} and file2 has seq {12}. Therefore the 2 tags common to the 2 files are seq {12}, though the data is a bit different. These 2 tags, along with their data, are eliminated and written to a newfile.
What kind of help are you looking for? We are not a program-writing service. There are, however, a few ways to do this conceptually. In terms of implementation, it takes around 5 lines in awk or perl, but it will take a few more in C++ and quite a bit more in C.
One way to do it is this:
1. Read each record into memory.
2. Separate each record into a “tag” and a “data” portion.
3. In an associative array variable, if the tag used as a key already has an entry, assign a null value to that entry (but don’t delete it completely). Otherwise, create a new entry with the tag as the key and the data (or pointer to data) as the value.
4. After you are done reading records, iterate through the associative array and print the tag and data only if the data is non-null.
In many scripting and “glue” languages (awk, perl, python, etc.) associative arrays are built into the language. In C++, you can use std::map. In C, you do this manually with a hash function. The std::map version will probably not be much faster than the scripted versions (if done correctly).
Otherwise, create a new entry with the tag as the key and the data (or pointer to data) as the value.
I neglected to mention a two-pass variation on the above. In the first pass, the entry value is either set to true or false depending on whether it is the first or subsequent sighting of the tag. Then, on the second pass, the truth of the entry associated with a tag is consulted before deciding whether to print the tag and data.
The two-pass version will use less allocated memory, and will thus be faster in some cases (e.g., if you have very large files). On the other hand, you cannot use a two-pass version when you don’t have a seekable file (e.g., you are getting input from a pipe or something).
My reading of the problem is that an entry will always be printed: if a singleton, then written to (say) STDOUT, if a multiple, then to a different file. This is based on the OP phrase:
Quote:
Originally Posted by fs11
The "tags" that are common are eliminated with their corresponding data and is stored in another file...
I have not thought through the idea of a second pass, but if the input is coming from STDIN, then one could test for that case and copy the input to a temporary file to prepare for the second pass ... cheers, makyo
I have started writing the code. The idea is to make a (key,data) mapping for the two files and then one of them may be iterated to do the further processing.
However, I have run into a problem that I have not been able to solve after much effort.
Code:
#include <cstdlib>   // exit
#include <iostream>
#include <map>
#include <string>
#include <iomanip>
#include <fstream>
#include <vector>
using namespace std;

typedef std::map<std::string, std::string> TStrStrMap;
typedef std::pair<std::string, std::string> TStrStrPair;

int main(int argc, char *argv[])
{
    TStrStrMap tMap;
    vector<string> stringVector;
    string x, id, mystring, check, mystring1 = "";
    int count = -1, count1 = 0;

    ifstream inFile;
    inFile.open("testfile");
    if (!inFile) {
        cout << "Unable to open file";
        exit(1); // terminate with error
    }

    // read the file one whitespace-separated token at a time
    while (inFile >> x) {
        count++;
        stringVector.push_back(x); //putting kmer value in vector
        //cout<< x << endl;
        mystring = stringVector[count];
        check = mystring.substr(0, 1);
        if (check == ">") { //if first char is the > then it is the name of the sequence
            id = mystring;
            cout << mystring << endl;
            //mystring1="";
            tMap.insert(TStrStrPair(id, mystring1));
            mystring1 = "";
        }
        else {
            mystring1 = mystring1 + x;
            cout << mystring1 << endl;
        }
####code truncated
The problem is that the mapping is not quite accurate.