Below is the file format from which I need to parse some information. I have not been able to figure out how to do it, so any help would be appreciated.
The information that needs to be parsed is the value of each of the following variables:
1) Score
2) Expect
3) Identities
4) Positives
5) Gaps
6) Length
each time they appear.
I have increased the font size of these variables in the file at their first occurrence for better visibility.
I need to store this information in some kind of array so that I can process it as needed.
FROM HERE THE .txt FILE STARTS
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
This task will require a lot of string processing. Are you sure you need to do this in C++? There are good reasons to use tools such as perl and awk, which make this task almost trivial. Are you acquainted with regular expressions as a general concept? There are regex libraries for C, which I assume means C++ also. Alternatively, a parser created with (f)lex can be generated fairly easily for your task. That would probably be my first choice.
A sample of what you want the output to look like would be helpful & less ambiguous.
--- rod.
The main question is how does it need to be represented in memory in order for you to work with it? In other words, will all parsed data exist at one time and be processed somewhat as a whole, or will the parser process one segment at a time and move onto the next?
I wrote a data parsing library in C++ a while ago. It assembles data into a hierarchy for the purposes of importing and exporting easily, but requires a parallel interface based on the requirement in order to manipulate the data. The whole thing is a lot more complex than is practical for your application, I'm afraid.
I'd actually recommend using libc functions such as getline, strtok, and strcmp if you can get away with dumping each line right after you've dealt with it. If you actually need the document represented in memory then that's an entirely different problem, but I may be of help in that area, also.
ta0kira
Using regular expressions will definitely help here. One of the reasons perl makes this sort of task extremely quick and simple is the tight integration of regex functionality right in the language.
Since you didn't specify that you don't want to use Qt, I'll make you a little example using Qt.
Code:
#include <QCoreApplication>
#include <QFile>
#include <QString>
#include <QRegExp>
#include <iostream>
using namespace std;

int main(int argc, char** argv)
{
    // A named object is needed; a bare QCoreApplication(argc, argv); does not create one.
    QCoreApplication app(argc, argv);
    QFile file("data");
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
    {
        cerr << "could not open file \"data\" for reading" << endl;
        return 1;
    }
    // Build the regular expression once, outside the loop.
    QRegExp rx("^Length\\s*=\\s*(\\d+)\\s*$");
    while (!file.atEnd()) {
        QString line = file.readLine();
        if (rx.indexIn(line) >= 0)
        {
            // cap(1) is the text matched by the first (...) group.
            cout << "Found length: " << qPrintable(rx.cap(1)) << endl;
            cout << "What you do with it is your business..." << endl;
        }
    }
    file.close();
    return 0;
}
This example just matches one of the things you are interested in, and I don't bother with storing the results in an array. How that is done will depend on the rest of your program. This just shows an example of a regex library in use. Qt makes life quite easy for you.
The key thing in the example above is the regular expression itself:
Code:
"^Length\\s*=\\s*(\\d+)\\s*$"
There are a few things to understand here.
The regular expression is passed to the QRegExp class as a string. Regular expressions, and especially the Perl-style extensions that may be used with the QRegExp class, make heavy use of the backslash character. This is a little unfortunate, as string literals in C and C++ programs need to escape each backslash with another backslash... hence the ugly double backslashes. Ho hum.
Parts of a regular expression in (parentheses) will be stored if there is a match, and these sub-patterns may be retrieved with the cap member function.
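For comparison, here is a minimal sketch of the same capture-group idea without Qt, using std::regex from C++11 and storing the results in a std::vector. The Hit struct, the parse_hits function and the exact line layouts it matches are assumptions made for illustration (the sample file's alignment lines are not shown in this thread), so the patterns will probably need adjusting to the real data.

```cpp
#include <istream>
#include <regex>
#include <string>
#include <vector>

// One record per Score line; the field names are illustrative only.
struct Hit {
    std::string length;      // last "Length = N" seen before this record
    std::string score;
    std::string expect;
    std::string identities;
    std::string positives;
    std::string gaps;
};

// Assumed layout: a "Length = N" line, then for each alignment a line with
// "Score = ..., Expect = ..." followed by a line with
// "Identities = a/b ..., Positives = c/d ..., Gaps = e/f ...".
std::vector<Hit> parse_hits(std::istream& in)
{
    // Raw string literals (C++11) avoid the doubled backslashes.
    static const std::regex length_rx(R"(Length\s*=\s*(\d+))");
    static const std::regex score_rx(R"(Score\s*=\s*(\S+).*Expect\s*=\s*([^\s,]+))");
    static const std::regex ident_rx(R"(Identities\s*=\s*(\d+/\d+))");
    static const std::regex pos_rx(R"(Positives\s*=\s*(\d+/\d+))");
    static const std::regex gaps_rx(R"(Gaps\s*=\s*(\d+/\d+))");

    std::vector<Hit> hits;
    std::string line, last_length;
    std::smatch m;
    while (std::getline(in, line)) {
        if (std::regex_search(line, m, length_rx)) {
            last_length = m.str(1);          // remember it for later records
        } else if (std::regex_search(line, m, score_rx)) {
            hits.push_back(Hit());           // a Score line starts a new record
            hits.back().length = last_length;
            hits.back().score = m.str(1);    // first (...) group
            hits.back().expect = m.str(2);   // second (...) group
        } else if (!hits.empty()) {
            // Remaining fields are attached to the most recent record.
            if (std::regex_search(line, m, ident_rx)) hits.back().identities = m.str(1);
            if (std::regex_search(line, m, pos_rx)) hits.back().positives = m.str(1);
            if (std::regex_search(line, m, gaps_rx)) hits.back().gaps = m.str(1);
        }
    }
    return hits;
}
```

Here m.str(1) plays the role of rx.cap(1) in QRegExp. The values are kept as strings on purpose, since how they should be converted (for example an Expect value written in exponent notation) depends on what the rest of the program does with them.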
The whole Qt program above can be written in Perl like this:
Code:
#!/usr/bin/perl -n
if ( /^Length\s*=\s*(\d+)\s*$/ ) {
    print "Found length: $1\nWhat you do with it is your business...\n";
}
This is a simple example, done all in C. C++ is also supported by flex. You haven't said what you want to do with the parser output, so the trivial case is to echo to standard output.
--- rod.
Thanks a lot for your replies, guys. I am currently stuck on some other urgent work, so it will be 2-3 days before I can start on this again. Until then I won't be able to look closely at your replies. I will get back to you after 2-3 days.