LinuxQuestions.org
Old 11-21-2007, 12:36 AM   #1
abhisheknayak
LQ Newbie
 
Registered: Nov 2007
Posts: 6

Rep: Reputation: 0
how to parse info from a txt file using c++ code


Hi guys,

Below is the file format from which I need to parse some information. I am not able to figure out how to do it, so I would appreciate it if any of you could help me out.

The information that needs to be parsed is the value of each of the following variables, each time it appears:

1) Score
2) Expect
3) Identities
4) Positives
5) Gaps
6) Length

I have increased the font size of these variables at their first occurrence in the file for better visibility.

I need to store this information in some kind of array so that I can process it as needed.






FROM HERE THE .txt FILE STARTS

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Query= NC_004314.fna|:20422-20919translated
(166 letters)

Database: /home/condor/c_share/nr_db_16_08_2007/nr
5,371,018 sequences; 1,856,561,620 total letters

Searching..................................................done

Score E
Sequences producing significant alignments: (bits) Value

ref|YP_720015.1| Rho termination factor-like [Trichodesmium eryt... 75 1e-12
ref|XP_001092616.1| PREDICTED: similar to heterogeneous nuclear ... 70 3e-11
ref|XP_763041.1| hypothetical protein TP03_0022 [Theileria parva... 67 3e-10
ref|ZP_01039068.1| outer membrane protein [Erythrobacter sp. NAP... 65 1e-09
ref|ZP_01255327.1| Hep_Hag family protein [Psychroflexus torquis... 63 5e-09
ref|ZP_01624364.1| hypothetical protein L8106_17907 [Lyngbya sp.... 62 1e-08
ref|XP_001128706.1| PREDICTED: hypothetical protein [Homo sapiens] 57 5e-07
ref|XP_001346095.1| PREDICTED: similar to kinesin-like protein, ... 55 1e-06
ref|ZP_01701151.1| Haemagluttinin domain protein [Escherichia co... 54 3e-06
ref|ZP_00736637.1| COG5295: Autotransporter adhesin [Escherichia... 54 3e-06
ref|ZP_00707581.1| COG5295: Autotransporter adhesin [Escherichia... 54 3e-06
ref|XP_385408.1| hypothetical protein FG05232.1 [Gibberella zeae... 54 3e-06
ref|XP_001348709.1| hypothetical protein PF14_0535 [Plasmodium f... 54 4e-06
ref|XP_001236219.1| PREDICTED: hypothetical protein, partial [Ga... 53 5e-06
ref|XP_765625.1| hypothetical protein TP01_0098 [Theileria parva... 53 5e-06
ref|XP_001071354.1| PREDICTED: similar to DNA-directed RNA polym... 53 7e-06

>ref|YP_720015.1| Rho termination factor-like [Trichodesmium erythraeum IMS101]
gb|ABG49542.1| Rho termination factor-like [Trichodesmium erythraeum IMS101]
Length = 731

Score = 75.1 bits (183), Expect = 1e-12
Identities = 44/161 (27%), Positives = 75/161 (46%), Gaps = 3/161 (1%)

Query: 3 VGPKIPNIGPNVANIGLNLTNIGPKIMKVDPEIPNVGNIVLRLTNIGPKVMKVGPQILNV 62
VG +G + +GLN + +G V VG L + +G VG V
Sbjct: 358 VGLNASGVGLTASGVGLNASGVGLTASGVGLNASGVG---LTASGVGLNASGVGLTASGV 414

Query: 63 GPSVGNIGLKVTNVGPNIPNIGPNITNLGPNYSKVGLKVTNIGPNIMKVDPEIRNIDPNI 122
G + +GL + VG N +G + +G N S VGL + +G N+ V + N+
Sbjct: 415 GLNASGVGLTASGVGLNASGVGLTASGMGMNMSGVGLTASGMGMNMSGVGLTASGVGMNM 474

Query: 123 TNIGLNLSNIGPNITNVGLKLTKLGSKLTNLGHKGTSVGLN 163
+ +GL S +G N++ VGL + +G ++ +G + +G+N
Sbjct: 475 SGVGLTASGMGMNMSGVGLTASGMGMNMSGVGLTASGMGMN 515



Score = 75.1 bits (183), Expect = 1e-12
Identities = 42/161 (26%), Positives = 79/161 (48%), Gaps = 3/161 (1%)

Query: 3 VGPKIPNIGPNVANIGLNLTNIGPKIMKVDPEIPNVGNIVLRLTNIGPKVMKVGPQILNV 62
VG +G + +GLN + +G + + VG L + +G + VG V
Sbjct: 414 VGLNASGVGLTASGVGLNASGVGLTASGMGMNMSGVG---LTASGMGMNMSGVGLTASGV 470

Query: 63 GPSVGNIGLKVTNVGPNIPNIGPNITNLGPNYSKVGLKVTNIGPNIMKVDPEIRNIDPNI 122
G ++ +GL + +G N+ +G + +G N S VGL + +G N+ V + N+
Sbjct: 471 GMNMSGVGLTASGMGMNMSGVGLTASGMGMNMSGVGLTASGMGMNMSGVGLTASGMGMNM 530

Query: 123 TNIGLNLSNIGPNITNVGLKLTKLGSKLTNLGHKGTSVGLN 163
+ +GL S +G N++ VGL + +G ++ +G + +G+N
Sbjct: 531 SGVGLTASGMGMNMSGVGLTASGMGMNMSGVGLTASGMGMN 571



Score = 75.1 bits (183), Expect = 1e-12
Identities = 43/161 (26%), Positives = 76/161 (46%), Gaps = 3/161 (1%)

Query: 3 VGPKIPNIGPNVANIGLNLTNIGPKIMKVDPEIPNVGNIVLRLTNIGPKVMKVGPQILNV 62
VG +G + +GLN + +G V VG L + +G VG V
Sbjct: 372 VGLNASGVGLTASGVGLNASGVGLTASGVGLNASGVG---LTASGVGLNASGVGLTASGV 428

Query: 63 GPSVGNIGLKVTNVGPNIPNIGPNITNLGPNYSKVGLKVTNIGPNIMKVDPEIRNIDPNI 122
G + +GL + +G N+ +G + +G N S VGL + +G N+ V + N+
Sbjct: 429 GLNASGVGLTASGMGMNMSGVGLTASGMGMNMSGVGLTASGVGMNMSGVGLTASGMGMNM 488

Query: 123 TNIGLNLSNIGPNITNVGLKLTKLGSKLTNLGHKGTSVGLN 163
+ +GL S +G N++ VGL + +G ++ +G + +G+N
Sbjct: 489 SGVGLTASGMGMNMSGVGLTASGMGMNMSGVGLTASGMGMN 529



Score = 74.7 bits (182), Expect = 2e-12
Identities = 40/161 (24%), Positives = 81/161 (49%), Gaps = 3/161 (1%)

Query: 3 VGPKIPNIGPNVANIGLNLTNIGPKIMKVDPEIPNVGNIVLRLTNIGPKVMKVGPQILNV 62
VG + +G + +G+N++ +G + + VG L + +G + VG +
Sbjct: 470 VGMNMSGVGLTASGMGMNMSGVGLTASGMGMNMSGVG---LTASGMGMNMSGVGLTASGM 526

Query: 63 GPSVGNIGLKVTNVGPNIPNIGPNITNLGPNYSKVGLKVTNIGPNIMKVDPEIRNIDPNI 122
G ++ +GL + +G N+ +G + +G N S VGL + +G N+ V + N+
Sbjct: 527 GMNMSGVGLTASGMGMNMSGVGLTASGMGMNMSGVGLTASGMGMNMSGVGLTASGMGMNM 586

Query: 123 TNIGLNLSNIGPNITNVGLKLTKLGSKLTNLGHKGTSVGLN 163
+ +GL S +G N++ VGL + +G ++ +G + +G+N
Sbjct: 587 SGVGLTASGMGMNMSGVGLTASGMGMNMSGVGLTASGMGMN 627
 
Old 11-21-2007, 06:21 PM   #2
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,395
Blog Entries: 2

Rep: Reputation: 903
This task will require a lot of string processing. Are you sure you need to do this in C++? There are good reasons to use tools such as perl and awk, which make this task almost trivial. Are you acquainted with regular expressions as a general concept? There are regex libraries for C, which I assume means C++ also. Alternatively, a parser created with (f)lex can be generated fairly easily for your task. That would probably be my first choice.
A sample of what you want the output to look like would be helpful & less ambiguous.
--- rod.

Last edited by theNbomr; 11-21-2007 at 06:23 PM.
 
Old 11-21-2007, 11:52 PM   #3
abhisheknayak
LQ Newbie
 
Registered: Nov 2007
Posts: 6

Original Poster
Rep: Reputation: 0
Thanks for the reply. I will look into your suggestion and get back to you if I have any questions. But one thing is for sure: I need to do this in C++.
 
Old 11-22-2007, 12:24 AM   #4
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
The main question is how does it need to be represented in memory in order for you to work with it? In other words, will all parsed data exist at one time and be processed somewhat as a whole, or will the parser process one segment at a time and move onto the next?

I wrote a data parsing library in C++ a while ago. It assembles data into a hierarchy for the purposes of importing and exporting easily, but requires a parallel interface based on the requirement in order to manipulate the data. The whole thing is a lot more complex than is practical for your application, I'm afraid.

I'd actually recommend using libc functions such as getline, strtok, and strcmp if you can get away with dumping each line right after you've dealt with it. If you actually need the document represented in memory then that's an entirely different problem, but I may be of help in that area, also.
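A minimal sketch of that line-at-a-time approach, using std::getline and std::string::find in place of the libc strtok/strcmp calls. The helper names field_value and collect are made up for illustration, and the sketch assumes each field appears as "Name = value", terminated by a comma or the end of the line, as in the sample file.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: find "<key> ... =" on this line and copy the text
// after the '=' (up to the next comma or the end of the line) into `value`,
// with surrounding whitespace trimmed. Returns false if the key or the '='
// is missing.
bool field_value(const std::string& line, const std::string& key,
                 std::string& value)
{
    std::string::size_type pos = line.find(key);
    if (pos == std::string::npos) return false;
    pos = line.find('=', pos + key.size());
    if (pos == std::string::npos) return false;

    std::string::size_type end = line.find(',', pos);
    value = (end == std::string::npos) ? line.substr(pos + 1)
                                       : line.substr(pos + 1, end - pos - 1);

    // Trim leading and trailing blanks (and a stray carriage return).
    std::string::size_type first = value.find_first_not_of(" \t\r");
    std::string::size_type last  = value.find_last_not_of(" \t\r");
    value = (first == std::string::npos)
                ? std::string()
                : value.substr(first, last - first + 1);
    return true;
}

// Read the stream one line at a time, keep only the values found for `key`,
// and discard each line once it has been examined.
std::vector<std::string> collect(std::istream& in, const std::string& key)
{
    std::vector<std::string> values;
    std::string line, value;
    while (std::getline(in, line))
        if (field_value(line, key, value))
            values.push_back(value);
    return values;
}
```

Passing an open std::ifstream and a key such as "Expect" to collect would give one string per occurrence. Nothing else from the file stays in memory.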
ta0kira
 
Old 11-22-2007, 02:14 AM   #5
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: FreeBSD, Debian, Mint, Puppy
Posts: 3,282

Rep: Reputation: 172
Look into regex(3) (man 3 regex), a C library.

Why must you use C++? It's daft and a lot of work.

I suppose your project manager says so.

It's like saying you want to saw a log with a breadknife:
it's good for bread, but not a good tool for logs.
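For what it's worth, the POSIX regex(3) functions can be called straight from C++. The sketch below is only an illustration: the helper name match_length is invented, and it assumes a POSIX system that provides <regex.h>. POSIX extended syntax has no \d or \s, so bracket expressions are used instead.

```cpp
#include <regex.h>   // POSIX regcomp / regexec / regfree
#include <string>

// Hypothetical helper: if `line` looks like "Length = 731", copy the digits
// into `out` and return true; otherwise return false.
bool match_length(const std::string& line, std::string& out)
{
    regex_t re;
    // Compiled on every call to keep the sketch short; a real program
    // would compile the pattern once and reuse it.
    if (regcomp(&re,
                "^[[:space:]]*Length[[:space:]]*=[[:space:]]*([0-9]+)",
                REG_EXTENDED) != 0)
        return false;

    regmatch_t m[2];   // m[0] is the whole match, m[1] the parenthesised group
    bool ok = (regexec(&re, line.c_str(), 2, m, 0) == 0);
    if (ok)
        out = line.substr(m[1].rm_so, m[1].rm_eo - m[1].rm_so);

    regfree(&re);
    return ok;
}
```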
 
Old 11-22-2007, 02:44 AM   #6
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 62
Using regular expressions will definitely help here. One of the reasons perl makes this sort of task extremely quick and simple is the tight integration of regex functionality right in the language.

Since you didn't specify that you don't want to use Qt, I'll make you a little example using Qt.

Code:
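#include <QCoreApplication>
#include <QFile>
#include <QString>
#include <QRegExp>
#include <iostream>

using namespace std;


int main(int argc, char** argv)
{
        // A named instance is needed here; a bare QCoreApplication(argc, argv)
        // only creates a temporary that is destroyed at once.
        QCoreApplication app(argc, argv);

        QFile file("data");
        if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
        {
                cerr << "could not open file \"data\" for reading" << endl;
                return 1;
        }

        // Build the pattern once, outside the read loop.
        // Group 1 captures the digits that follow "Length =".
        QRegExp rx("^Length\\s*=\\s*(\\d+)\\s*$");

        while (!file.atEnd()) {
                // readLine() keeps the trailing newline; the \\s*$ absorbs it.
                QString line = file.readLine();

                if (rx.indexIn(line)>=0)
                {
                        cout << "Found length: " << qPrintable(rx.cap(1)) << endl;
                        cout << "What you do with it is your business..." << endl;
                }

        }

        file.close();
        return 0;
}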
This example just matches one of the things you are interested in, and I don't bother with storing the results in an array - how that is done will depend on the rest of your program. This just shows an example of a regex library in use. Qt makes life quite easy for you.
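If storing the results is the sticking point, here is one possible sketch. It swaps QRegExp for the standard std::regex (C++11) so that it builds without Qt, and it is not the method used above. The Hit struct and parse_hits function are invented names, and the sketch assumes Identities, Positives and Gaps always share one line, as they do in the sample file.

```cpp
#include <istream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record: one entry per "Score = ..." block in the report.
struct Hit {
    double      score_bits;   // 75.1 in "Score = 75.1 bits (183)"
    std::string expect;       // kept as text, e.g. "1e-12"
    int         identities;   // 44 in "Identities = 44/161 (27%)"
    int         positives;
    int         gaps;
};

std::vector<Hit> parse_hits(std::istream& in)
{
    static const std::regex score_rx(
        R"(Score\s*=\s*([0-9.]+)\s*bits.*Expect\s*=\s*(\S+))");
    static const std::regex ident_rx(
        R"(Identities\s*=\s*(\d+)/\d+.*Positives\s*=\s*(\d+)/\d+.*Gaps\s*=\s*(\d+)/\d+)");

    std::vector<Hit> hits;
    std::string line;
    std::smatch m;
    while (std::getline(in, line)) {
        if (std::regex_search(line, m, score_rx)) {
            Hit h{};                         // a Score line starts a new record
            h.score_bits = std::stod(m[1].str());
            h.expect     = m[2].str();
            hits.push_back(h);
        } else if (!hits.empty() && std::regex_search(line, m, ident_rx)) {
            // The Identities line completes the most recent record.
            hits.back().identities = std::stoi(m[1].str());
            hits.back().positives  = std::stoi(m[2].str());
            hits.back().gaps       = std::stoi(m[3].str());
        }
    }
    return hits;
}
```

The returned std::vector plays the role of the "array": hits[0] is the first alignment, hits[1] the second, and so on.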
 
Old 11-22-2007, 02:52 AM   #7
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 62
The key thing in the example above is the regular expression itself:
Code:
"^Length\\s*=\\s*(\\d+)\\s*$"
There are a few things to understand here.
  • The regular expression is passed to the QRegExp class as a string. Regular expressions - especially the perl-style extensions which may be used with the QRegExp class - make heavy use of the backslash character. This is a little unfortunate, as string literals in C and C++ programs need to escape each backslash with another backslash... hence the ugly double backslashes. Ho hum.
  • Parts of a regular expression in (brackets) will be stored if there is a match, and these sub-patterns may be retrieved with the cap member function: cap(0) is the whole match and cap(1) is the first bracketed part.
  • This whole program can be written in perl like this:
    Code:
    #!/usr/bin/perl -n
    
    if ( /^Length\s*=\s*(\d+)\s*$/ ) {
        print "Found length: $1\nWhat you do with it is your business...\n";
    }
  • Perl is great for this sort of thing
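As a side note on the double backslashes: C++11 raw string literals avoid them. The sketch below uses std::regex rather than QRegExp so that it stands alone, and length_of is an invented helper name. It shows that both spellings produce the same pattern, and how the bracketed sub-pattern is retrieved.

```cpp
#include <regex>
#include <string>

// The escaped literal from the post, and the same pattern as a raw string.
const std::string escaped = "^Length\\s*=\\s*(\\d+)\\s*$";
const std::string raw     = R"(^Length\s*=\s*(\d+)\s*$)";

// Hypothetical helper: return capture group 1 (the digits), or an empty
// string when the line does not match.
std::string length_of(const std::string& line)
{
    std::smatch m;
    if (std::regex_search(line, m, std::regex(raw)))
        return m[1].str();   // m[0] is the whole match, m[1] the first (...) group
    return std::string();
}
```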
 
Old 11-22-2007, 10:19 AM   #8
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,395
Blog Entries: 2

Rep: Reputation: 903
Some lex code to build a simple parser. Echoes found records to stdout.
Code:
/* LQabhisheknayak.l   */

WS          [\t ]


%%

Length{WS}*={WS}*[^,\n]+           { printf( "LENGTH record: %s\n",     yytext ); }
Score{WS}*={WS}*[^,\n]+            { printf( "SCORE record: %s\n",      yytext ); }
Expect{WS}*={WS}*[^,\n]+           { printf( "EXPECT record: %s\n",     yytext ); }
Identities{WS}*={WS}*[^,\n]+       { printf( "IDENTITIES record: %s\n", yytext ); }
Positives{WS}*={WS}*[^,\n]+        { printf( "POSITIVES record: %s\n",  yytext ); }
Gaps{WS}*={WS}*[^,\n]+             { printf( "GAPS record: %s\n",       yytext ); }
.                                           ;
\n                                           ;

%%

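#include 	<stdio.h>
#include 	<stdlib.h>

/* Returning 1 tells the scanner there is no more input after EOF. */
int     yywrap(){
    return 1;
}

/* Error reporter in the style yacc expects. Nothing in this lex-only
   program calls it; it is kept in case a yacc grammar is added later.
*/
void yyerror (char const *s){
    fprintf (stderr, "%s\n", s);
}

int main( int argc, char * argv[] ){

    ++argv, --argc;  /* skip over program name */
    if ( argc > 0 ) {
            yyin = fopen( argv[0], "r" );
            if ( yyin == NULL ) {   /* report an unreadable input file */
                    perror( argv[0] );
                    return EXIT_FAILURE;
            }
    } else
            yyin = stdin;

    yylex();
    return 0;
}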
Build with
Code:
make LQabhisheknayak
lex  -t LQabhisheknayak.l > LQabhisheknayak.c
cc    -c -o LQabhisheknayak.o LQabhisheknayak.c
cc   LQabhisheknayak.o   -o LQabhisheknayak
Run against sample text:
Code:
 ./LQabhisheknayak LQabhisheknayak.txt
LENGTH record: Length = 731
SCORE record: Score = 75.1 bits (183)
EXPECT record: Expect = 1e-12
IDENTITIES record: Identities = 44/161 (27%)
POSITIVES record: Positives = 75/161 (46%)
GAPS record: Gaps = 3/161 (1%)
SCORE record: Score = 75.1 bits (183)
EXPECT record: Expect = 1e-12
IDENTITIES record: Identities = 42/161 (26%)
POSITIVES record: Positives = 79/161 (48%)
GAPS record: Gaps = 3/161 (1%)
SCORE record: Score = 75.1 bits (183)
EXPECT record: Expect = 1e-12
IDENTITIES record: Identities = 43/161 (26%)
POSITIVES record: Positives = 76/161 (46%)
GAPS record: Gaps = 3/161 (1%)
SCORE record: Score = 74.7 bits (182)
EXPECT record: Expect = 2e-12
IDENTITIES record: Identities = 40/161 (24%)
POSITIVES record: Positives = 81/161 (49%)
GAPS record: Gaps = 3/161 (1%)
This is a simple example, done all in C. C++ is also supported by flex. You haven't said what you want to do with the parser output, so the trivial case is to echo to standard output.
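If the echoed records need to become numbers, one option is to hand the matched text to a small conversion routine. The sketch below is an assumption about how that might look in C++: the Ratio struct and parse_ratio function are invented names, and the text passed in is assumed to have the form shown in the output above, for example "Identities = 44/161 (27%)".

```cpp
#include <cstdio>

// Hypothetical record for a "matched/total (percent%)" field.
struct Ratio {
    int matched;
    int total;
    int percent;
};

// Parse text such as "Identities = 44/161 (27%)".
// Returns false when the text does not have that shape.
bool parse_ratio(const char* text, Ratio& r)
{
    // %*[^=] skips the field name up to the '='; a space in the format
    // matches any amount of whitespace; %% matches a literal '%'.
    return std::sscanf(text, "%*[^=]= %d/%d (%d%%)",
                       &r.matched, &r.total, &r.percent) == 3;
}
```

In a scanner built as C++, an action for the Identities, Positives or Gaps rule could call parse_ratio(yytext, r) instead of printf.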
--- rod.

Last edited by theNbomr; 11-22-2007 at 10:24 AM.
 
Old 11-23-2007, 01:33 AM   #9
abhisheknayak
LQ Newbie
 
Registered: Nov 2007
Posts: 6

Original Poster
Rep: Reputation: 0
Thanks a lot for your replies, guys, but I am currently stuck on some other urgent work, so it will be 2-3 days before I can pick this up again. Until then I won't be able to look closely at your replies. I will get back to you after that.
 
  



Tags
c++, perl, qt, regex, regexp



All times are GMT -5. The time now is 05:21 PM.
