LinuxQuestions.org
Old 08-23-2010, 07:29 PM   #1
Feynman
Member
 
Registered: Aug 2010
Distribution: Gentoo
Posts: 62

Rep: Reputation: 15
Methods for extracting data strings from output files


I am trying to develop a method of reading files generated by other programs. I am trying to find the most versatile approach. I have been trying bash and have been making good progress with sed; however, I was wondering if there is a "standard" approach to this sort of thing.

The main features I would like to implement concern finding strings based on various forms of context and storing them in variables and/or arrays. Here are the most general tasks:
a) Read the first word (or floating-point number) that comes after a given string (solved in another thread)
b) Read the nth line after a given string
c) Read all text between two given strings
d) Save the output of task a), task b) or task c) (above) into an array if the "given string(s)" is/are not unique.
e) Read text between two non-unique strings, i.e. text between the nth occurrence of string1 and the mth occurrence of string2

As far as I can tell, those five scripts should be able to parse just about any text pattern.

Does anyone have any suggestions for approaches (perl, sed, bash, etc)? I am by no means fluent in these languages. But I could use a starting point. My main concern is speed. I intend to use these scripts in a program that reads and writes hundreds of input and output files--each with a different value of some parameter(s). The files will most likely be no more than a few dozen lines, but I can think of some applications that could generate a few hundred lines. I have the input file generator down pretty well. Parsing the output is quite a bit trickier.

And, of course, the option for parallelization will be very desirable for many practical applications.
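
A minimal sketch of one way to parallelize the parsing step, assuming an `xargs` that supports the `-0` and `-P` extensions (GNU and BSD versions do; they are not in POSIX). The function name `parse_parallel` and its arguments are hypothetical, chosen only for illustration. It prints the file name and the first word after a pattern for each file, running several awk processes at once.

```shell
# parse_parallel JOBS PATTERN FILE...
# Hypothetical helper: run up to JOBS awk processes at once, one file each,
# and print "<file> <first word after PATTERN>" for every file that matches.
parse_parallel() {
    njobs=$1 pat=$2
    shift 2
    # NUL-separated names survive spaces in file names.
    printf '%s\0' "$@" |
        xargs -0 -n 1 -P "$njobs" awk -v pat="$pat" '
            match($0, pat) {
                rest = substr($0, RSTART + RLENGTH)   # text after the match
                if (split(rest, w) > 0) { print FILENAME, w[1]; exit }
            }'
}
```

The output order depends on which process finishes first, so sort it if the order matters. For files of a few dozen lines the cost of starting each process may outweigh the gain; raising `-n` so each awk handles a batch of files is one possible adjustment.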

And if anyone cares to take a crack at writing a script that performs these tasks please share!
 
Old 08-23-2010, 08:14 PM   #2
wje_lq
Member
 
Registered: Sep 2007
Location: Mariposa
Distribution: Debian lenny, Slackware 12
Posts: 806

Rep: Reputation: 161Reputation: 161
Quote:
Originally Posted by Feynman View Post
I am trying to develop a method of reading files generated by other programs. I am trying to find the most versatile approach.
Oh. I thought your main concern was speed. Oh, wait, it is:
Quote:
Originally Posted by Feynman View Post
My main concern is speed. I intend to use these scripts in a program that reads and writes hundreds of input and output files--each with a different value of some parameter(s). The files will most likely be no more than a few dozen lines, but I can think of some applications that could generate a few hundred lines.
If most of your files are small, and your main concern is speed, it could very well end up that your main speed bottleneck is the time it takes to open each file. If this is the case, then it doesn't matter what language or approach you use.
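
One per-file cost that is under the script's control is process startup: a loop that launches a new sed or awk for every file pays it hundreds of times. A minimal sketch, assuming all files share the same layout, of handing every file to a single awk process instead. The function name `first_value_per_file` is hypothetical.

```shell
# first_value_per_file PATTERN FILE...
# Hypothetical helper: one awk process reads every file and prints
# "<file> <first word after PATTERN>" once per file.
first_value_per_file() {
    pat=$1
    shift
    awk -v pat="$pat" '
        FNR == 1 { done = 0 }                 # new file: search again
        !done && match($0, pat) {
            rest = substr($0, RSTART + RLENGTH)   # text after the match
            if (split(rest, w) > 0) { print FILENAME, w[1]; done = 1 }
        }
    ' "$@"
}
```

Called as `first_value_per_file 'E = ' *.out`, this would start awk once no matter how many files the glob expands to. Note that `-v` processes backslash escapes in the pattern.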

The second most probable speed bottleneck is the time it takes to parse each file. If your main concern is speed, it might well be that you want to use lex and yacc (or, on Linux systems, flex and bison). To use these, you will be learning C or C++.

Learning C or C++ probably represents a development effort that is an order of magnitude higher than you want to invest. If that's the case, perhaps your main concern, at least at the beginning, is not speed, but effort required to develop the software.
Quote:
Originally Posted by Feynman View Post
And if anyone cares to take a crack at writing a script that performs these tasks please share!
"Anyone", in this case, would be Feynman. We normally don't do that sort of stuff around here. Show us some code, tell us precisely why it's not working, and we'll probably be glad to comment.

Hope this helps.
 
1 members found this post helpful.
Old 08-23-2010, 08:15 PM   #3
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 451Reputation: 451Reputation: 451Reputation: 451Reputation: 451
Quote:
Originally Posted by Feynman View Post
...
As far as I can tell, those five scripts should be able to parse just about any text pattern.

Does anyone have any suggestions for approaches (perl, sed, bash, etc)?
...
There is no such thing as "be able to parse just about any text pattern" - you have to define and implement your input language. So it may be better to standardize the output formats.

Yes, I recommend Perl.

Wherever I could define standards, I made programs output their data as Perl hierarchical data structures, so no parsing was necessary in the first place; Perl itself was used as the parser.
 
1 members found this post helpful.
Old 08-23-2010, 08:45 PM   #4
Feynman
Member
 
Registered: Aug 2010
Distribution: Gentoo
Posts: 62

Original Poster
Rep: Reputation: 15
I guess I'll give Perl a shot. I will post a new thread if/when I hit a roadblock. Thanks for the long responses to my admittedly vague questions!

Last edited by Feynman; 08-23-2010 at 09:00 PM.
 
Old 08-23-2010, 08:51 PM   #5
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 451Reputation: 451Reputation: 451Reputation: 451Reputation: 451
Perl's regular expressions engine is quite well optimized, so you may expect quite good speed.
 
1 members found this post helpful.
Old 08-23-2010, 09:17 PM   #6
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,695
Blog Entries: 5

Rep: Reputation: 239Reputation: 239Reputation: 239
Quote:
Originally Posted by Feynman View Post
I guess I'll give Perl a shot.
You can do all five of the tasks you mention with (g)awk as well. See my sig to learn about gawk.

Quote:
a) Read the first word (or floating-point number) that comes after a given string (solved in another thread)
Code:
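# Print the first floating-point field found after "pattern", then stop
awk '!f && match($0,/pattern/){f=1; $0=substr($0,RSTART+RLENGTH)} f{for(i=1;i<=NF;i++) if($i ~ /[0-9]+\.[0-9]+/){print $i; exit}}' file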
Quote:
b) Read the nth line after a given string
Code:
# c=2 selects the 2nd line after the match; set c to n for the nth line
awk 'c&&!--c;/pattern/{c=2}' file
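
A minimal sketch of the same countdown idea with n passed in as a parameter instead of being hard-coded. The wrapper name `nth_after` is hypothetical, and the pattern is treated as an awk regular expression.

```shell
# nth_after N PATTERN FILE...
# Hypothetical helper: print the line that sits N lines below each line
# matching PATTERN.
nth_after() {
    n=$1 pat=$2
    shift 2
    awk -v n="$n" -v pat="$pat" '
        c && !--c             # countdown reached zero: print this line
        $0 ~ pat { c = n }    # restart the countdown at every match
    ' "$@"
}
```

For example, `nth_after 2 'pattern' file` would print the second line after each match. A match that occurs during a countdown restarts it, which may or may not be what is wanted.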
Quote:
c) Read all text between two given strings
Code:
# A multi-character RS works in gawk; POSIX leaves its behavior unspecified
awk -v RS="string2" '/string1/{gsub(/.*string1/,"");print}' file
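
Where the two strings sit on lines of their own, a line-oriented variant avoids the multi-character record separator. This is a sketch of a different technique (a flag toggled by the marker lines), not the one-liner above, and the name `between` is hypothetical. It prints whole lines only, so text on the marker lines themselves is dropped.

```shell
# between START END FILE...
# Hypothetical helper: print the lines strictly between a line matching
# START and the next line matching END (both are awk regular expressions).
between() {
    start=$1 end=$2
    shift 2
    awk -v s="$start" -v e="$end" '
        $0 ~ e { inside = 0 }   # closing marker: stop before printing it
        inside                  # print every line while inside a block
        $0 ~ s { inside = 1 }   # opening marker: start with the next line
    ' "$@"
}
```

If the markers are not unique, every START/END block is printed in turn.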
Quote:
d) Save the output of task a), task b) or task c) (above) into an array if the "given string(s)" is/are not unique.
I don't understand this one, but I am 100% sure this is easy to do as well.
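
One possible reading of task d), sketched under the assumption that "array" can mean an awk array: every first word found after the given string is stored in the array `found` and listed with its index at the end. The function name `first_words_after` is hypothetical.

```shell
# first_words_after PATTERN FILE...
# Hypothetical helper: collect the first word after every line-level match
# of PATTERN into an awk array, then print "<index> <value>" pairs.
first_words_after() {
    pat=$1
    shift
    awk -v pat="$pat" '
        match($0, pat) {
            rest = substr($0, RSTART + RLENGTH)    # text after the match
            if (split(rest, w) > 0) found[++n] = w[1]
        }
        END { for (i = 1; i <= n; i++) print i, found[i] }
    ' "$@"
}
```

Only the first match on each line is recorded. In bash 4 or later, the printed values could then be loaded into a shell array with the `mapfile -t` builtin if the rest of the program is written in bash.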

Quote:
e) Read text between two non-unique strings, i.e. text between the nth occurrence of string1 and the mth occurrence of string2
One way:
Code:
awk 'FNR==NR{                  # first pass: locate the marker lines
  for(i=1;i<=NF;i++) {
     if($i == "string1"){
        tm++
        if( tm==3 ){ linetm=FNR} # 3rd occurrence of string1
     }
     if($i == "string2") {
       sm++
       if( sm==2) { linesm=FNR} # 2nd occurrence of string2
     }
  }
  next
}
FNR > linetm && FNR < linesm{  # second pass: lines between the two markers
  print
} ' file file
Not perfect, but you get the drift. And don't worry about awk's speed. It can be as fast as, and sometimes faster than, Perl/Python.
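
A sketch of a single-pass variant with n and m passed as parameters. It is a different technique from the two-pass script above: it keeps the file in memory, which should be fine for files of a few hundred lines. The name `between_nth` is hypothetical, and it counts lines containing each string as plain text, so two occurrences on one line count once.

```shell
# between_nth N STR1 M STR2 FILE
# Hypothetical helper: print the lines strictly between the Nth line
# containing STR1 and the Mth line containing STR2.
between_nth() {
    awk -v n="$1" -v s1="$2" -v m="$3" -v s2="$4" '
        { line[NR] = $0 }                          # remember every line
        index($0, s1) && ++c1 == n { from = NR }   # Nth line with STR1
        index($0, s2) && ++c2 == m { to = NR }     # Mth line with STR2
        END {
            if (from && to)
                for (i = from + 1; i < to; i++) print line[i]
        }
    ' "$5"
}
```

For example, `between_nth 3 string1 2 string2 file` would correspond to the hard-coded counts in the script above. Nothing is printed if either marker is missing or if the STR2 line does not come after the STR1 line.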

Last edited by ghostdog74; 08-23-2010 at 10:12 PM.
 
1 members found this post helpful.
Old 08-23-2010, 09:25 PM   #7
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 451Reputation: 451Reputation: 451Reputation: 451Reputation: 451
Quote:
Originally Posted by ghostdog74 View Post
You can do all five of the tasks you mention with (g)awk as well. See my sig to learn about gawk.
Still, 'awk' is an "underlanguage".
 
1 members found this post helpful.
Old 08-23-2010, 09:27 PM   #8
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 451Reputation: 451Reputation: 451Reputation: 451Reputation: 451
Quote:
Originally Posted by Feynman View Post
I guess I'll give Perl a shot. ...
http://perldoc.perl.org/
 
1 members found this post helpful.
Old 08-23-2010, 09:43 PM   #9
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,695
Blog Entries: 5

Rep: Reputation: 239Reputation: 239Reputation: 239
Quote:
Originally Posted by Sergei Steshenko View Post
Still, 'awk' is an "underlanguage".
awk is the grandfather of Perl. And for his purpose, i.e. parsing files, awk is enough, and sometimes even faster than Perl. You are hereby advised to look at awk.info, especially this
 
1 members found this post helpful.
Old 08-23-2010, 09:52 PM   #10
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 451Reputation: 451Reputation: 451Reputation: 451Reputation: 451
Quote:
Originally Posted by ghostdog74 View Post
awk is the grandfather of Perl. And for his purpose, i.e. parsing files, awk is enough, and sometimes even faster than Perl. You are hereby advised to look at awk.info, especially this
I know that 'awk' is the grandfather of Perl. Languages like Perl/Python/Ruby were invented to avoid the necessity to deal with the (un)holy sh/sed/awk trinity.
 
1 members found this post helpful.
Old 08-23-2010, 09:59 PM   #11
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,695
Blog Entries: 5

Rep: Reputation: 239Reputation: 239Reputation: 239
Quote:
Originally Posted by Sergei Steshenko View Post
I know that 'awk' is the grandfather of Perl. Languages like Perl/Python/Ruby were invented to avoid the necessity to deal with the (un)holy sh/sed/awk trinity.
Back up your point on the (un)holy sh/sed/awk trinity with facts and figures as pertaining to OP's question. In other words, tell us why you do not recommend awk in this case. Otherwise, your comment does not hold any weight.

Last edited by ghostdog74; 08-23-2010 at 10:01 PM.
 
1 members found this post helpful.
Old 08-23-2010, 10:12 PM   #12
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 451Reputation: 451Reputation: 451Reputation: 451Reputation: 451
Quote:
Originally Posted by ghostdog74 View Post
...with ... figures as pertaining to OP's question. ...
One language instead of many.
 
1 members found this post helpful.
Old 08-23-2010, 10:22 PM   #13
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,695
Blog Entries: 5

Rep: Reputation: 239Reputation: 239Reputation: 239
Quote:
Originally Posted by Sergei Steshenko View Post
One language instead of many.
Wrong. awk is a programming language. There is no need to use other tools to parse files; awk alone is enough. So where are the "many languages" you are talking about? The point is this: you can get the job done with awk, Perl, Python, Ruby, or whatever for his case. Your argument that awk is an "underlanguage" and not suitable for the job, because you think only Perl/Python/Ruby can do it, is flawed and weak. I have proven that awk can also do it in my reply to the OP's particular tasks.
 
1 members found this post helpful.
Old 08-23-2010, 10:32 PM   #14
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 451Reputation: 451Reputation: 451Reputation: 451Reputation: 451
Quote:
Originally Posted by ghostdog74 View Post
Wrong. awk is a programming language. There is no need to use other tools to parse files; awk alone is enough. So where are the "many languages" you are talking about? The point is this: you can get the job done with awk, Perl, Python, Ruby, or whatever for his case. Your argument that awk is an "underlanguage" and not suitable for the job, because you think only Perl/Python/Ruby can do it, is flawed and weak. I have proven that awk can also do it in my reply to the OP's particular tasks.
I know that dealing with massive amounts of scientific data will reveal needs other than parsing.
 
1 members found this post helpful.
Old 08-23-2010, 10:50 PM   #15
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,695
Blog Entries: 5

Rep: Reputation: 239Reputation: 239Reputation: 239
Quote:
Originally Posted by Sergei Steshenko View Post
I know that dealing with massive amounts of scientific data will reveal needs other than parsing.
So do you think awk has no capabilities to handle scientific data? If you can back up this comment with some facts/examples, I will believe what you say. Other than that, you are just reading too many assumptions into the original problem (question) and saying something that has no concrete proof (that awk is not suitable for his tasks; yes, read the key words: his tasks).
 
1 members found this post helpful.
  


Tags
data, file, parse, string, text


All times are GMT -5. The time now is 06:54 PM.
