LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (http://www.linuxquestions.org/questions/programming-9/)
-   -   Methods for extracting data strings from output files (http://www.linuxquestions.org/questions/programming-9/methods-for-extracting-data-strings-from-output-files-828088/)

Feynman 08-23-2010 07:29 PM

Methods for extracting data strings from output files
 
I am trying to develop a method of reading files generated by other programs. I am trying to find the most versatile approach. I have been trying bash and have been making good progress with sed; however, I was wondering whether there is a "standard" approach to this sort of thing.

The main features I would like to implement concern finding strings based on various forms of context and storing them in variables and/or arrays. Here are the most general tasks:
a) Read the first word (or floating point) that comes after a given string (solved in another thread)
b) Read the nth line after a given string
c) Read all text between two given strings
d) Save the output of task a), b), or c) (above) into an array if the "given string(s)" is/are not unique
e) Read text between two non-unique strings, i.e. text between the nth occurrence of string1 and the mth occurrence of string2
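For instance, task (c) can be sketched with sed's range addressing. This is a minimal sketch, assuming the two marker strings each appear once per file; "START", "STOP", and sample.txt are placeholder names:

```shell
# Minimal sketch of task (c): print all lines strictly between two
# marker strings (the markers themselves are excluded).
# "START", "STOP", and sample.txt are placeholders.
printf 'a\nSTART\nb\nc\nSTOP\nd\n' > sample.txt
sed -n '/START/,/STOP/{/START/d;/STOP/d;p;}' sample.txt
# prints:
# b
# c
```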

As far as I can tell, scripts implementing those five tasks should be able to extract just about any text pattern.

Does anyone have any suggestions for approaches (perl, sed, bash, etc.)? I am by no means fluent in these languages, but I could use a starting point. My main concern is speed. I intend to use these scripts in a program that reads and writes hundreds of input and output files--each with a different value of some parameter(s). The files will most likely be no more than a few dozen lines, but I can think of some applications that could generate a few hundred lines. I have the input file generator down pretty well; parsing the output is quite a bit trickier.

And, of course, the option for parallelization will be very desirable for many practical applications.

And if anyone cares to take a crack at writing a script that performs these tasks, please share!

wje_lq 08-23-2010 08:14 PM

Quote:

Originally Posted by Feynman (Post 4075563)
I am trying to develop a method of reading files generated by other programs. I am trying to find the most versatile approach.

Oh. I thought your main concern was speed. Oh, wait, it is:
Quote:

Originally Posted by Feynman (Post 4075563)
My main concern is speed. I intend to use these scripts in a program that reads and writes hundreds of input and output files--each with a different value of some parameter(s). The files will most likely be no more than a few dozen lines, but I can think of some applications that could generate a few hundred lines.

If most of your files are small, and your main concern is speed, it could very well end up that your main speed bottleneck is the time it takes to open each file. If this is the case, then it doesn't matter what language or approach you use.

The second most probable speed bottleneck is the time it takes to parse each file. If your main concern is speed, it might well be that you want to use lex and yacc (or, on Linux systems, flex and bison). To use these, you will be learning C or C++.

Learning C or C++ probably represents a development effort that is an order of magnitude higher than you want to invest. If that's the case, perhaps your main concern, at least at the beginning, is not speed, but effort required to develop the software.
Quote:

Originally Posted by Feynman (Post 4075563)
And if anyone cares to take a crack at writing a script that preforms these tasks please share!

"Anyone", in this case, would be Feynman. We normally don't do that sort of stuff around here. Show us some code, tell us precisely why it's not working, and we'll probably be glad to comment.

Hope this helps.

Sergei Steshenko 08-23-2010 08:15 PM

Quote:

Originally Posted by Feynman (Post 4075563)
...
As far as I can tell, those five scripts should be able to parse just about any text pattern.

Does anyone have any suggestions for approaches (perl, sed, bash, etc)?
...

There is no such thing as being "able to parse just about any text pattern" - you have to define and implement your input language. So it may be better to do some standardization of the output formats.

Yes, I recommend Perl.

Wherever I could define the standards, I made programs output their data as Perl hierarchical data structures, so no parsing was necessary in the first place; Perl itself was used as the parser.

Feynman 08-23-2010 08:45 PM

I guess I'll give Perl a shot. I will post a new thread if/when I hit a roadblock. Thanks for the long responses to my admittedly vague questions!

Sergei Steshenko 08-23-2010 08:51 PM

Perl's regular expression engine is quite well optimized, so you can expect quite good speed.

ghostdog74 08-23-2010 09:17 PM

Quote:

Originally Posted by Feynman (Post 4075613)
I guess I'll give Perl a shot.

You can do all five tasks you mention with (g)awk as well. See my sig to learn about gawk.

Quote:

a) Read the first word(or floating point) that comes after a given string (solved in another thread)
Code:

awk '/pattern/{f=1}f{for(i=1;i<=NF;i++) {if($i ~/[0-9]+\.[0-9]+/) print $i} }' file
Quote:

b) Read the nth line after a given string
Code:

awk 'c&&c--;/pattern/{c=2}' file
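Note that with c=2 the one-liner above prints the two lines after each match; if only the nth line itself is wanted, a small variant works (a sketch, with "pattern" and demo.txt as placeholders):

```shell
# Print only the 2nd line after each match of "pattern" (placeholder);
# c counts down and the line is printed when the count reaches zero.
printf 'x\npattern\nfirst\nsecond\nthird\n' > demo.txt
awk 'c&&!--c;/pattern/{c=2}' demo.txt
# prints: second
```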
Quote:

c) Read all text between two given strings
Code:

awk -vRS="string2" '/string1/{gsub(/.*string1/,"");print}' file
Quote:

d) Save the output of task a), task b) or task c) (above) into an array if the "given string(s)" is/are not unique.
I don't understand this one, but I am 100% sure it is easy to do as well.

Quote:

e)Read text between two non-unique strings i.e. text between the nth occurrence of string1 and the mth occurrence of string2
One way:
Code:

awk 'FNR==NR{
  for(i=1;i<=NF;i++) {
    if($i == "string1"){
        tm++
        if( tm==3 ){ linetm=FNR } #3rd occurrence
    };
    if($i == "string2") {
      sm++
      if( sm==2 ){ linesm=FNR } #2nd occurrence
    }
  }
  next
}
FNR > linetm && FNR < linesm{
  print
} ' file file

not perfect, but you get the drift. And don't worry about awk's speed. It can be as fast, and sometimes faster than Perl/Python.
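A simpler line-oriented sketch of task (e) is also possible (it counts lines containing the marker rather than individual fields; n, m, the markers s1/s2, and data.txt are placeholders):

```shell
# Print the lines between the nth line containing s1 and the mth line
# containing s2, exclusive. n, m, s1, s2, data.txt are placeholders.
printf 'junk\ns1\nkeep1\nkeep2\ns2\nmore junk\n' > data.txt
awk -v n=1 -v m=1 '/s1/ && ++a==n {f=1; next} /s2/ && ++b==m {f=0} f' data.txt
# prints:
# keep1
# keep2
```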

Sergei Steshenko 08-23-2010 09:25 PM

Quote:

Originally Posted by ghostdog74 (Post 4075640)
you can do all those 5 tasks you mention with (g)awk as well. See my sig to learn about gawk

Still, 'awk' is an "underlanguage".

Sergei Steshenko 08-23-2010 09:27 PM

Quote:

Originally Posted by Feynman (Post 4075613)
I guess I'll give Perl a shot. ...

http://perldoc.perl.org/

ghostdog74 08-23-2010 09:43 PM

Quote:

Originally Posted by Sergei Steshenko (Post 4075648)
Still, 'awk' is an "underlanguage".

awk is the grandfather of Perl. And for his purpose, i.e. parsing files, awk is enough, and sometimes even faster than Perl. You are hereby advised to look at awk.info, especially this

Sergei Steshenko 08-23-2010 09:52 PM

Quote:

Originally Posted by ghostdog74 (Post 4075658)
awk is the grandfather of Perl. And for his purpose, i.e. parsing files, awk is enough, and sometimes even faster than Perl. You are hereby advised to look at awk.info, especially this

I know that 'awk' is the grandfather of Perl. Languages like Perl/Python/Ruby were invented to avoid the necessity of dealing with the (un)holy sh/sed/awk trinity.

ghostdog74 08-23-2010 09:59 PM

Quote:

Originally Posted by Sergei Steshenko (Post 4075660)
I know that 'awk' is the grandfather of Perl. Languages like Perl/Python/Ruby were invented to avoid the necessity of dealing with the (un)holy sh/sed/awk trinity.

Back up your point about the (un)holy sh/sed/awk trinity with facts and figures pertaining to the OP's question. In other words, tell us why you think awk is not recommended (by you) to be used in this case. Otherwise, your comment does not hold any weight.

Sergei Steshenko 08-23-2010 10:12 PM

Quote:

Originally Posted by ghostdog74 (Post 4075666)
...with ... figures pertaining to the OP's question. ...

One language instead of many.

ghostdog74 08-23-2010 10:22 PM

Quote:

Originally Posted by Sergei Steshenko (Post 4075673)
One language instead of many.

Wrong. awk is a programming language. There is no need to use other tools to parse files; awk alone is enough. So where are the "many languages" you are talking about? The point is this: you can get the job done with awk, Perl/Python/Ruby, whatever, for his case. Your argument that awk is an "underlanguage" and not suitable for the job, because you think only Perl/Python/Ruby can do it, is flawed and weak. I have shown that awk can do it in my reply to the OP's particular tasks.

Sergei Steshenko 08-23-2010 10:32 PM

Quote:

Originally Posted by ghostdog74 (Post 4075677)
Wrong. awk is a programming language. There is no need to use other tools to parse files; awk alone is enough. So where are the "many languages" you are talking about? The point is this: you can get the job done with awk, Perl/Python/Ruby, whatever, for his case. Your argument that awk is an "underlanguage" and not suitable for the job, because you think only Perl/Python/Ruby can do it, is flawed and weak. I have shown that awk can do it in my reply to the OP's particular tasks.

I know that dealing with massive amounts of scientific data will reveal needs other than parsing.

ghostdog74 08-23-2010 10:50 PM

Quote:

Originally Posted by Sergei Steshenko (Post 4075685)
I know that dealing with massive amounts of scientific data will reveal needs other than parsing.

So do you think awk has no capability to handle scientific data? If you can back up this comment with some facts/examples, I will believe what you say. Other than that, you are putting too many assumptions into the original problem (question) and saying something that has no concrete proof (that awk is not suitable for his tasks - yes, read the key words: his tasks).

