[SOLVED] Methods for extracting data strings from output files
I am trying to develop a method of reading files generated by other programs. I am trying to find the most versatile approach. I have been trying bash and have been making good progress with sed, but I was wondering whether there is a "standard" approach to this sort of thing.
The main features I would like to implement concern finding strings based on various forms of context and storing them in variables and/or arrays. Here are the most general tasks:
a) Read the first word (or floating-point number) that comes after a given string (solved in another thread)
b) Read the nth line after a given string
c) Read all text between two given strings
d) Save the output of task a), task b) or task c) (above) into an array if the "given string(s)" is/are not unique.
e) Read text between two non-unique strings, i.e. text between the nth occurrence of string1 and the mth occurrence of string2
As far as I can tell, those five scripts should be able to parse just about any text pattern.
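For anyone wanting a concrete starting point, tasks a) through e) can be sketched in a few lines of Python. This is only an illustration, not code from the thread: the function names, the "Energy =" label and the BEGIN/END markers in the sample are all made up. Tasks a), b) and c) each return a list, which covers task d) when the marker string is not unique.

```python
import re


def first_token_after(text, marker):
    """Tasks a) and d): first whitespace-delimited token after each marker."""
    pattern = re.escape(marker) + r"\s*(\S+)"
    return re.findall(pattern, text)


def nth_line_after(text, marker, n):
    """Tasks b) and d): the line n lines below each line containing marker."""
    lines = text.splitlines()
    return [lines[i + n] for i, line in enumerate(lines)
            if marker in line and i + n < len(lines)]


def between(text, start, end):
    """Tasks c) and d): text between each start marker and the next end marker."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    return re.findall(pattern, text, flags=re.DOTALL)


def between_nth(text, start, n, end, m):
    """Task e): text between the nth start and the mth end marker (1-based)."""
    starts = [hit.end() for hit in re.finditer(re.escape(start), text)]
    ends = [hit.start() for hit in re.finditer(re.escape(end), text)]
    if n > len(starts) or m > len(ends) or ends[m - 1] < starts[n - 1]:
        return None  # not enough occurrences, or the end comes before the start
    return text[starts[n - 1]:ends[m - 1]]


# Hypothetical output file contents, used only to exercise the functions.
sample = """\
Iteration 1
Energy = -1.250 hartree
BEGIN
x 0.0
END
Iteration 2
Energy = -1.375 hartree
BEGIN
x 0.5
END
"""

energies = [float(tok) for tok in first_token_after(sample, "Energy =")]
next_lines = nth_line_after(sample, "Iteration", 1)
blocks = [blk.strip() for blk in between(sample, "BEGIN", "END")]
span = between_nth(sample, "BEGIN", 1, "END", 2)
```

The markers are passed through re.escape, so they are matched literally rather than as regular expressions. Tokens come back as strings; float() is applied by the caller only where a number is expected.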
Does anyone have any suggestions for approaches (perl, sed, bash, etc)? I am by no means fluent in these languages. But I could use a starting point. My main concern is speed. I intend to use these scripts in a program that reads and writes hundreds of input and output files--each with a different value of some parameter(s). The files will most likely be no more than a few dozen lines, but I can think of some applications that could generate a few hundred lines. I have the input file generator down pretty well. Parsing the output is quite a bit trickier.
And, of course, the option for parallelization will be very desirable for many practical applications.
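As a rough idea of what the parallel part could look like, here is a sketch in Python using a thread pool from the standard library. Everything specific in it is an assumption for the sake of the example: the "Energy =" label, the run_N.out file names and the read_energy helper are invented, and the files are throwaway ones written to a temporary directory.

```python
import re
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_energy(path):
    """Return the first number after the hypothetical label 'Energy =' in a file."""
    match = re.search(r"Energy =\s*(\S+)", Path(path).read_text())
    return float(match.group(1)) if match else None


def read_all(paths, workers=4):
    """Parse many small files concurrently; results keep the order of paths."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_energy, paths))


# Demo on throwaway files standing in for real program output.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = []
    for i in range(5):
        p = Path(tmp) / f"run_{i}.out"
        p.write_text(f"Step {i}\nEnergy = {-1.0 - i:.3f} hartree\n")
        demo_paths.append(p)
    energies = read_all(demo_paths)
```

Threads are used here because reading hundreds of small files is mostly waiting on I/O. If the parsing itself turned out to be the expensive part, ProcessPoolExecutor from the same module would be the thing to try instead.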
And if anyone cares to take a crack at writing a script that performs these tasks, please share!
Quote:
Originally Posted by Feynman
I am trying to develop a method of reading files generated by other programs. I am trying to find the most versatile approach.
Oh. I thought your main concern was speed. Oh, wait, it is:
Quote:
Originally Posted by Feynman
My main concern is speed. I intend to use these scripts in a program that reads and writes hundreds of input and output files--each with a different value of some parameter(s). The files will most likely be no more than a few dozen lines, but I can think of some applications that could generate a few hundred lines.
If most of your files are small, and your main concern is speed, it could very well end up that your main speed bottleneck is the time it takes to open each file. If this is the case, then it doesn't matter what language or approach you use.
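One way to check this claim on your own files is to time the two phases separately. The sketch below, in Python, is only an assumption about how one might measure it: the time_open_vs_parse name, the "Energy =" pattern and the generated demo files are all hypothetical, and the numbers will depend on disk caching and the machine.

```python
import re
import tempfile
import time
from pathlib import Path


def time_open_vs_parse(paths, pattern=r"Energy =\s*(\S+)"):
    """Return (seconds spent opening and reading, seconds spent regex parsing)."""
    regex = re.compile(pattern)
    read_time = parse_time = 0.0
    for path in paths:
        t0 = time.perf_counter()
        with open(path) as handle:
            text = handle.read()
        t1 = time.perf_counter()
        regex.findall(text)
        t2 = time.perf_counter()
        read_time += t1 - t0
        parse_time += t2 - t1
    return read_time, parse_time


# Demo on a few hundred small throwaway files of a few dozen lines each.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = []
    for i in range(200):
        p = Path(tmp) / f"run_{i}.out"
        p.write_text("filler line\n" * 30 + f"Energy = {-1.0 - i:.3f}\n")
        demo_paths.append(p)
    read_time, parse_time = time_open_vs_parse(demo_paths)

print(f"open+read: {read_time:.4f} s, parse: {parse_time:.4f} s")
```

If the first number dominates, the choice of parsing language matters little for files of this size.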
The second most probable speed bottleneck is the time it takes to parse each file. If your main concern is speed, it might well be that you want to use lex and yacc (or, on Linux systems, flex and bison). To use these, you will be learning C or C++.
Learning C or C++ probably represents a development effort that is an order of magnitude higher than you want to invest. If that's the case, perhaps your main concern, at least at the beginning, is not speed, but effort required to develop the software.
Quote:
Originally Posted by Feynman
And if anyone cares to take a crack at writing a script that performs these tasks, please share!
"Anyone", in this case, would be Feynman. We normally don't do that sort of stuff around here. Show us some code, tell us precisely why it's not working, and we'll probably be glad to comment.
...
As far as I can tell, those five scripts should be able to parse just about any text pattern.
Does anyone have any suggestions for approaches (perl, sed, bash, etc)?
...
There is no such thing as being "able to parse just about any text pattern"; you have to define and implement your input language. So it may be better to do some standardization of the output formats.
Yes, I recommend Perl.
Wherever I could define standards, I made programs output their data as Perl hierarchical data structures, so no parsing was necessary in the first place; Perl itself served as the parser.
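The same idea carries over to other languages. The sketch below is a Python analogue using JSON, not the Perl code described above: the producing program writes its results in a format the language can load directly, so the consumer needs no hand-written parser. The field names and values are made up for illustration.

```python
import json

# Producer side: the program that generates results writes them as JSON
# instead of free-form text. These field names are hypothetical.
results = {
    "parameter": 0.5,
    "energy": -1.375,
    "coords": [[0.0, 0.0], [0.5, 0.1]],
}
serialized = json.dumps(results, indent=2)

# Consumer side: the standard library does the parsing, no regexes needed.
loaded = json.loads(serialized)
energy = loaded["energy"]
```

This only works where you control the output format, which is the condition stated above ("wherever I could define standards").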
awk is the grandfather of Perl. And for his purpose, i.e. parsing files, awk is enough, and sometimes even faster than Perl. You are hereby advised to look at awk.info, especially this
I know that 'awk' is the grandfather of Perl. Languages like Perl/Python/Ruby were invented to avoid the necessity of dealing with the (un)holy sh/sed/awk trinity.
Back up your point about the (un)holy sh/sed/awk trinity with facts and figures pertaining to the OP's question. In other words, tell us why you think awk is not recommended (by you) for this case. Otherwise, your comment does not hold any weight.
Last edited by ghostdog74; 08-23-2010 at 10:01 PM.
Wrong. awk is a programming language. There is no need to use other tools to parse files. Just awk is enough. So where is the "many language" you are talking about? The point is this: you can get the job done with awk, Perl, Python, Ruby, or whatever for his case. Your argument that awk is the "underlanguage" and not suitable for the job, because you think that only Perl/Python/Ruby can do it, is flawed and weak. I have proven that awk can also do it in my reply to the OP's particular task.
I know that dealing with massive amounts of scientific data will reveal needs other than parsing.
So do you think awk has no capabilities to handle scientific data? If you can back up this comment with some facts or examples, I will believe what you say. Otherwise, you are just reading too many assumptions into the original question and asserting, with no concrete proof, that awk is not suitable for his tasks. (Yes, read the key words: his tasks.)