Methods for extracting data strings from output files
I am trying to develop a method of reading files generated by other programs, and I am looking for the most versatile approach. I have been using bash and making good progress with sed, but I was wondering whether there is a "standard" approach to this sort of thing.
The main features I would like to implement concern finding strings based on various forms of context and storing them in variables and/or arrays. Here are the most general tasks:

a) Read the first word (or floating point number) that comes after a given string (solved in another thread)
b) Read the nth line after a given string
c) Read all text between two given strings
d) Save the output of task a), b), or c) into an array if the given string(s) is/are not unique
e) Read text between two non-unique strings, i.e. text between the nth occurrence of string1 and the mth occurrence of string2

As far as I can tell, those five scripts should be able to parse just about any text pattern. Does anyone have suggestions for approaches (perl, sed, bash, etc.)? I am by no means fluent in these languages, but I could use a starting point. My main concern is speed: I intend to use these scripts in a program that reads and writes hundreds of input and output files, each with a different value of some parameter(s). The files will most likely be no more than a few dozen lines, but I can think of some applications that could generate a few hundred. I have the input file generator down pretty well; parsing the output is quite a bit trickier. And, of course, the option for parallelization will be very desirable for many practical applications. And if anyone cares to take a crack at writing a script that performs these tasks, please share! |
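For concreteness, task d) can be sketched in bash with an awk helper. Everything here is hypothetical: the marker `result:` and the file `sample.txt` are made up for illustration.

```shell
# A minimal sketch of task (d): collect every match into a bash array.
# Marker "result:" and the sample data are invented for this demo.
cat > sample.txt <<'EOF'
result: 1.50 extra
noise line
result: 2.75 extra
EOF

# awk prints the word after each occurrence of the marker; the bash
# array captures one element per match, so a non-unique marker is
# handled naturally.
values=( $(awk '{for(i=1;i<NF;i++) if($i=="result:") print $(i+1)}' sample.txt) )

echo "count=${#values[@]}"
```

This collects `1.50` and `2.75` into `values`. The word-splitting assignment is fine here because the extracted tokens contain no whitespace; for arbitrary values a `while read` loop would be safer.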
Quote:
Quote:
The second most probable speed bottleneck is the time it takes to parse each file. If your main concern is speed, it might well be that you want to use lex and yacc (or, on Linux systems, flex and bison). To use these, you will be learning C or C++. Learning C or C++ probably represents a development effort that is an order of magnitude higher than you want to invest. If that's the case, perhaps your main concern, at least at the beginning, is not speed, but effort required to develop the software. Quote:
Hope this helps. |
Quote:
Yes, I recommend Perl. Wherever I could define standards, I was making programs to output data as Perl hierarchical data structures, so no parsing was necessary in the first place, rather Perl itself was used as the parser. |
I guess I'll give perl a shot. I will post a new thread if/when I hit a roadblock. Thanks for the long responses to my admittedly vague questions!
|
Perl's regular expression engine is quite well optimized, so you can expect quite good speed.
|
Quote:
Quote:
Code:
awk '/pattern/{f=1} f{for(i=1;i<=NF;i++) {if($i ~ /[0-9]+\.[0-9]+/) print $i}}' file
Quote:
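A quick way to sanity-check that one-liner (note it prints every float from the pattern line onward, not just the first; `demo.txt` and its contents are invented):

```shell
# Build a tiny test file (hypothetical contents).
printf '%s\n' 'header' 'pattern begins here' 'x 3.14 y' 'z 2.50 w' > demo.txt

# f is set on the /pattern/ line; from then on, every field that
# looks like a float is printed.
awk '/pattern/{f=1} f{for(i=1;i<=NF;i++){if($i ~ /[0-9]+\.[0-9]+/) print $i}}' demo.txt
```

Here it prints `3.14` and then `2.50`, one per line. To stop at the first match, you could change the action to `{print $i; exit}`.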
Code:
awk 'c&&c--;/pattern/{c=2}' file
Quote:
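The one-liner above prints the two lines following each match. If task b) wants only the nth line after the string, a small variant of the same countdown idiom does it (a sketch; `demo2.txt` is made up):

```shell
printf '%s\n' 'a' 'pattern' 'first' 'second' 'third' > demo2.txt

# c&&!--c fires only when the countdown reaches zero, i.e. exactly
# n lines after the match (here n=2).
awk 'c&&!--c; /pattern/{c=2}' demo2.txt
```

This prints only `second`, the 2nd line after the match, rather than both following lines.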
Code:
awk -v RS="string2" '/string1/{gsub(/.*string1/,""); print}' file
Quote:
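For example (this relies on GNU awk, where a multi-character RS is supported; `demo3.txt` and the marker strings are invented):

```shell
printf 'junk string1 keep this string2 more junk\n' > demo3.txt

# Records are split on "string2" (GNU awk allows a multi-char RS);
# within a record containing "string1", everything up to and
# including that marker is stripped before printing.
awk -v RS='string2' '/string1/{gsub(/.*string1/,""); print}' demo3.txt
```

This prints ` keep this ` (the spaces around the extracted text are preserved).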
Quote:
Code:
awk 'FNR==NR{ |
Quote:
|
Quote:
|
|
I know that 'awk' is the grandfather of Perl. Languages like Perl/Python/Ruby were invented to avoid the necessity of dealing with the (un)holy sh/sed/awk trinity.
|
Quote:
|
Quote:
|
Quote:
|
Quote:
|
Quote:
|
Quote:
|
WOW! Thank you so much! I cannot blame you for not fully understanding task d. I can give a simple but general example for the task a) implementation:
The text file reads:

blah blah blah add this word to the list: 1234.56 blah blah blah blah blah
now don't forget to add this word to the list: PINAPPLE blah blah
And for bonus points, it would be nice to know that the script would be able to add this word to the list: 1!@#$%^&*()[]{};:'",<.>/?asdf blah blah blah blah

As the file implies, save the words that come after "add this word to the list:" to a list. Thanks again. |
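Using a sample like the one above, task a) can be sketched with awk's field separator (an assumption baked into this sketch: the marker appears at most once per line; `list.txt` is invented):

```shell
cat > list.txt <<'EOF'
blah blah blah add this word to the list: 1234.56 blah blah
blah now don't forget to add this word to the list: PINAPPLE blah blah
EOF

# Split each line on the marker string; the first word of whatever
# follows the marker is the value to collect.
awk -F 'add this word to the list: ' 'NF>1 {split($2, a, " "); print a[1]}' list.txt
```

This prints `1234.56` and `PINAPPLE`, one per line; piping that into a bash array (as in the task d sketch) turns it into a list.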
Quote:
So what's the definition of "the grand scheme of things"? Are you saying that whenever one has a task to solve, he has to turn to Python/Perl/Ruby for the solution? Or what? I am not disputing your notion of turning to these three well-known languages for solving problems, but more often than not, the "grand scheme of things" depends on the environment and what tools one has at his disposal. Note: I did say that you can do it with awk as well (the key words being "as well"), but I did not say awk is the only thing that can solve the problem. |
Quote:
|
Quote:
You should stop imprinting that kind of "holy grail" thinking onto other people unaware of what's going on. By the way, I am curious: how do you actually measure and categorize "underlanguages"? It appears to me like it's a scientific and proven technique. |
Quote:
And yes, Perl/Python/Ruby can become underlanguages. DOS batch language is definitely an underlanguage, and even though many many years ago I knew it somewhat, now I wouldn't consider learning it. For example, for Windows there is portable "Strawberry Perl", so if I need to do massive scripting under Windows, I'll use that Perl instead of DOS batch language. |
Quote:
Quote:
Quote:
Coming back to the main point of the argument. You mentioned awk is an "underlanguage", and later you mentioned you did not say it's unsuitable for parsing. I take it that you agree awk can do the job for this task (even though it falls under YOUR definition of "underlanguage"). So we can stop this useless argument already, right? |
Quote:
The only place for underlanguages is systems with limited resources, like tiny embedded ones - not the case here. |
Quote:
Quote:
Quote:
|
@ Sergei & ghostdog - guys, I realise that you both believe passionately in what you have to say, but it seems that although loosely based on this question, you are more arguing with each other than helping the OP. Far be it from me to complain against either of you, as I respect both of you in your given strengths and always read the solutions that both of you post. Please let us just present the solutions we feel will work and then, as with all things on LQ, let the OP decide which option they prefer to follow :) If they are clever, they will give both their due merit, as I know this is how I have been learning whilst participating in the forum. Cheers, Grail |
Feynman - I know you provided in the first post the things you would like to achieve, and in post #17 you provided some data. Perhaps you could show what, using the data provided, your output for each and/or all steps would be?
|
Quote:
|
Quote:
|
Well, I do not know how to attach files. Please tell me how. In any case, I do not have very large files at this point. Actually, I was hoping in part I could use these scripts to feed the output of smaller files into the input of other programs--so each output would contain more information to sift through.
[Bit of background here] For my purposes, I might calculate the properties of a few small molecules in parallel, have the scripts grab some portion of the data (which would be easily identifiable based on the structure of the output file the chemistry program generates) and concatenate it into a new input file that asks for information about how they would interact. Automating this process would be wonderfully useful. I suspect "professionals" already have these scripts, and a strong knowledge of whatever language they were written in, at hand, but I am still an undergrad and have much to learn about my computational resources. I was hoping to put the final product on a website for free download and GNU usage. I suspect others like me will find it quite useful. Anyway, I can certainly copy and paste some example input files I was starting out with (these came as tests for one of the chemistry packages I am working with). Give me a second to boot up my virtual Debian. I will post it in the next reply. |
Quote:
|
Quote:
And then there's my question, which you have consistently avoided. If Perl/Python/Ruby are one day going to be called "underlanguages", are you going to advise people not to learn them? Still no answer from you? This answer will decide whether you are spouting crap or not. In your last few posts, you mentioned embedded systems and that "underlanguages" are only used in those systems. So now I ask you: is learning "underlanguages" that worthless now? |
Ignore this post. See my next post with the attachment
|
Quote:
In the category of tightly coupled text parsing and related data processing Perl/Python/Ruby are clear winners over 'awk'. |
Quote:
|
1 Attachment(s)
Here is a typical output of a quantum chemistry package (GAMESS in this case). This is actually going to be the subject of my first study.
|
Quote:
But if you wanna think, think for starters about exporting data structures from 'awk' and importing them into 'awk'. |
Quote:
http://docstore.mik.ua/orelly/perl/cookbook/ch06_09.htm . |
Hmm.. Why is this thread marked [SOLVED] - I didn't note any particular solution, and the OP is still providing information and sample files recently. Are the arguing parties still trying to help the OP here? Perhaps the debate should be pruned off to another thread, and assisting the OP can resume (assuming the thread is not actually SOLVED - is it?)
|
Might I inquire which is easiest to learn?
|
Sorry, I marked it as solved earlier--before awk was mentioned. I figured my question was too vague to be answered thoroughly.
|
Quote:
easiest_to_learn * number_of_different_easiest_to_learn product. ... I looked at your data and it looks way too disorganized to me. I.e. my sensation is that the data is generated by quite a number of ad-hoc solutions with no clear architecture.
Quote:
|
Quote:
|
Quote:
|
Funny you found that disorganized. This is a highly regarded piece of software, and every quantum chemistry package I have used outputs data in this type of way.
The .dat file (also produced from a calculation) is more condensed, but it is essentially a chunk of the log file. I figured I would sift through the log file by default just in case I want to find something that is not in the .dat file. Anyway, the key here is that there are landmarks in that gibberish. For instance, if I want the total energy of the molecule, I want the number immediately following the phrase "FINAL RHF ENERGY IS". And you will notice that the file is broken up into chunks of data. Each has its own grammar/syntax (I am mostly self-taught, so forgive me if I use some terms incorrectly) and some unique landmark denoting where it starts, and if you look carefully, there is a "-------" that comes before and after each chunk of data. Looking for things like this would be a typical task in sifting through the results of a quantum chemistry package. I wanted to keep the scripts general so they would work not just for GAMESS, but for most standard quantum chemistry packages. They all do this chunking thing. At this point, it seems that if I have scripts that can perform those five tasks (actually, only four of them are needed--the last one is just a generalization of the third one), I should be able to extract just about any portion of any output generated by these packages. There are probably exceptions I have not thought of, but this would be an excellent start. |
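For the "FINAL RHF ENERGY IS" landmark specifically, a sketch along these lines would work (the log line below imitates GAMESS output from memory, so treat the exact format as an assumption):

```shell
# Fake log fragment imitating GAMESS output (format is assumed).
cat > gamess.log <<'EOF'
 ITER    EX    TOTAL ENERGY
 FINAL RHF ENERGY IS      -76.0098170130 AFTER  12 ITERATIONS
 PROPERTY VALUES
EOF

# Find the landmark line and print the field right after "IS",
# then stop at the first hit.
awk '/FINAL RHF ENERGY IS/ {for(i=1;i<NF;i++) if($i=="IS") {print $(i+1); exit}}' gamess.log
```

This prints `-76.0098170130`. Because awk splits on whitespace, the variable column alignment in these logs doesn't matter.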
Quote:
Then processing is performed on the data structure. For modularity/extensibility data structures are exported and imported by next consumers in the data processing chain. I've dealt with huge amounts of data - be it VLSI design, static timing analysis, VLSI verification, ASIC standard library cells characterization, acoustic modeling, whatever - the approach with data structures always works and is the book approach. |
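One lightweight version of that export/import chain, using shell-sourceable key=value pairs as the interchange format (the file names and keys here are all invented; sourcing parsed output is only reasonable for trusted data):

```shell
cat > run1.log <<'EOF'
energy: -76.01
dipole: 2.14
EOF

# Producer: parse the log once and export a flat data structure.
awk -F': ' 'NF==2 {print $1 "=" $2}' run1.log > run1.env

# Consumer: import the structure instead of re-parsing the raw log.
. ./run1.env
echo "$energy $dipole"
```

Each downstream step sources `run1.env` rather than re-deriving the values, which is the modularity the post describes, in miniature.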
Quote:
For SW to be good one needs competition - as everywhere else. I do not think quantum chemistry SW is widely used, so I do not expect competition in the field. There are well known and highly regarded data formats/approaches used in scientific calculations, for example, HDF: http://www.hdfgroup.org/ . |
Ok, I will rephrase that "easiest to learn" comment
Which language has commands/functions that most naturally implement these tasks? For example: if awk had a find_the_first_word_after_this_string("Insert string here") command, or if perl had a grab_text_between_these_two_strings("string1", "string2") command, then it would be quite easy to decide which language is best suited for which task. I am ignoring performance because it seems that no consensus is coming any time soon regarding that. In any case, the fact that two senior members cannot reach a consensus about it suggests to me that awk and perl have only marginal differences in performance. Hence I place my main priority on implementation.
I am having trouble keeping up. Give me a moment to review all the posts. I missed one directly referring to GAMESS with a link to some kind of cookbook.
|
Quote:
Quote:
|