[SOLVED] How to extract the first word following a string
I need a bash script that can read a file (say, example.txt), search for the string "This is my example string", and save whatever word or number comes immediately after it to a variable, var.
Example:
blah blah
blah This is my example string extracthere is a very nice word.
blah blah
There are two constraints:
1. This needs to assume as little as possible about the nature of the known string "This is my example string" and the word that follows it. I am trying to keep my code adaptable.
2. Speed is valuable. This script will be executed dozens if not hundreds of times. I thought I read that some commands are faster than others.
Thanks for the quick response. So I assume you are suggesting I use sed to replace "This is my example string" with "" and use awk to get the first line. I see a few forums saying sed is significantly slower than grep (just google sed slow). Is this a legitimate concern?
I'd personally just use sed in this case; can't say I've had
performance issues with sed.
And any other approach - apart from (in pseudo code):
Code:
Remember the word following my search-pattern, and
replace the whole line with it
will have to be slower, because it's going to involve
more than one tool (of course, you could use perl
or awk, but they're bigger than sed, and not necessarily
faster - of course your mileage may vary).
Code:
sed -nr "s/.*my pattern (\w+).*/\1/p" my_file
Now - while grep may be faster at simply finding
a line that matches your pattern, extracting the
word following it won't be possible w/o other
tools anyway.
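As a quick check of the one-liner above, here is a minimal sketch that runs it against the three-line example from the original question. It assumes GNU sed (for -r and \w); the temporary file and the variable names sample and word are only for illustration.

```shell
# Recreate the three-line example from the original question in a temp file.
sample=$(mktemp)
printf '%s\n' \
  'blah blah' \
  'blah This is my example string extracthere is a very nice word.' \
  'blah blah' > "$sample"

# -n suppresses sed's default output; the trailing p prints only the lines
# where the substitution succeeded. \1 is the word captured by (\w+).
word=$(sed -nr 's/.*This is my example string (\w+).*/\1/p' "$sample")
echo "$word"
rm -f "$sample"
```

Lines that do not contain the search string are never printed, so word holds only the captured word.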
I tried testing the sed command in terminal, but it does not show the result. How do I see if it worked? I tried:
sed -nr "s/.*HEAT OF FORMATION (\w+).*/\1/p" /opt/gamess/tests/exam12.inp | echo
and
var= sed -nr "s/.*HEAT OF FORMATION (\w+).*/\1/p" /opt/gamess/tests/exam12.inp
echo $var
but neither gave me a result--just a blank line. The same goes for the while loop. I think the problem is the word following HEAT OF FORMATION is separated by several spaces from the next word.
This will be quite common for my purposes (I am trying to extract data from output files.)
var=$(sed -nr "s/.*HEAT OF FORMATION (\w+).*/\1/p" /opt/gamess/tests/exam12.inp)
Notice that there is no space after '='.
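To illustrate why the space matters, here is a small sketch that uses echo as a stand-in for the sed command. With a space after '=', bash treats var= as a temporary, empty assignment for that one command and captures nothing.

```shell
unset var

# With a space after '=', bash runs the command with var set to an empty
# string in that command's environment only; the output is not captured.
var= echo "printed, not captured"
echo "var is: '${var}'"

# Command substitution stores the command's standard output in var.
var=$(echo "captured")
echo "var is: '${var}'"
```

The first echo of var shows an empty value; the second shows the captured text. Piping into echo does not help either, because echo ignores its standard input.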
[EDIT]
Can you confirm that the character after the search pattern will always be a space and not, e.g., a tab character?
If so, you can try
Code:
sed -nr "s/.*${pattern}[ ]*(\w+).*/\1/p" input.file > output.file
The big space inside [] is a normal space followed by a tab.
Last edited by crts; 08-23-2010 at 10:26 AM.
Reason: code tags
Well I am not really surprised that my loop did not work seeing as you completely changed which word you were looking for??
Your example shows that you wished the word prior to the string, whereas your post #6 shows you want the word after.
Kind of a big difference. Sed is probably more suited to what you want but bash could still do it if you show me the correct
input and the actual output required?
I tried
var=$(sed -nr "s/.*HEAT OF FORMATION (\w+).*/\1/p" /opt/gamess/tests/exam12.inp)
echo $var
I am pretty sure it is just a bunch of spaces because I can move my cursor through each of them.
How do I type a tab character in terminal? Pressing tab does not work. I tried copy-pasting your code:
sed -nr "s/.*${HEAT OF FORMATION}[ ]*(\w+).*/\1/p" /opt/gamess/tests/exam12.inp > test2
but terminal complained:
bash: s/.*${HEAT OF FORMATION}[ ]*(\w+).*/\1/p: bad substitution
I got the same thing when putting HEAT OF FORMATION in quotes.
The correct output should be 105.14088
I tried typing the input, but the forum seems to automatically take out extra spaces (obviously crts found a way around that). Here is what the particular line would look like:
![3 spaces]Heat OF FORMATION[9 spaces]105.14088[3 spaces]93.45997[3 spaces]46.89387
But I would rather not have the script explicitly depend on the number of spaces because I was hoping to use it for other forms of files (generated by different programs).
Last edited by Feynman; 08-23-2010 at 10:53 AM.
Reason: tried the tab thing
I am running Debian Lenny on a virtual machine using virtual box with a mac Leopard as the host. I say that on the off chance that would make any difference.
I am sorry, but which .* are you referring to? I understand the basics of text parsing with cat, sed, awk, tr, and grep, but I do not have any idea how to use all these symbols: # % ^ * / \ and their numerous combinations. They are never mentioned in the manual. If you know of any tutorials that could teach me about all these symbols, please let me know.
Ok, since you previously stated it should be as general as possible, I used a variable to hold the pattern. Sorry I didn't make that clear.
Code:
pattern="HEAT OF FORMATION"
sed -nr "s/.*${pattern}[ ]*(\w+).*/\1/p" /opt/gamess/tests/exam12.inp > test2
Now the use of ${pattern} will be expanded to
HEAT OF FORMATION
by the shell. The [ ]* statement will take any number of spaces and tabs into account. So this should be no problem. Just assign your search pattern to the variable pattern as described above before you issue the sed command.
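If typing a literal tab inside the brackets is awkward, a variation on the same idea is to use POSIX character classes. The sketch below is an assumption-laden illustration rather than the command posted above: it assumes GNU sed, an upper-case label, and the line layout described earlier in the thread (3 spaces, the label, 9 spaces, then the numbers). Because \w matches only letters, digits, and underscores, it would stop at the decimal point and return 105, so the capture here is [^[:space:]]+ instead.

```shell
sample=$(mktemp)
# Line layout described in the thread: '!', 3 spaces, the label, 9 spaces,
# then three numbers separated by 3 spaces each.
printf '!   HEAT OF FORMATION         105.14088   93.45997   46.89387\n' > "$sample"

pattern="HEAT OF FORMATION"
# [[:space:]]* skips any run of spaces or tabs after the pattern;
# [^[:space:]]+ captures everything up to the next whitespace character,
# so the decimal point stays part of the captured value.
value=$(sed -nr "s/.*${pattern}[[:space:]]*([^[:space:]]+).*/\1/p" "$sample")
echo "$value"
rm -f "$sample"
```

This does not depend on how many spaces or tabs separate the label from the value.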
@konsolebox: I am also not quite sure what you mean. Since sed uses the -r switch for extended regex there should be no need to escape the braces.
ALMOST!! That gave me the first two words that follow "HEAT OF FORMATION". But then when I try "HEAT OF" (neither of these has quotes when I enter them in the terminal, of course) I get
FORMATION FORMATION
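One possible explanation, offered only as a guess, is that the file contains two lines matching the pattern, so sed prints one captured word per line and an unquoted echo joins them with a space. The sketch below reproduces that with hypothetical two-line input (the second line and its value are made up) and shows one way to keep only the first match using sed's q command. It assumes GNU sed.

```shell
sample=$(mktemp)
# Hypothetical input with two matching lines, to reproduce a doubled result.
printf '%s\n' \
  '!   HEAT OF FORMATION         105.14088   93.45997' \
  '    HEAT OF FORMATION         -12.50000' > "$sample"

pattern="HEAT OF"
# Every matching line prints one captured word, so two lines give two words.
all=$(sed -nr "s/.*${pattern}[[:space:]]*([^[:space:]]+).*/\1/p" "$sample")
# Act only on lines containing the pattern: print the capture, then quit (q),
# so only the first match is reported and the rest of the file is not read.
first=$(sed -nr "/${pattern}/{s/.*${pattern}[[:space:]]*([^[:space:]]+).*/\1/p;q;}" "$sample")
echo $all        # unquoted on purpose: the newline collapses to a space
echo "$first"
rm -f "$sample"
```

Quitting at the first match also avoids scanning the rest of the file, which may help with the speed constraint in the original question.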