LinuxQuestions.org
Old 08-22-2010, 08:23 PM   #1
Feynman
Member
 
Registered: Aug 2010
Distribution: Gentoo
Posts: 62

Rep: Reputation: 15
Question How to extract the first word following a string


I need a bash script that can read a file, say example.txt, search for the string "This is my example string", and save whatever word/number comes immediately after it to a variable, var.

Example:
blah blah
blah This is my example string extracthere is a very nice word.
blah blah

There are two constraints:

1. This needs to assume as little as possible about the nature of the known string "This is my example string" and the word that follows it. I am trying to keep my code adaptable.

2. Speed is valuable. This script will be executed dozens if not hundreds of times, and I thought I read that some commands are faster than others.

Could someone help me devise this script?
Thanks
 
Old 08-22-2010, 08:33 PM   #2
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,986
Blog Entries: 11

Rep: Reputation: 880
Hi, welcome to LQ!

In terms of tools the best bet will be sed ... namely the s/xxx/yyy/ feature.



Cheers,
Tink
 
Old 08-22-2010, 10:50 PM   #3
Feynman
Member
 
Registered: Aug 2010
Distribution: Gentoo
Posts: 62

Original Poster
Rep: Reputation: 15
Thanks for the quick response. So I assume you are suggesting I use sed to replace "This is my example string" with "" and use awk to get the first line. I see a few forums saying sed is significantly slower than grep (just google sed slow). Is this a legitimate concern?
 
Old 08-22-2010, 11:23 PM   #4
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,986
Blog Entries: 11

Rep: Reputation: 880
Quote:
Thanks for the quick response. So I assume you are suggesting I use sed to replace "This is my example string" with "" and use awk to get the first line. I see a few forums saying sed is significantly slower than grep (just google sed slow). Is this a legitimate concern?
I'd personally just use sed in this case; can't say I've had
performance issues with sed.

And any other approach - apart from (in pseudo code):
Code:
Remember the word following my search-pattern, and
replace the whole line with it
will have to be slower, because it's going to involve
more than one tool (of course, you could use perl
or awk, but they're bigger than sed, and not necessarily
faster - of course your mileage may vary).

Code:
sed -nr "s/.*my pattern (\w+).*/\1/p" my_file

Now - while grep may be faster at simply finding
a line that matches your pattern, extracting the
word following it won't be possible w/o other
tools anyway.
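
For reference, here is a minimal sketch of that sed call run against the sample text from the first post. The temporary file is only a stand-in for the real input, and GNU sed is assumed because -r and \w are GNU extensions.

```shell
# Sample input taken from the first post; the temp file name is arbitrary.
sample=$(mktemp)
printf '%s\n' \
    'blah blah' \
    'blah This is my example string extracthere is a very nice word.' \
    'blah blah' > "$sample"

# -n suppresses sed's default output; the trailing p prints only the lines
# where the substitution succeeded. \1 is the word captured by (\w+).
var=$(sed -nr 's/.*This is my example string (\w+).*/\1/p' "$sample")

echo "$var"
rm -f "$sample"
```

The whole matching line is replaced by the captured word, so nothing else needs to be trimmed afterwards.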


Cheers,
Tink

Last edited by Tinkster; 08-22-2010 at 11:25 PM.
 
Old 08-23-2010, 03:29 AM   #5
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,562

Rep: Reputation: 1939
Or you could go all bash:
Code:
while read -r line
do
    [[ $line =~ "This is my example string" ]] && var=${line%% *}
done< file

echo "$var"
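
Note that ${line%% *} keeps the first word of the matching line. A variation that keeps the word after the search string might look like the sketch below; the variable names (needle, rest, input) and the temporary file are illustrative assumptions, not part of the original loop.

```shell
# Pure-bash sketch: find the line, cut everything up to and including the
# search string, then keep the first whitespace-delimited word of the rest.
needle='This is my example string'   # search string from the first post
input=$(mktemp)                      # stand-in for the real file
printf '%s\n' 'blah blah' \
    'blah This is my example string extracthere is a very nice word.' \
    'blah blah' > "$input"

var=
while IFS= read -r line; do
    if [[ $line == *"$needle"* ]]; then
        rest=${line#*"$needle"}      # text after the first occurrence
        read -r var _ <<< "$rest"    # first word; leading blanks are skipped
        break                        # stop at the first matching line
    fi
done < "$input"

echo "$var"
rm -f "$input"
```

Because read splits on any run of spaces or tabs, the number of blanks between the string and the word does not matter here.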
 
Old 08-23-2010, 10:03 AM   #6
Feynman
Member
 
Registered: Aug 2010
Distribution: Gentoo
Posts: 62

Original Poster
Rep: Reputation: 15
I tried testing the sed command in terminal, but it does not show the result. How do I see if it worked? I tried:
sed -nr "s/.*HEAT OF FORMATION (\w+).*/\1/p" /opt/gamess/tests/exam12.inp | echo
and
var= sed -nr "s/.*HEAT OF FORMATION (\w+).*/\1/p" /opt/gamess/tests/exam12.inp
echo $var

but neither gave me a result--just a blank line. The same goes for the while loop. I think the problem is the word following HEAT OF FORMATION is separated by several spaces from the next word.

This will be quite common for my purposes (I am trying to extract data from output files.)

Last edited by Feynman; 08-23-2010 at 10:09 AM.
 
Old 08-23-2010, 10:11 AM   #7
crts
Senior Member
 
Registered: Jan 2010
Posts: 1,604

Rep: Reputation: 446
Hi,

try it this way
Code:
var=$(sed -nr "s/.*HEAT OF FORMATION (\w+).*/\1/p" /opt/gamess/tests/exam12.inp)
Notice that there is no space after '='.
[EDIT]
Can you confirm that the character after the search pattern will always be a space and not, e.g., a tab character?
If so, you can try
Code:
sed -nr "s/.*${pattern}[      ]*(\w+).*/\1/p" input.file > output.file
The big space inside [] is a normal space followed by a tab.
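
To illustrate why the space after '=' matters, here is a small sketch with a hypothetical one-line input file. With the space, bash treats "var=" as a temporary, empty environment variable for the sed command and stores nothing in the shell. Piping into echo does not help either, because echo ignores its standard input.

```shell
# Hypothetical one-line input standing in for the real file.
data=$(mktemp)
printf 'HEAT OF FORMATION 42\n' > "$data"
unset var

# With a space after '=', sed runs with an empty "var" in its environment.
# sed prints the match to the terminal, but the shell variable stays unset.
var= sed -nr 's/.*HEAT OF FORMATION (\w+).*/\1/p' "$data"
printf '%s\n' "after 'var= sed': [$var]"

# Command substitution captures sed's output in the variable.
var=$(sed -nr 's/.*HEAT OF FORMATION (\w+).*/\1/p' "$data")
printf '%s\n' "after 'var=\$(sed)': [$var]"

rm -f "$data"
```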

Last edited by crts; 08-23-2010 at 10:26 AM. Reason: code tags
 
Old 08-23-2010, 10:21 AM   #8
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,562

Rep: Reputation: 1939
Well I am not really surprised that my loop did not work seeing as you completely changed which word you were looking for??

Your example shows that you wished the word prior to the string, whereas your post #6 shows you want the word after.
Kind of a big difference. Sed is probably more suited to what you want but bash could still do it if you show me the correct
input and the actual output required?
 
Old 08-23-2010, 10:47 AM   #9
Feynman
Member
 
Registered: Aug 2010
Distribution: Gentoo
Posts: 62

Original Poster
Rep: Reputation: 15
I tried
var=$(sed -nr "s/.*HEAT OF FORMATION (\w+).*/\1/p" /opt/gamess/tests/exam12.inp)
echo $var
I am pretty sure it is just a bunch of spaces because I can move my cursor through each of them.
How do I type a tab character in terminal? Pressing tab does not work. I tried copy-pasting your code:
sed -nr "s/.*${HEAT OF FORMATION}[ ]*(\w+).*/\1/p" /opt/gamess/tests/exam12.inp > test2
but terminal complained:
bash: s/.*${HEAT OF FORMATION}[ ]*(\w+).*/\1/p: bad substitution
I got the same thing when putting HEAT OF FORMATION in quotes.

The correct output should be 105.14088

I tried typing the input but the forum seems to automatically take out extra spaces (obviously crts found a way around that). Here is what the particular line would look like:
![3 spaces]Heat OF FORMATION[9 spaces]105.14088[3 spaces]93.45997[3 spaces]46.89387

But I would rather not have the script explicitly depend on the number of spaces because I was hoping to use it for other forms of files (generated by different programs).

Last edited by Feynman; 08-23-2010 at 10:53 AM. Reason: tried the tab thing
 
Old 08-23-2010, 11:02 AM   #10
Feynman
Member
 
Registered: Aug 2010
Distribution: Gentoo
Posts: 62

Original Poster
Rep: Reputation: 15
I am running Debian Lenny on a virtual machine using virtual box with a mac Leopard as the host. I say that on the off chance that would make any difference.
 
Old 08-23-2010, 11:12 AM   #11
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,245
Blog Entries: 15

Rep: Reputation: 233
(.*) should be \(.*\)... inside double quotes, it should be "\\(\\)"

Last edited by konsolebox; 08-23-2010 at 11:13 AM.
 
Old 08-23-2010, 11:41 AM   #12
Feynman
Member
 
Registered: Aug 2010
Distribution: Gentoo
Posts: 62

Original Poster
Rep: Reputation: 15
I am sorry, but which .* are you referring to? I understand the basics of text parsing with cat, sed, awk, tr, and grep, but I do not have any idea how to use all these symbols: # % ^ * / \ and their numerous combinations. They are never mentioned in the manual. If you know of any tutorials that could teach me about all these symbols, please let me know.
 
Old 08-23-2010, 12:03 PM   #13
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,562

Rep: Reputation: 1939
Try this:
Code:
 var=$(sed -nr 's/.*HEAT OF FORMATION +([^ ]+) .*/\1/p' file)
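
A quick check of the difference between (\w+) and ([^ ]+), as a sketch. The input line is reconstructed from the description in post #9, and the all-caps "HEAT" is an assumption made so that the search string matches. GNU sed is assumed.

```shell
# Line reconstructed from post #9: 3 spaces, the string, 9 spaces, numbers.
out=$(mktemp)
printf '!   HEAT OF FORMATION         105.14088   93.45997   46.89387\n' > "$out"

# \w+ stops at the decimal point because '.' is not a word character.
short=$(sed -nr 's/.*HEAT OF FORMATION +(\w+).*/\1/p' "$out")

# [^ ]+ takes everything up to the next space, so the full number survives.
var=$(sed -nr 's/.*HEAT OF FORMATION +([^ ]+) .*/\1/p' "$out")

printf '%s\n' "short=$short" "var=$var"
rm -f "$out"
```

One caveat: the second expression requires a space after the captured word, so it would not match if the word were the last thing on the line.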
 
Old 08-23-2010, 12:08 PM   #14
crts
Senior Member
 
Registered: Jan 2010
Posts: 1,604

Rep: Reputation: 446
Quote:
Originally Posted by Feynman View Post
I tried
var=$(sed -nr "s/.*HEAT OF FORMATION (\w+).*/\1/p" /opt/gamess/tests/exam12.inp)
echo $var
I am pretty sure it is just a bunch of spaces because I can move my cursor through each of them.
How do I type a tab character in terminal? Pressing tab does not work. I tried copy-pasting your code:
sed -nr "s/.*${HEAT OF FORMATION}[ ]*(\w+).*/\1/p" /opt/gamess/tests/exam12.inp > test2
but terminal complained:
bash: s/.*${HEAT OF FORMATION}[ ]*(\w+).*/\1/p: bad substitution
I got the same thing when putting HEAT OF FORMATION in quotes.

The correct output should be 105.14088

I tried typing the input but the forum seems to automatically take out extra spaces (obviously crts found a way around that). Here is what the particular line would look like:
![3 spaces]Heat OF FORMATION[9 spaces]105.14088[3 spaces]93.45997[3 spaces]46.89387

But I would rather not have the script explicitly depend on the number of spaces because I was hoping to use it for other forms of files (generated by different programs).
Ok, since you previously stated it should be as general as possible, I used a variable to hold the pattern. Sorry I didn't make that clear.
Code:
pattern="HEAT OF FORMATION"
sed -nr "s/.*${pattern}[      ]*(\w+).*/\1/p" /opt/gamess/tests/exam12.inp > test2
Now the use of ${pattern} will be expanded to
HEAT OF FORMATION
by the shell. The [ ]* statement will take any number of spaces and tabs into account. So this should be no problem. Just assign your search pattern to the variable pattern as described above before you issue the sed command.
@konsolebox: I am also not quite sure what you mean. Since sed is called with the -r switch for extended regex, there should be no need to escape the parentheses.
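
On the "bad substitution" error: ${HEAT OF FORMATION} asks bash to expand a variable whose name contains spaces, which is not a valid name. The text has to be stored in a variable first. The sketch below also sidesteps typing a literal tab by using the POSIX class [[:space:]]. The input line is hypothetical, and -E is used as the more portable spelling of -r.

```shell
# Hypothetical input with a space, a tab and more spaces after the string.
report=$(mktemp)
printf '!   HEAT OF FORMATION \t  105.14088   93.45997\n' > "$report"

pattern='HEAT OF FORMATION'   # plain text; must not contain '/' or regex metacharacters

# ${pattern} is expanded by the shell inside the double quotes before sed runs.
# [[:space:]]+ matches any run of spaces and tabs, and [^[:space:]]+ captures
# the next whitespace-delimited word, decimal point included.
var=$(sed -nE "s/.*${pattern}[[:space:]]+([^[:space:]]+).*/\1/p" "$report")

echo "$var"
rm -f "$report"
```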
 
Old 08-23-2010, 12:18 PM   #15
Feynman
Member
 
Registered: Aug 2010
Distribution: Gentoo
Posts: 62

Original Poster
Rep: Reputation: 15
ALMOST!! That gave me the first two words that follow "HEAT OF FORMATION". But then when I try "HEAT OF" (neither of these has quotes when I enter them in the terminal, of course) I get
FORMATION FORMATION

????
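
One possible explanation, offered only as a guess since the real file is not shown: if the search string occurs on more than one line, sed prints one result per line, and an unquoted echo $var folds the newline into a space. The sketch below reproduces that with a hypothetical two-line file and shows one way to stop after the first match.

```shell
# Hypothetical file in which the search string occurs on two lines.
log=$(mktemp)
printf '%s\n' \
    'HEAT OF FORMATION   105.14088' \
    'HEAT OF FORMATION   93.45997' > "$log"

re='s/.*HEAT OF[[:space:]]+([^[:space:]]+).*/\1/p'

all=$(sed -nE "$re" "$log")                  # one result per matching line
first=$(sed -nE "/HEAT OF/{$re;q;}" "$log")  # q makes sed quit after the first match

echo $all        # unquoted on purpose: the newline becomes a space
echo "$first"
rm -f "$log"
```

Quitting early with q also avoids reading the rest of a large file, which may help with the speed concern raised in the first post.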
 
  



All times are GMT -5.
