LinuxQuestions.org
Old 05-25-2013, 10:27 AM   #1
mddnix
Member
 
Registered: Mar 2013
Location: Bangalore, India
Distribution: Redhat, Arch, Ubuntu
Posts: 512

Rep: Reputation: 139
RegEx in grep confusion...


Hi

I always get confused when using regular expressions. I am new to shell scripting, and whenever I think I have acquired a reasonable understanding of it, regex confuses me and I forget everything.

Now I am back to the basics of regex AGAIN...

Here is the search text: (From Unix Shells By Example 4th Ed)
Code:
$ cat datafile
northwest       NW      Charles Main            3.0     .98     3       34
western         WE      Sharon Gray             5.3     .97     5       23
southwest       SW      Lewis Dalsass           2.7     .8      2       18
southern        SO      Suan Chin               5.1     .95     4       15
southeast       SE      Patricia Hemenway       4.0     .7      4       17
eastern         EA      TB Savage               4.4     .84     5       20
northeast       NE      AM Main Jr.             5.1     .94     3       13
north           NO      Margot Weber            4.5     .89     5       9
central         CT      Ann Stephens            5.7     .94     5       13
I want to find just the word that starts with 's' and ends with 'n'. In this example it is 'southern'.

I tried these...
Code:
$ grep '\<s.*n\>' datafile
southern        SO      Suan Chin                       5.1     .95     4       15

$ grep -w 's.*n' datafile
southern        SO      Suan Chin                       5.1     .95     4       15

$ grep '\bs.*n\b' datafile
southern        SO      Suan Chin                       5.1     .95     4       15
The result I get is "southern SO Suan Chin". I just want "southern" as the result.

What am I doing wrong?

Thanks

Edit: In this case, I know that '\<s\w*n\>' or '\bs\w*n\b' will return 'southern', but I think it will not return the proper word in every case: if the word is 'south#ern', for example, it will fail.

Last edited by mddnix; 05-25-2013 at 11:48 AM.
 
Old 05-25-2013, 11:14 AM   #2
jlinkels
Senior Member
 
Registered: Oct 2003
Location: Bonaire
Distribution: Debian Wheezy/Jessie/Sid, Linux Mint DE
Posts: 4,493

Rep: Reputation: 635
You are mixing matching and output.

Regular expressions are greedy, so the regular expression s.*n matches southern SO Suan Chin: it runs from the first s to the last n it can reach. A line such as southern foo bar would have been selected as well.

With \bs.*n\b you made sure that the match would be on southern, and not on southern SO Suan Chin.

But it doesn't make any difference, because grep outputs the entire line where a match is found. You will always get the entire line, not the matching word.
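The difference between selecting a line and printing the match can be sketched as follows. This is a minimal illustration, assuming GNU grep (for \b and -o); the sample line is a condensed, hypothetical version of the southern row from the datafile.

```shell
# Hypothetical one-line sample, condensed from the datafile above.
line='southern SO Suan Chin 5.1 .95 4 15'

# Without -o, grep prints the whole line that contains a match.
whole=$(printf '%s\n' "$line" | grep '\bs.*n\b')

# With -o, only the matched text is printed. The greedy .* still runs
# from the first "s" to the last "n" that ends a word.
matched=$(printf '%s\n' "$line" | grep -o '\bs.*n\b')

printf 'whole line: %s\n' "$whole"
printf 'match only: %s\n' "$matched"
```

The second result makes the actual extent of the match visible, which plain grep output hides.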

You'd better use sed for this:
Code:
sed -n 's/\(^s[a-z]*n\).*/\1/p' file.txt
What it does:
-n: suppress sed's automatic printing of every line
\(...\): create a group whose match is referred to by \1
^: only the first word
s[a-z]*n: an 's', followed by any number of characters a-z, ending with 'n'
.*: match the rest of the line as well, so that all of it gets replaced
\1: replace everything matched with just the first group
p: print the line only if the substitution was made
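A small sketch of how that sed command behaves, under some assumptions: the sample rows are hypothetical ("sandy" is not in the original datafile), and the stricter variant relies on \>, an end-of-word anchor that GNU sed supports but that is not part of POSIX sed.

```shell
# Hypothetical sample rows; only the first two come from the datafile.
sample=$(printf '%s\n' \
  'southern SO Suan Chin' \
  'southwest SW Lewis Dalsass' \
  'sandy XX Test Row')

# The command from the post: nothing forces the "n" to end the word,
# so "sandy" is reported as "san".
loose=$(printf '%s\n' "$sample" | sed -n 's/\(^s[a-z]*n\).*/\1/p')

# Assumed refinement: require an end-of-word after the "n" (GNU sed).
strict=$(printf '%s\n' "$sample" | sed -n 's/^\(s[a-z]*n\)\>.*/\1/p')

printf 'loose:\n%s\n\n' "$loose"
printf 'strict:\n%s\n' "$strict"
```

The loose form prints southern and san, while the strict form prints only southern.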

Please remember that I am not a sed expert; there are members here who can output a Shakespeare sonnet with 20 characters of sed code.

Study this part in case you want to understand it, or modify your match:
http://www.grymoire.com/Unix/Sed.html#uh-4
(This is the best sed guide ever I have found)

jlinkels

Last edited by jlinkels; 05-25-2013 at 11:17 AM.
 
1 members found this post helpful.
Old 05-25-2013, 11:36 AM   #3
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,253

Rep: Reputation: 2686
.* is greedy, so try thinking of the negative before your last character and then adding it.

Also, by default grep returns the line on which your match was found and not just the match. Try the -o option
 
Old 05-25-2013, 11:37 AM   #4
mddnix
Member
 
Registered: Mar 2013
Location: Bangalore, India
Distribution: Redhat, Arch, Ubuntu
Posts: 512

Original Poster
Rep: Reputation: 139
@jlinkels thanks for the reply.

Quote:
With \bs.*n\b you made sure that the match would be on southern, and not on southern SO Suan Chin.

But it doesn't make any difference, because grep outputs the entire line where a match is found. You will always get the entire line, not the matching word.
It does not match 'southern' but instead matches 'southern SO Suan Chin' in full.

Code:
$ grep --color=always '\bs.*n\b' datafile
southern        SO      Suan Chin                       5.1     .95     4       15
Quote:
Study this part in case you want to understand it, or modify your match:
http://www.grymoire.com/Unix/Sed.html#uh-4
(This is the best sed guide ever I have found)
Thanks for the link. It's truly great. I'll definitely go through it.

Thanks

Last edited by mddnix; 05-25-2013 at 11:46 AM.
 
Old 05-25-2013, 11:42 AM   #5
mddnix
Member
 
Registered: Mar 2013
Location: Bangalore, India
Distribution: Redhat, Arch, Ubuntu
Posts: 512

Original Poster
Rep: Reputation: 139
@grail

Code:
$ grep -o '\bs.*n\b' datafile
southern        SO      Suan Chin
Still no luck...

Thanks
 
Old 05-25-2013, 01:18 PM   #6
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,253

Rep: Reputation: 2686
You only read half my reply. Try reading the first part again.
 
Old 05-25-2013, 01:40 PM   #7
mddnix
Member
 
Registered: Mar 2013
Location: Bangalore, India
Distribution: Redhat, Arch, Ubuntu
Posts: 512

Original Poster
Rep: Reputation: 139
Quote:
Originally Posted by grail
.* is greedy, so try thinking of the negative before your last character and then adding it.
I did read it... but if .* is greedy and picks up everything, then what is the alternative?

Let me rephrase my question. Forget grep; let's assume I have a file (some text storybook) that contains words like the following, in any order.

Quote:
salesman
sanction
siren
spoon
sworn
spiderman
sixteen
sin
sweet16on
sweat'in
I want to return the lines that contain the words above.

What search pattern should I give?

Last edited by mddnix; 05-25-2013 at 01:47 PM.
 
Old 05-25-2013, 02:01 PM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,253

Rep: Reputation: 2686
Code:
grep -o '\bs[^n]*n\b' file
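A hedged sketch of how the negated class behaves on the word list from the previous post, assuming GNU grep. The [^[:space:]] variant and the phrase 'sad to listen' are illustrative additions, not something proposed in the thread.

```shell
# The word list from the post above, one word per line.
words=$(printf '%s\n' salesman sanction siren spoon sworn spiderman \
  sixteen sin sweet16on "sweat'in")

# [^n]* cannot step over an "n", so a word with an inner "n" is missed.
by_negated_n=$(printf '%s\n' "$words" | grep -o '\bs[^n]*n\b')

# Assumed alternative: allow any non-whitespace character between "s" and "n".
by_non_space=$(printf '%s\n' "$words" | grep -o '\bs[^[:space:]]*n\b')

# [^n] also matches spaces, so a match can run across several words.
across=$(printf '%s\n' 'sad to listen' | grep -o '\bs[^n]*n\b')

printf 'negated n:\n%s\n\n' "$by_negated_n"
printf 'non-space:\n%s\n\n' "$by_non_space"
printf 'across words: %s\n' "$across"
```

With this input the [^n] pattern returns every word except sanction, while the non-whitespace variant returns all ten. Which one fits depends on whether the text can contain words with an inner n.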
 
1 members found this post helpful.
Old 05-25-2013, 02:19 PM   #9
mddnix
Member
 
Registered: Mar 2013
Location: Bangalore, India
Distribution: Redhat, Arch, Ubuntu
Posts: 512

Original Poster
Rep: Reputation: 139
Code:
'\bs[^n]*n\b'
Great!!! This did the job.

I truly appreciate your time and effort on this.
 
Old 05-26-2013, 08:23 AM   #10
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1957
Incidentally, GNU grep also offers the -P option, which allows the use of Perl-Compatible Regular Expressions. And in PCRE, you can turn off greediness by following the normally-greedy token with a question mark. Then it will return the shortest possible match from the starting point instead of the longest.

Code:
grep -oP '\bs.*?n\b' infile
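To make the greedy/non-greedy difference concrete, here is a minimal comparison. It assumes a GNU grep built with PCRE support for -P; the sample line is condensed from the datafile and 'sad to listen' is a hypothetical phrase.

```shell
line='southern SO Suan Chin 5.1 .95 4 15'

# Greedy: extends to the last "n" that ends a word.
greedy=$(printf '%s\n' "$line" | grep -o '\bs.*n\b')

# Non-greedy (PCRE only): stops at the first "n" that ends a word.
lazy=$(printf '%s\n' "$line" | grep -oP '\bs.*?n\b')

# Caveat: .*? still accepts spaces, so it can span several words
# when the first word does not itself end in "n".
spans=$(printf '%s\n' 'sad to listen' | grep -oP '\bs.*?n\b')

printf 'greedy: %s\n' "$greedy"
printf 'lazy:   %s\n' "$lazy"
printf 'spans:  %s\n' "$spans"
```

Here the greedy form yields "southern SO Suan Chin" and the non-greedy form yields "southern", but the last case shows that non-greedy matching alone does not confine the match to one word.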
And here's a version you can use that will only match whole words that start with s and end with n, even if there's another n inside them (e.g. 'sundown').

Code:
grep -oE '\bs\w*n\b' infile
'\w' is a synonym for a "word" character, that is "a-zA-Z0-9_". So it will fail to match if there are any non-word characters between the s and n.
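A short illustration of what \w does and does not accept, assuming GNU grep; the sample words are hypothetical.

```shell
# Hypothetical sample words, one per line.
sample=$(printf '%s\n' sundown south9ern 'south#ern' sunny)

# Only whole words made of word characters, starting with "s" and
# ending with "n", are printed.
hits=$(printf '%s\n' "$sample" | grep -oE '\bs\w*n\b')

printf '%s\n' "$hits"
```

This prints sundown and south9ern. south#ern is rejected because # is not a word character, and sunny because it does not end in n.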

You could also do some clever stuff with \B, which is the inverse of \b, and will only match the zero-width space between two word (or two non-word) characters.


Finally, though, since grep operates in a line-wise fashion, even with the -o option it will still output every matching substring on the line, if there are multiples. You can't make it stop at the first instance (unless the expression is anchored with ^/$), although you can make it stop after the first line, with the -m option.
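The effect of -o and -m on lines with several matches might look like this. It is a sketch assuming GNU grep; the two input lines are hypothetical, and piping through head is an extra workaround not mentioned in the post.

```shell
# Hypothetical two-line input with two matches on the first line.
text=$(printf '%s\n' 'siren and spoon' 'sworn in')

# -o prints every match, one per output line.
all=$(printf '%s\n' "$text" | grep -o '\bs\w*n\b')

# -m 1 stops after the first matching LINE, not the first match.
first_line=$(printf '%s\n' "$text" | grep -o -m 1 '\bs\w*n\b')

# Assumed workaround: keep only the first match overall with head.
first_only=$(printf '%s\n' "$text" | grep -o '\bs\w*n\b' | head -n 1)

printf 'all:\n%s\n\n' "$all"
printf 'first line:\n%s\n\n' "$first_line"
printf 'first only: %s\n' "$first_only"
```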

For finer control when working with column-delimited text like this, you'll probably want to use awk instead.

Code:
awk '$1 ~ /^s.*n$/ { print $1 }' infile
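A runnable sketch of the awk approach on a condensed, hypothetical copy of a few datafile rows. The loop over all fields is an added variation for free-form text, not part of the command above.

```shell
# Condensed copies of four rows from the datafile.
data=$(printf '%s\n' \
  'northwest NW Charles Main' \
  'southwest SW Lewis Dalsass' \
  'southern SO Suan Chin' \
  'southeast SE Patricia Hemenway')

# Test only the first column; ^ and $ anchor to the field, not the line.
first_field=$(printf '%s\n' "$data" | awk '$1 ~ /^s.*n$/ { print $1 }')

# Assumed variation: test every whitespace-separated field on each line.
any_field=$(printf '%s\n' 'siren and spoon' |
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^s.*n$/) print $i }')

printf 'first field: %s\n' "$first_field"
printf 'any field:\n%s\n' "$any_field"
```

Because awk has already split the line into fields, greediness is no longer a concern: .* cannot reach beyond the field being tested.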
 
1 members found this post helpful.
Old 05-26-2013, 09:52 AM   #11
mddnix
Member
 
Registered: Mar 2013
Location: Bangalore, India
Distribution: Redhat, Arch, Ubuntu
Posts: 512

Original Poster
Rep: Reputation: 139
@David the H.

Thanks for answering in detail.

I found '\bs.*?n\b' very helpful, as it returns all the words between 's' and 'n', like 'southern', 'south9ern' and 'south#ern', whereas '\bs\w*n\b' will only return 'southern' and 'south9ern', which is natural as \w = [a-zA-Z0-9_].

Also, both -E and -P return the same results, and the man page says -P is experimental, so I find -E safer to use.
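Whether -E and -P agree depends on the pattern, so it may be worth checking both. The sketch below assumes GNU grep with PCRE support. The behaviour of .*? under -E is not specified by POSIX; GNU grep appears to treat the ? as an ordinary "optional" operator there, so the match stays greedy.

```shell
line='southern SO Suan Chin 5.1 .95 4 15'

# With \w* the two engines agree.
w_E=$(printf '%s\n' "$line" | grep -oE '\bs\w*n\b')
w_P=$(printf '%s\n' "$line" | grep -oP '\bs\w*n\b')

# With .*? only -P treats the "?" as a non-greedy modifier.
lazy_P=$(printf '%s\n' "$line" | grep -oP '\bs.*?n\b')
lazy_E=$(printf '%s\n' "$line" | grep -oE '\bs.*?n\b')

printf '\\w*  -E: %s\n' "$w_E"
printf '\\w*  -P: %s\n' "$w_P"
printf '.*?  -P: %s\n' "$lazy_P"
printf '.*?  -E: %s\n' "$lazy_E"
```

If the last two lines differ on your system, the non-greedy form needs -P, and '\bs\w*n\b' or a bracket expression is the portable choice for -E.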
 
Old 05-26-2013, 08:07 PM   #12
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,240

Rep: Reputation: 2324
I would disagree slightly with this definition
Code:
^: only the first word
'^' means anchor the match to start at the beginning of the string being tested (for grep and sed, the beginning of each line).
This is the book if you really want to know regex: http://regex.info/book.html - highly recommended
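A quick check of the difference between "first word" and "beginning of the line", using two hypothetical input lines, one of them indented:

```shell
# Two hypothetical lines; the second starts with two spaces.
input=$(printf '%s\n' 'southern SO' '  southern SO')

# Anchored: the "s" must be the very first character of the line.
anchored=$(printf '%s\n' "$input" | grep -c '^s[a-z]*n')

# Unanchored: the match may start anywhere in the line.
unanchored=$(printf '%s\n' "$input" | grep -c 's[a-z]*n')

printf 'anchored lines:   %s\n' "$anchored"
printf 'unanchored lines: %s\n' "$unanchored"
```

The indented line still has southern as its first word, yet the anchored pattern counts only one line, because ^ refers to position, not to words.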
 
  

