LinuxQuestions.org
Old 02-01-2011, 12:32 PM   #1
luvshines
Member
 
Registered: Apr 2009
Posts: 74

Rep: Reputation: 16
Confusion while using character sets in egrep


Hi

I have this confusion on how exactly character sets work in REGEX patterns.

From what I read:
Code:
Character Classes

Character classes indicate a set of characters to match. Enclosing a set of characters in square brackets "[...]" means "match any one of these characters". For example:

Regex:
[cbe]at
Matches:
cat
bat
eat
Doesn't Match:
sat
beat
Since a character class on its own only applies to one character in the match, combine it with a quantifier to search for multiple instances of the class.
So I thought it would match only a single character. I tried this:
Code:
echo "1000" | egrep "[01]"
But this egrep pattern seems to match all four characters in 1000. I thought it would match only the first 1 it finds, since a character class has to match a single character.

Why is it matching all 4 of them?

Last edited by luvshines; 02-01-2011 at 12:33 PM. Reason: Typos
 
Old 02-01-2011, 01:11 PM   #2
xeleema
Member
 
Registered: Aug 2005
Location: D.i.t.h.o, Texas
Distribution: Slackware 13.x, rhel3/5, Solaris 8-10(sparc), HP-UX 11.x (pa-risc)
Posts: 988
Blog Entries: 4

Rep: Reputation: 254
Greetingz!
That regex is going to give you the whole line it finds the match in. It's not limited to the beginning characters because the pattern isn't anchored to the beginning of the line.

P.S: This looks like a homework question...keep the LQ rules in mind.
 
Old 02-01-2011, 03:17 PM   #3
luvshines
Member
 
Registered: Apr 2009
Posts: 74

Original Poster
Rep: Reputation: 16
But I was under the impression that pattern matching starts from the left and returns the first match only.
I just googled it and found a link that says something similar: http://www.regular-expressions.info/engine.html

Is this not the behavior?

P.S: It is certainly not a homework question. I am working on a script (which will act as a Samba plugin) and I have just posted the part of my script where I am stuck/confused.
 
Old 02-01-2011, 06:00 PM   #4
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Given the egrep invocation you chose - how did you ascertain that it
matches all 4?

If you want a match to only match the first numeric character on a line
you'd have to go about it quite differently.

Code:
egrep "^[^01]*[01]"
But egrep will still return the whole LINE that matches.

To only get the first character that matches you'd probably
want to use sed:

Code:
sed -r 's/^[^01]*([01]).*/\1/'
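A quick way to try the sed approach, as a rough sketch only. The function name first_match and the sample inputs are made up for illustration. It uses -E (the more portable spelling of -r), a trailing .* so the rest of the line is dropped, and -n with the p flag so that lines containing no 0 or 1 print nothing.

```shell
#!/bin/sh
# Print only the first 0 or 1 found on each input line (hypothetical helper).
first_match() {
    # ^[^01]*   skip leading characters that are neither 0 nor 1
    # ([01])    capture the first 0 or 1
    # .*        swallow the rest of the line so only the capture remains
    # -n ... p  print only lines where the substitution happened
    sed -E -n 's/^[^01]*([01]).*/\1/p'
}

echo "1000"  | first_match    # prints 1
echo "xy01z" | first_match    # prints 0
echo "abc"   | first_match    # prints nothing
```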



Cheers,
Tink

Last edited by Tinkster; 02-01-2011 at 07:14 PM. Reason: meh - typo, code-tags
 
Old 02-01-2011, 06:24 PM   #5
xeleema
Member
 
Registered: Aug 2005
Location: D.i.t.h.o, Texas
Distribution: Slackware 13.x, rhel3/5, Solaris 8-10(sparc), HP-UX 11.x (pa-risc)
Posts: 988
Blog Entries: 4

Rep: Reputation: 254
Good find!

Okay, the link you've provided contains well-written information. However, keep in mind that 'egrep' is going to report the whole line that a match was found in.

Case in point:
Code:
luser@lhost$ print "10 22 30 44 50\n100 222 300 444 500\n1000 2222 3000 4444 5000\n" | egrep "[85]"
10 22 30 44 50
100 222 300 444 500
1000 2222 3000 4444 5000
luser@lhost$ 

luser@lhost$ print "10 22 30 44 50\n100 222 300 444 500\n1000 2222 3000 4444 5000\n" | egrep "[0]"    
10 22 30 44 50
100 222 300 444 500
1000 2222 3000 4444 5000
luser@lhost$
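Note that print is a ksh built-in, so the commands above may not run as-is in bash. Here is a portable variant using printf and grep -E (equivalent to egrep). The sample() helper and its data are invented for this sketch, chosen so that one line has no match and the line-filter behavior is visible.

```shell
#!/bin/sh
# Three made-up sample lines; only the first and third contain a 0 or 1.
sample() {
    printf '%s\n' '10 22 30' '77 88 99' '1000 2222 3000'
}

# Matching lines are printed whole; the middle line is dropped.
sample | grep -E '[01]'

# -c counts matching lines, not individual matches.
sample | grep -E -c '[01]'     # prints 2
```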
Why not post your script (with [code] tags)? It'll give us a better idea of what's going on.
(If there's anything sensitive, you can redact that stuff)
 
Old 02-01-2011, 09:54 PM   #6
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191
Would appear some of my esteemed colleagues above have forgotten about the '-o' switch to only display the matched item.
Now it also depends on exactly what details you want.
Code:
echo '1000' | egrep -o '[01]'
This now returns each match within your string so the output looks like:
Code:
1
0
0
0
If on the other hand we only want to know if the first character is a 0 or 1 then we go with:
Code:
$ echo '1000' | egrep -o '^[01]'
1
As you can see it now only matches the 1 at the start of the string.
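If the 0 or 1 may sit anywhere on the line and only the first occurrence matters, one possible approach is to pipe the -o output through head. This is just a sketch with a made-up input string, and it assumes a single line of input.

```shell
#!/bin/sh
# -o prints every match on its own line; head keeps just the first one.
echo 'ab1000' | grep -E -o '[01]' | head -n 1      # prints 1

# With the ^ anchor the match must sit at the start of the line,
# so a line beginning with another character produces no match.
echo 'ab1000' | grep -E -o '^[01]' || echo 'no match at start of line'
```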

I'll let you play from there and see if you follow
 
Old 02-02-2011, 07:06 AM   #7
luvshines
Member
 
Registered: Apr 2009
Posts: 74

Original Poster
Rep: Reputation: 16
I used -o as pointed out by grail to see what was matched. Also, by default Ubuntu has grep aliased with --color=auto,
so I was able to see that each of the characters was matched.

So, if a word has multiple occurrences on a line, grep will find all of them. Is that correct?
And if that is correct, then my understanding (also as explained on that link) that only the first match is returned is wrong.
 
Old 02-02-2011, 08:27 AM   #8
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191
I am guessing you are referring to the following paragraph from your link:
Quote:
The entire regular expression could be matched starting at character 15. The engine is "eager" to report a match. It will therefore report the first three letters of catfish as a valid match. The engine never proceeds beyond this point to see if there are any "better" matches. The first match is considered good enough.
Whilst poorly worded, the example given for the engine's process is correct. The engine will in no way find a better match than cat at the start of catfish. This does not imply, although I agree again that it is worded poorly, that it will not match the word cat at the end of the sentence.

I guess another way to word the example is to say that the engine stops after matching the first cat in the sentence, but the application (in our case grep) tells the engine to start again from the end of the last match. Hence it will start looking again for a 'c' until it finds one at the start of cat, and then performs the remaining tests to see if the rest matches. Again the engine will stop at the end of the word cat, as this is the best and only match possible for the given regex (i.e. a literal string). Again grep will tell the engine to start again, as there is more text to be read in the form of the period at the end of the sentence. The engine will report that this does not match a 'c', and at this point grep will stop the engine as it has reached the end of its input.
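The restart loop described above can be imitated roughly as follows. This is only an illustration and not how grep is actually implemented: awk's match() stands in for the regex engine (it reports just the leftmost match), and the while loop plays the part of the application. The function name find_all is made up, and the sketch is only meant for simple patterns without ^ anchors or backslashes.

```shell
#!/bin/sh
# find_all REGEX TEXT: print every match of REGEX in TEXT, one per line.
find_all() {
    awk -v re="$1" -v s="$2" 'BEGIN {
        # match() finds only the leftmost match and sets RSTART/RLENGTH.
        while (s != "" && match(s, re) && RLENGTH > 0) {
            print substr(s, RSTART, RLENGTH)
            # Restart the search just past the end of the last match.
            s = substr(s, RSTART + RLENGTH)
        }
    }'
}

find_all 'cat' 'He captured a catfish for his cat.'   # prints cat twice
find_all '[01]' '1000'                                # prints 1, 0, 0, 0
```

Each pass through the loop is one "first match only" run of the engine; it is the loop around it that produces several matches per line.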

I hope that is a little clearer for you
 
Old 02-03-2011, 04:21 AM   #9
luvshines
Member
 
Registered: Apr 2009
Posts: 74

Original Poster
Rep: Reputation: 16

Ah, now that you have put it that way, it makes sense

So it is basically grep/egrep that makes the regex engine repeat the search in the remaining string. I didn't think this could be application specific. Since 'sed' works only on the first match and needs an explicit 'g' flag to work on multiple occurrences in the same line, I thought grep would work that way too.
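For comparison, the sed behavior mentioned here can be checked with a small example (the replacement character X is arbitrary):

```shell
#!/bin/sh
# Without the g flag, sed replaces only the first match on each line.
echo "1000" | sed 's/[01]/X/'      # prints X000

# With g, sed keeps searching after each match, much like grep -o does.
echo "1000" | sed 's/[01]/X/g'     # prints XXXX
```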

Thanks for the clarification guyz!!

Marking it SOLVED

PS: Is this behavior documented somewhere ??
 
Old 02-03-2011, 04:56 AM   #10
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3191
See Thread Tools to actually mark as SOLVED

As for documentation, I would just point you to the man pages, eg
Code:
NAME
       grep, egrep, fgrep, rgrep - print lines matching a pattern

--color[=WHEN], --colour[=WHEN]
              Surround the matched (non-empty) strings, matching lines, context lines, file names, line numbers, byte offsets, and separators (for fields and groups of context lines) with escape sequences to display them  in
              color  on  the  terminal.   The  colors  are defined by the environment variable GREP_COLORS.  The deprecated environment variable GREP_COLOR is still supported, but its setting does not have priority.  WHEN is
              never, always, or auto.
I put the --color part in to highlight that all matched strings are returned.
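Another way to see every match without relying on color is to combine -o with -b, which the GNU grep man page describes as printing the byte offset of each matching part. A small sketch, assuming GNU grep (other grep implementations may report offsets differently):

```shell
#!/bin/sh
# -o prints each match on its own line; with -b each one is prefixed
# by its 0-based byte offset in the input.
echo "1000" | grep -E -o -b '[01]'
# Output with GNU grep:
# 0:1
# 1:0
# 2:0
# 3:0
```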
 
  

