[SOLVED] Confusion while using character sets in egrep

luvshines · 02-01-2011, 12:32 PM

Hi

I have this confusion on how exactly character sets work in REGEX patterns.

From what I read:

Code:

Character Classes

Character classes indicate a set of characters to match. Enclosing a set of characters in square brackets "[...]" means "match any one of these characters". For example:

Regex:
[cbe]at
Matches:
cat
bat
eat
Doesn't Match:
sat
beat
Since a character class on its own only applies to one character in the match, combine it with a quantifier to search for multiple instances of the class.
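
The tutorial's match list reads the pattern as a whole-word match. A rough sketch of how that maps onto grep, assuming the five sample words above as input: `-x` forces a whole-line match, while plain grep selects any line that contains a match, so "beat" is selected too because it contains "eat".

```shell
# Sample words taken from the quoted tutorial text.
words='cat
bat
eat
sat
beat'

# -x requires the whole line to match, which is how the tutorial reads the pattern.
whole=$(printf '%s\n' "$words" | grep -Exc '[cbe]at')
# Without -x, a line is selected if the pattern matches anywhere in it,
# so "beat" is counted as well.
anywhere=$(printf '%s\n' "$words" | grep -Ec '[cbe]at')
# Adding the + quantifier lets the class match one or more characters.
quantified=$(printf '%s\n' "$words" | grep -Exc '[cbe]+at')
echo "whole-line: $whole, anywhere: $anywhere, with quantifier: $quantified"
```

Here `grep -E` is used as the equivalent of `egrep`, and `-c` only counts the selected lines.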

So I thought it would match only a single character. I tried using this:

Code:

echo "1000" | egrep "[01]"

But this egrep pattern seems to match all four characters in 1000. I thought it would match only the first 1 it finds, since it has to match a single character.

Why is it matching all 4 of them?

xeleema · 02-01-2011, 01:11 PM

Greetingz!
That regex is going to give you the whole line it finds the match in. The match isn't limited to the first character because your pattern isn't anchored to the beginning of the line.

P.S: This looks like a homework question...keep the LQ rules in mind.

luvshines · 02-01-2011, 03:17 PM

But I was under the impression that pattern matching starts from the left and returns the first match only.
I just googled it and found a link which says something similar: http://www.regular-expressions.info/engine.html

Is this not the behavior?

P.S: It is certainly not a homework question. I am working on a script (which will act as a Samba plugin) and I have just posted the part of my script where I am stuck/confused.

Tinkster · 02-01-2011, 06:00 PM

Given the egrep invocation you chose, how did you ascertain that it
matches all 4?

If you want to match only the first 0 or 1 on a line,
you'd have to go about it quite differently.

Code:

egrep "^[^01]*[01]"

But egrep will still return the whole LINE that matches.
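
A quick way to see this, sketched with `grep -E` (equivalent to egrep) and two made-up input strings: the selected line is printed in full, and a line without any 0 or 1 produces no output.

```shell
# The anchored pattern from above, applied to a hypothetical line "xx1000".
line=$(echo "xx1000" | grep -E '^[^01]*[01]')
echo "$line"
# A line containing neither 0 nor 1 is not printed at all
# (|| true only hides grep's non-zero exit status).
none=$(echo "2345" | grep -E '^[^01]*[01]' || true)
echo "no-match output: '$none'"
```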

To only get the first character that matches you'd probably
want to use sed:

Code:

sed -r 's/^[^01]*([01]).*/\1/'
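
To see what the substitution keeps, it may help to compare the command with and without a trailing `.*` in the pattern. This is only a sketch on a made-up input line; `-E` is used here as the portable spelling of GNU sed's `-r`.

```shell
# Without a trailing .*, only the leading non-matching text is removed,
# so the remainder of the line is still printed.
without_tail=$(echo "xx1000" | sed -E 's/^[^01]*([01])/\1/')
# With a trailing .*, the rest of the line is consumed too,
# leaving just the captured character.
with_tail=$(echo "xx1000" | sed -E 's/^[^01]*([01]).*/\1/')
echo "without .*: $without_tail"
echo "with .*:    $with_tail"
```

Note that a line containing no 0 or 1 passes through sed unchanged in either form.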

Cheers,
Tink

xeleema · 02-01-2011, 06:24 PM

Good find!

Okay, the link you've provided contains well-written information; however, keep in mind that 'egrep' is going to report the whole line that a match was found in.

Case in point;

Code:

luser@lhost$ print "10 22 30 44 50\n100 222 300 444 500\n1000 2222 3000 4444 5000\n" | egrep "[85]"
10 22 30 44 50
100 222 300 444 500
1000 2222 3000 4444 5000
luser@lhost$ 

luser@lhost$ print "10 22 30 44 50\n100 222 300 444 500\n1000 2222 3000 4444 5000\n" | egrep "[0]"    
10 22 30 44 50
100 222 300 444 500
1000 2222 3000 4444 5000
luser@lhost$

Why not post your script (with [code] tags)? It'll give us a better idea of what's going on.
(If there's anything sensitive, you can redact that stuff.)

grail · 02-01-2011, 09:54 PM

Would appear some of my esteemed colleagues above have forgotten about the '-o' switch to only display the matched item.
Now it also depends on exactly what details you want.

Code:

echo '1000' | egrep -o '[01]'

This now returns each match within your string, so the output looks like:

Code:

1
0
0
0

If on the other hand we only want to know if the first character is a 0 or 1 then we go with:

Code:

$ echo '1000' | egrep -o '^[01]'
1

As you can see it now only matches the 1 at the start of the string.
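
Building on `-o`, here is a small sketch for counting the matches on a line and for keeping only the first one wherever it occurs. The input strings are made up, and `grep -E` stands in for egrep.

```shell
# -o prints one match per output line, so counting lines counts matches.
count=$(echo '1000' | grep -Eo '[01]' | wc -l | tr -d ' ')
# head -n 1 keeps only the first match, even when it is not at the start of the line.
first=$(echo 'ab1000' | grep -Eo '[01]' | head -n 1)
echo "matches: $count, first match: $first"
```

The `head -n 1` approach assumes single-line input; with several input lines it would keep only the first match of the first matching line.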

I'll let you play from there and see if you follow

luvshines · 02-02-2011, 07:06 AM

I used -o as pointed out by grail to see what was matched. Also, by default Ubuntu has --color=auto aliased for grep,
so I was able to see that each of the characters was matched.

So, if a word has multiple occurrences on a line, grep will find all of them. Is that correct?
And if that is correct, then my understanding (also as explained on that link) that only the first match is returned is wrong.

grail · 02-02-2011, 08:27 AM

I am guessing you are referring to the following paragraph from your link:

Quote:

The entire regular expression could be matched starting at character 15. The engine is "eager" to report a match. It will therefore report the first three letters of catfish as a valid match. The engine never proceeds beyond this point to see if there are any "better" matches. The first match is considered good enough.

Whilst poorly worded, the example given for the engine's process is correct. The engine will not find a better match than cat at the start of catfish. This does not imply, although I agree again that it is worded poorly, that it will not match the word cat at the end of the sentence.

I guess another way to word the example is to say that the engine stops after matching cat in catfish, but the application (in our case grep) tells the engine to start again from the end of the last match. Hence it will start looking again for a 'c' until it finds one at the start of the word cat, and then performs the remaining tests to see if the rest matches. Again the engine will stop at the end of the word cat, as this is the best and only match possible for the given regex (i.e. a literal string). grep will then tell the engine to start again, as there is more text to be read in the form of the period at the end of the sentence. The engine will report that this does not match a 'c', and at this point grep stops because it has reached the end of its input.
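
The restart behaviour can be imitated in plain shell. This is only an illustration of the idea, not how grep is implemented: the sample sentence is assumed, and the shell's own pattern matching stands in for the regex engine. Each pass finds the leftmost "cat" and then continues from just past it.

```shell
# Assumed sample sentence with "cat" occurring twice.
text='He captured a catfish for his cat.'
rest=$text
count=0
# Each pass plays the role of one engine run: find the leftmost "cat",
# then restart from the end of that match, as grep does.
while :; do
  case $rest in
    *cat*) ;;        # another match exists in the remaining text
    *) break ;;      # nothing left to match: stop, like grep at end of input
  esac
  rest=${rest#*cat}  # drop everything up to and including the first "cat"
  count=$((count + 1))
  echo "match $count found; remaining text: '$rest'"
done
echo "total matches: $count"
```

The total should agree with what `grep -o cat` reports for the same sentence, one output line per match.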

I hope that is a little clearer for you

luvshines · 02-03-2011, 04:21 AM

Ah, now that you have put it that way, it makes sense

So it is basically grep/egrep making the regex engine repeat the search in the remaining string. Well, I didn't think that it could be application specific. Since 'sed' works only on the first match and needs an explicit 'g' flag to work on multiple occurrences in the same line, I thought grep would also work that way.
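
For comparison, a minimal sketch of the sed behaviour mentioned here, using the same "1000" input and an arbitrary replacement character X:

```shell
# Without the g flag, s/// replaces only the first match on the line.
first_only=$(echo '1000' | sed 's/[01]/X/')
# With the g flag, every match on the line is replaced.
every=$(echo '1000' | sed 's/[01]/X/g')
echo "first only: $first_only, with g: $every"
```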

Thanks for the clarification guyz!!

Marking it SOLVED

PS: Is this behavior documented somewhere ??

grail · 02-03-2011, 04:56 AM

See Thread Tools to actually mark as SOLVED

As for documentation, I would just point you to the man pages, eg

Code:

NAME
       grep, egrep, fgrep, rgrep - print lines matching a pattern

--color[=WHEN], --colour[=WHEN]
              Surround the matched (non-empty) strings, matching lines, context lines, file names, line numbers, byte offsets, and separators (for fields and groups of context lines) with escape sequences to display them in
              color on the terminal. The colors are defined by the environment variable GREP_COLORS. The deprecated environment variable GREP_COLOR is still supported, but its setting does not have priority. WHEN is
              never, always, or auto.

I put the --color part in to highlight that all matched strings are returned.