[SOLVED] Confusion while using character sets in egrep
I have this confusion on how exactly character sets work in REGEX patterns.
From what I read:
Code:
Character Classes
Character classes indicate a set of characters to match. Enclosing a set of characters in square brackets "[...]" means "match any one of these characters". For example:
Regex:
[cbe]at
Matches:
cat
bat
eat
Doesn't Match:
sat
beat
Since a character class on its own only applies to one character in the match, combine it with a quantifier to search for multiple instances of the class.
So I thought it would match only a single character. I tried this:
Code:
echo "1000" | egrep "[01]"
But this egrep pattern seems to match all four characters in 1000. I thought it would match only the first 1 it finds, since the class has to match a single character.
Why is it matching all 4 of them?
Last edited by luvshines; 02-01-2011 at 12:33 PM.
Reason: Typos
Greetingz!
That regex is going to give you the whole string it finds the match in. It's not limited to the beginning characters because you're not egrepping for the beginning characters.
P.S: This looks like a homework question...keep the LQ rules in mind.
But I was under the impression that pattern matching starts from the left and returns the first match only.
I just googled it and found a link which says something similar: http://www.regular-expressions.info/engine.html
Is this not the behavior?
P.S: It is certainly not a homework question. I am working on a script (which will act as a Samba plugin) and I have just posted the part of my script where I am stuck/confused.
Okay, the link you've provided contains well-written information; however, keep in mind that 'egrep' is going to report the whole line that a match was found in.
It would appear some of my esteemed colleagues above have forgotten about the '-o' switch, which displays only the matched item.
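To illustrate the point about whole lines, here is a minimal sketch. The three-line sample input is made up for the example, and it uses `grep -E`, which is equivalent to `egrep`. Without `-o`, grep only decides whether each line contains a match and then prints the whole line:

```shell
# Hypothetical three-line input; only lines containing a 0 or a 1 are selected.
input=$(printf '1000\nabc\nx1y\n')

# Without -o, grep prints every line that contains at least one match.
matching_lines=$(printf '%s\n' "$input" | grep -E '[01]')
printf '%s\n' "$matching_lines"
# prints:
# 1000
# x1y
```

The line `abc` is dropped because it contains neither character; the other two lines are printed in full, regardless of how many characters in them matched.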
Now it also depends on exactly what details you want.
Code:
echo '1000' | egrep -o '[01]'
This now returns each match within your string so the output looks like:
Code:
1
0
0
0
If on the other hand we only want to know if the first character is a 0 or 1 then we go with:
Code:
$ echo '1000' | egrep -o '^[01]'
1
As you can see it now only matches the 1 at the start of the string.
I'll let you play from there and see if you follow
I used -o, as pointed out by grail, to see what was matched. Also, by default Ubuntu has grep aliased with --color=auto,
so I was able to see that each of the characters was matched.
So, if a word has multiple occurrences on a line, grep will find all of them. Is that correct?
And if that is correct, then my understanding (also as explained at that link) that only the first match is returned is wrong.
I am guessing you are referring to the following paragraph from your link:
Quote:
The entire regular expression could be matched starting at character 15. The engine is "eager" to report a match. It will therefore report the first three letters of catfish as a valid match. The engine never proceeds beyond this point to see if there are any "better" matches. The first match is considered good enough.
Whilst poorly worded, the example given for the engine's process is correct. The engine will in no way find a better match than cat at the start of catfish. This does not imply, although I agree again that it is worded poorly, that it will not match the word cat at the end of the sentence.
I guess another way to word the example is to say that the engine stops after matching cat in the sentence, but that the application (in our case grep) tells the engine to start again from the end of the last match. Hence it will start looking again for a 'c' until it finds one at the start of cat, and then performs the remaining tests to see if the rest matches.
Again the engine will stop at the end of the word cat, as this is the best and only match possible based on the inputted regex (i.e. a literal string). Again grep will tell the engine to start again, as there is more text to be read in the form of the period at the end of the sentence. The engine will again say that this does not match a 'c', and at this point grep will stop the engine as it has reached the end of its input.
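The restart behaviour described above can be seen with `-o`. This is only a sketch: the sample sentence is an assumption chosen to fit the description (a catfish early on, and cat followed by a period at the end).

```shell
# Assumed sample sentence, chosen to match the description above.
sentence='He captured a catfish for his cat.'

# -o prints each match on its own line, so every restart that succeeds is visible.
matches=$(printf '%s\n' "$sentence" | grep -o 'cat')
printf '%s\n' "$matches"
# prints:
# cat
# cat

# Count the matches: one inside "catfish", one at the end of the sentence.
count=$(printf '%s\n' "$matches" | grep -c 'cat')
echo "$count"
# prints: 2
```

Note that "captured" produces no match: the engine gets as far as "ca", fails on the "p", and moves on.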
Ah, now that you have put it that way, it makes sense
So, it is basically grep/egrep that makes the regex engine repeat the search in the remaining string. I didn't think that it could be application-specific. Since 'sed' works only on the first match and needs an explicit 'g' flag to work on multiple occurrences in the same line, I thought grep would also work that way.
As for documentation, I would just point you to the man pages, e.g.:
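For comparison, a small sketch of the sed behaviour mentioned here, using the same '1000' input; the replacement character X is arbitrary:

```shell
# Without the g flag, sed's s command replaces only the first match on the line.
first_only=$(echo '1000' | sed 's/[01]/X/')
echo "$first_only"
# prints: X000

# With the g flag, sed keeps going and replaces every match on the line.
every_match=$(echo '1000' | sed 's/[01]/X/g')
echo "$every_match"
# prints: XXXX
```

The regex is identical in both commands; only the application's instruction to continue after the first match differs.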
Code:
NAME
grep, egrep, fgrep, rgrep - print lines matching a pattern
--color[=WHEN], --colour[=WHEN]
Surround the matched (non-empty) strings, matching lines, context lines, file names, line numbers, byte offsets, and separators (for fields and groups of context lines) with escape sequences to display them in
color on the terminal. The colors are defined by the environment variable GREP_COLORS. The deprecated environment variable GREP_COLOR is still supported, but its setting does not have priority. WHEN is
never, always, or auto.
I put the --color part in to highlight that all matched strings are returned.