[SOLVED] grep shortest matches to regex

porphyry5 · 07-28-2012, 01:19 PM

Thank you but never mind, found the solution. I needed

Code:

grep -Po "\((.*?)\)" <<< "$d"

Suppose

Code:

d="(grep this) don't grep this (but do grep this)"

and I want grep to return just the parts in parentheses, just one part to a line using "grep -o", i.e. as

Code:

(grep this)
(but do grep this)

What regex do I need to use here? With "grep -o", "grep -Eo" and "grep -Po" I have tried umpteen variations of '\(.*\)' '\(.?\)' '\(.+\)' both with and without \ quoting, and with and without all possible variants of \{-}. I get back either nothing, or the entire line, or each single character on its own line.

David the H. · 07-29-2012, 05:02 AM

When you use grep without -E, it uses basic regular expressions. In basic regex, the characters ?, +, {, |, (, and ) are considered literal. Prefixing them with a backslash enables their special meanings: \( \) and \{ \} are standard POSIX basic regex, while \?, \+ and \| are GNU extensions.

When you use -E, then it uses extended regular expressions, and the above characters are considered special by default. Backslashing them now disables their special meanings so that they become literal.

So in a nutshell, use -E if you need to use a lot of fancy regular expression features, and don't use it if you need to use a lot of literal characters like that.
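A quick way to see the difference is to feed the same pattern to both modes. This is only a sketch: it assumes a grep that supports -o (GNU and BSD do), and it reuses the sample string from the first post.

```shell
# Sample string from the first post
d="(grep this) don't grep this (but do grep this)"

# Basic regex: bare parentheses are literal characters
bre=$(printf '%s\n' "$d" | grep -o '(grep this)')

# Extended regex: bare parentheses form a group, so only the text inside is matched
ere=$(printf '%s\n' "$d" | grep -Eo '(grep this)')

# Extended regex: backslashes make the parentheses literal again
ere_lit=$(printf '%s\n' "$d" | grep -Eo '\(grep this\)')

printf 'BRE:\n%s\n' "$bre"                    # one line: (grep this)
printf 'ERE, unescaped:\n%s\n' "$ere"         # three lines, each: grep this
printf 'ERE, escaped:\n%s\n' "$ere_lit"       # one line: (grep this)
```

The unescaped -E version matches "grep this" three times because the parentheses no longer have to appear in the text at all.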

See the grep man and info pages for more on the differences between basic and extended regex. sed works the same way with its -r option, BTW.

Incidentally, I personally prefer to surround characters that need to be literal in "[]" bracket expressions, rather than using backslashes. It's cleaner and more portable overall.

In any case, your real problem isn't with grep, it's with the greediness of regex tokens like "*". They always capture the longest possible match. This means that '(.*)' will reach all the way to the final closing parenthesis in the line.
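To see the greediness in action on the sample string from the first post (a small sketch, again assuming a grep with -o):

```shell
# Sample string from the first post
d="(grep this) don't grep this (but do grep this)"

# Basic regex, so the parentheses are literal; ".*" grabs as much as it can
greedy=$(printf '%s\n' "$d" | grep -o '(.*)')

printf '%s\n' "$greedy"
# prints the whole line, because it starts with "(" and ends with ")"
```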

The usual way to counter that is to use a negating bracket expression. Match everything that's not that character, until you find one that is. Like this:

Code:

grep -o '([^)]*)'
grep -Eo '[(][^)]+[)]'

The "+" in the second one ensures that the parentheses must actually contain something in order to match. Use "*" if you want to match empty ones.
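Here is how the two behave side by side. This is just an illustration: the sample string is the one from the first post with an empty "()" tacked on the end (my addition) to show the difference between "*" and "+".

```shell
# Sample string from the first post, plus an empty pair of parentheses
d="(grep this) don't grep this (but do grep this) ()"

# Basic regex with "*": zero or more non-")" characters, so "()" matches too
star=$(printf '%s\n' "$d" | grep -o '([^)]*)')

# Extended regex with "+": at least one non-")" character, so "()" is skipped
plus=$(printf '%s\n' "$d" | grep -Eo '[(][^)]+[)]')

printf 'with *:\n%s\n' "$star"
# (grep this)
# (but do grep this)
# ()

printf 'with +:\n%s\n' "$plus"
# (grep this)
# (but do grep this)
```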

Finally, as you appear to have discovered, perl-compatible regular expressions allow you to disable greediness -- by appending a "?" to the greedy token. So if you use the -P option, then your expression could look like this:

Code:

grep -Po '[(].*?[)]'

Note finally that "-P" and the backslashed forms \?, \+ and \| in basic regex are GNU extensions (\( \) and \{ \} are standard POSIX). They likely won't be available to you if you ever need to use a non-GNU version of grep.
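If a script has to run where -P might be missing, one option is to probe for it and fall back to the bracket expression. This is only a sketch of that idea; the probe itself is my assumption about how you might test for support, and -o is still needed in both branches.

```shell
# Sample string from the first post
d="(grep this) don't grep this (but do grep this)"

# Probe: does this grep accept -P at all?
if printf 'x\n' | grep -qP 'x' 2>/dev/null; then
    # PCRE available: use the non-greedy quantifier
    out=$(printf '%s\n' "$d" | grep -Po '[(].*?[)]')
else
    # No PCRE: the negated bracket expression gives the same result here
    out=$(printf '%s\n' "$d" | grep -o '([^)]*)')
fi

printf '%s\n' "$out"
# (grep this)
# (but do grep this)
```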

porphyry5 · 07-29-2012, 07:25 PM

Quote:

Originally Posted by David the H.

Code:

grep -o '([^)]*)'

Thank you very much, that is an eye-opening way of approaching the greedy/non-greedy issue, much nicer solution, and

Code:

grep -o '([^)]\+)'

to ensure there is content in the parentheses. I prefer to use basic regexes whenever possible, don't get confused then by the differences in nomenclature.
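Since \+ in basic regex is a GNU extension, the same "at least one character" requirement can also be written in ways that stay within POSIX basic regex. A sketch of two such spellings, using the sample string with an empty "()" added for illustration (the -o option is still assumed):

```shell
# Sample string from the first post, plus an empty pair of parentheses
d="(grep this) don't grep this (but do grep this) ()"

# Interval expression: \{1,\} means "one or more" in POSIX basic regex
interval=$(printf '%s\n' "$d" | grep -o '([^)]\{1,\})')

# Or repeat the bracket expression once before the "*"
doubled=$(printf '%s\n' "$d" | grep -o '([^)][^)]*)')

printf '%s\n' "$interval"
# (grep this)
# (but do grep this)
```

Both forms skip the empty "()" just as \+ does.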