LinuxQuestions.org


jcrowley 10-20-2010 09:28 AM

bash regexp string compare stopped working
 
Have a bash script which contains a line like this:

if [[ ${array[${last}]} =~ "screenpc.PRODUCTION.*" ]]

which WORKED as expected in bash 4.0.33 and now fails in 4.1.2

Instrumented the script to print the value of the left-hand side and it is exactly what is expected.

As noted above, this has been working fine until we installed Fedora 13 (kernel 2.6.33), and now it fails.

Tried setting shell 'extglob' to On with same results.

Did something change? Are there other shell/bash options that need to be set?

Thanks for any help -- this has the whole installation stopped!

jcrowley 10-20-2010 10:16 AM

also tried compat31
 
Turned on this shell option -- still getting incorrect results.

colucix 10-20-2010 10:43 AM

Not sure why it worked previously, but a pattern inside double quotes is treated literally, asterisk included. For correct pattern matching you can try
Code:

if [[ ${array[${last}]} =~ screenpc.PRODUCTION.* ]]
but please note that in this case the trailing .* is unnecessary, since you can obtain the same result by matching only the string screenpc.PRODUCTION. It would make sense if you wanted to match something that appears later in the string, for example:
Code:

if [[ ${array[${last}]} =~ screenpc.PRODUCTION.*something ]]
Edit: as an aside, the compat31 option works for me (even with the quoted pattern). I enabled it using shopt -s compat31.

Edit: after a little searching I found the rule introduced in bash 3.2 that changes the behavior with respect to previous versions. From the bash reference manual:
Quote:

An additional binary operator, ‘=~’, is available, ... Any part of the pattern may be quoted to force it to be matched as a string.
and from the change log of Bash 3.2:
Quote:

Quoting the string argument to the [[ command's =~ operator now forces string matching, as with the other pattern-matching operators.
This means that if the entire pattern (or part of it) is enclosed in quotes, that part is treated as a literal string and no longer as a pattern.
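
The quoting rule can be checked with a short test. This is only a sketch: the variable names and the sample value are made up for illustration, and the results in the comment assume bash 3.2 or later with compat31 switched off.

```bash
#!/usr/bin/env bash
# Hypothetical sample value standing in for ${array[${last}]}.
value="screenpc.PRODUCTION.20100115"

# Quoted right-hand side: compared as a literal string, so the
# characters ".*" would have to appear in the value itself.
if [[ $value =~ "screenpc.PRODUCTION.*" ]]; then quoted=match; else quoted=nomatch; fi

# Unquoted right-hand side: treated as an extended regular expression.
if [[ $value =~ screenpc.PRODUCTION.* ]]; then unquoted=match; else unquoted=nomatch; fi

# Pattern kept in a variable and expanded without quotes: also a regex.
re='screenpc.PRODUCTION.*'
if [[ $value =~ $re ]]; then from_var=match; else from_var=nomatch; fi

# With default options on bash >= 3.2:
# quoted=nomatch unquoted=match from_var=match
echo "quoted=$quoted unquoted=$unquoted from_var=$from_var"
```

Keeping the pattern in a variable and expanding it unquoted is a common way to sidestep quoting surprises inside [[ ]].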

jcrowley 10-20-2010 02:40 PM

That works -- thanks.

Also works dropping the .* as you said, although this still confuses me. The left-side string does have more characters (e.g. screenpc.PRODUCTION.20100115), so I thought the final .* would be needed so that the regexp actually matched.

You are implying that the =~ will be true if the right-hand side matches anything within the left-hand side? i.e. implicitly .*matchthis.*

colucix 10-20-2010 05:00 PM

Quote:

Originally Posted by jcrowley (Post 4133902)
Also works dropping the .* as you said, although this still confuses me. The left-side string does have more characters (e.g. screenpc.PRODUCTION.20100115), so I thought the final .* would be needed so that the regexp actually matched.

To clarify, the =~ operator treats the right-hand side as an extended regular expression. Under the regexp rules the two characters .* together mean "zero or more occurrences of any character". In practice this matches anything, including the null string.

In this case the expression
Code:

screenpc.PRODUCTION.*
matches any string containing "screenpc?PRODUCTION" where the question mark means any character, for example:
Code:

somethingherescreenpc.PRODUCTIONsomethingelse
any text screenpc.PRODUCTION any text
screenpcZPRODUCTION
9999screenpc3PRODUCTION9999

and so on. The same happens if you omit the .* at the end, since .* stands for any character (dot) appearing zero or more times (asterisk), and so it can match nothing at all.
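
A quick way to see that the trailing .* makes no difference is to run both patterns over a few strings. This is a sketch only; the sample strings and variable names are invented for the test, and the last sample lacks "pc" on purpose so that it fails both patterns.

```bash
#!/usr/bin/env bash
# Both patterns, stored in variables so they expand as regexes.
with_star='screenpc.PRODUCTION.*'
without_star='screenpc.PRODUCTION'

# Made-up test strings; the last one has no "pc" and should fail both.
samples=(
  "somethingherescreenpc.PRODUCTIONsomethingelse"
  "screenpcZPRODUCTION"
  "screenpc.PRODUCTION.20100115"
  "screen.PRODUCTION"
)

results=()
for s in "${samples[@]}"; do
  a=no; b=no
  [[ $s =~ $with_star ]] && a=yes
  [[ $s =~ $without_star ]] && b=yes
  results+=("$a/$b")   # verdict with .* / verdict without .*
  printf '%-48s with .*: %-3s without: %s\n' "$s" "$a" "$b"
done
```

Both columns agree for every sample, which is the point: without anchors the regex only has to be found somewhere inside the string.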

The question is: do you want to match a string like
Code:

screenpc.PRODUCTION.20100115
where a literal dot followed by an eight-digit date MUST appear after screenpc.PRODUCTION? In that case a more refined regular expression could be:
Code:

if [[ ${array[${last}]} =~ screenpc\.PRODUCTION\.[0-9]{8} ]]
where the dots are escaped so that they match literally, and eight digits must follow the second dot.

Moreover, if you want to match the exact string without any other character before or after the string itself, you can use anchors as in
Code:

if [[ ${array[${last}]} =~ ^screenpc\.PRODUCTION\.[0-9]{8}$ ]]
Finally, in bash >= 3.2 we can write this expression as
Code:

if [[ ${array[${last}]} =~ ^"screenpc.PRODUCTION."[0-9]{8}$ ]]
where the part between double quotes is interpreted literally (dots included). Hope this clarifies. :)
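
To try the anchored form, it can be wrapped in a small function. This is a minimal sketch: is_dated_name is a hypothetical helper name and the sample names are invented, assuming bash 3.2 or later with default options.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds only for names of the exact form
# screenpc.PRODUCTION.<8 digits>, with nothing before or after.
is_dated_name() {
  [[ $1 =~ ^"screenpc.PRODUCTION."[0-9]{8}$ ]]
}

# Invented samples: only the first one should be accepted.
for name in \
  "screenpc.PRODUCTION.20100115" \
  "screenpcXPRODUCTION.20100115" \
  "screenpc.PRODUCTION.201001" \
  "old-screenpc.PRODUCTION.20100115" \
  "screenpc.PRODUCTION.201001150"
do
  if is_dated_name "$name"; then
    echo "accept: $name"
  else
    echo "reject: $name"
  fi
done
```

The nine-digit sample is rejected only because of the trailing $; without the anchors the first eight digits would be enough for a match.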

grail 10-20-2010 08:43 PM

Although I do this with great trepidation, I need to make an amendment to colucix's post:
Quote:

matches any string containing "screenpc?PRODUCTION" where the question mark means any character, for example:
A question mark means 0 or 1 of the previous character, so it will match the following:
Code:

screenpcPRODUCTION
screenpPRODUCTION

# but not
screenpc.PRODUCTION
# As this now has 2 characters between p and P
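
For anyone who wants to verify this, the three strings can be tested against the pattern with ? read as real regex syntax. This is just a sketch with made-up variable names:

```bash
#!/usr/bin/env bash
# In extended regex syntax "c?" means zero or one "c".
re='screenpc?PRODUCTION'

outcome=()
for s in screenpcPRODUCTION screenpPRODUCTION screenpc.PRODUCTION; do
  if [[ $s =~ $re ]]; then r=match; else r=nomatch; fi
  outcome+=("$r")
  echo "$s: $r"
done
```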


colucix 10-21-2010 01:44 AM

Hi grail! :) Actually the question mark was my own notation (not regex syntax). BTW, thank you for pointing it out, I should have chosen another character or maybe another color to avoid confusion.

jcrowley 10-21-2010 07:22 AM

Still somewhat confused, so I'm missing something. Here's how I interpreted the matching logic -- could you tell me where I'm off track?

aabbcc =~ aabbcc matches trivially since the strings are identical

aabbcc =~ aabb.* matches -- the 'aabb' sections match exactly, then the .* matches the 'cc' section

aabbcc =~ aabb does not match -- the 'aabb' sections match but there is nothing in the regexp to match the 'cc' portion on the left

aabbcc =~ bbcc does not match -- the 'bbcc' sections match, but there is nothing in the regexp to match the 'aa' portion

aabbcc =~ .*bbcc matches -- the .* matches the leading 'aa', and then the 'bbcc' sections match

aabbcc =~ .*bb.* matches -- the first .* matches the 'aa', then the 'bb' sections match, then the trailing .* matches the 'cc'

So in the actual case, I would expect these results:

screenpc.PRODUCTION.20100908 =~ screenpc.PRODUCTION.* matches -- the 'screenpc' matches, the first '.' matches any character (which just happens to also be a '.' in the original string), then 'PRODUCTION' matches, and finally the '.*' matches any set of trailing characters -- '.20100908' in this case.

screenpc.PRODUCTION.20100908 =~ screenpc.PRODUCTION does not match -- the 'screenpc.PRODUCTION' section matches as above, but then there is nothing in the regexp to match the '.20100908' portion of the original string.


If the last case does in fact produce a match, then I would think that the definition of the '=~' operator needs to be stated as:

"the regexp on the right matches A SUBSTRING in the string on the left"

i.e. it's more a 'search for a string' as opposed to 'the regexp matches the string on the left' -- which may in fact be the actual definition, and the book I've looked at is imprecise.

Sorry to belabor the point, but since the system does in fact appear to match the last example as you said, it's clear that I'm missing something fundamental and would like to get it straight.

Thanks

colucix 10-21-2010 08:49 AM

Actually you are missing the main point: a regular expression is a kind of search pattern. You have a string of any length and a regular expression which describes a sequence of characters to be searched for inside the string. In other words, a string matches a regular expression when it contains a sequence of characters described by the regular expression.

Hence it is not mandatory to write a regular expression that matches the entire string. Nevertheless you can refine the regular expression to match only the string (or a set of possible strings) you want.
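
The aabbcc cases from the previous post can be checked directly. The loop below is only a sketch with invented variable names; the two anchored patterns at the end are additions for contrast and were not in the original list. BASH_REMATCH[0] holds the part of the string that the regex actually matched.

```bash
#!/usr/bin/env bash
s="aabbcc"

# The six patterns from the post, plus two anchored ones for contrast.
patterns=('aabbcc' 'aabb.*' 'aabb' 'bbcc' '.*bbcc' '.*bb.*' '^aabb$' '^bbcc')

verdicts=()
for re in "${patterns[@]}"; do
  if [[ $s =~ $re ]]; then
    v=match
    # BASH_REMATCH[0] is the portion of $s that the regex matched.
    printf '%s =~ %-8s match (matched part: %s)\n' "$s" "$re" "${BASH_REMATCH[0]}"
  else
    v=nomatch
    printf '%s =~ %-8s no match\n' "$s" "$re"
  fi
  verdicts+=("$v")
done
```

All six unanchored patterns match, including aabb and bbcc, because each is found somewhere inside aabbcc. Only the anchored ones fail.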

An example of refining a regular expression. The following:
Code:

.
matches any string except the null string. This means that any string of one or more characters is matched. If you want to match a string of at least three characters you will use
Code:

...
If you want to match a string whose length is exactly three characters, you have to use anchors to match the beginning and the end of the string:
Code:

^...$
Better now? :)
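
The three patterns above can be tried with a short script. This is a sketch only: check is a hypothetical helper name and the test strings are arbitrary.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints "yes" if string $1 matches regex $2, else "no".
check() {
  if [[ $1 =~ $2 ]]; then echo yes; else echo no; fi
}

any_char=$(check "ab" '.')        # any non-empty string contains one character
empty=$(check "" '.')             # the null string has no character to match
at_least3=$(check "abcd" '...')   # three characters found inside four
too_short=$(check "ab" '...')     # only two characters available
exactly3=$(check "abc" '^...$')   # exactly three between the anchors
too_long=$(check "abcd" '^...$')  # four characters cannot fit between the anchors

echo "$any_char $empty $at_least3 $too_short $exactly3 $too_long"
```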

jcrowley 10-21-2010 09:22 AM

Yes, that is exactly the clarification I needed.

Thanks.

grail 10-21-2010 12:03 PM

Glad you got there :) Please mark the thread as SOLVED now that you have a solution.


All times are GMT -5.