[SOLVED] awk computed regex not working as expected

grail · 05-31-2011, 01:45 AM

So those of you that know me will agree that when it comes to awk I don't usually ask a lot of questions ... however this one has me stumped.

I am guessing I have missed something obvious, but for the life of me (and I have tested at great length) I cannot find it.

So the scenario is this:

The following awk code should identify all versions of libgpg-error within the attached file (see below) and only show one for each version:

Code:

#!/usr/bin/awk -f

BEGIN{
    page="libgpg-error"
    release=""
    remove="latest|diff|sig"
    IGNORECASE = 1
}

$0 !~ remove && match($0,page"-("release"[0-9][[:alnum:]_.-]+)[.]t(ar[.])?([[:alnum:]]+)",f){
    for(i = 0; f[i] != ""; i++)
        if(list !~ f[1])
            list = ((list)?list"\n":"")f[1]
}

END{ print list }

Based on the input file, I would expect to see the following:

Code:

1.0
1.1
1.10
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9

However, the actual output shows every entry except "1.10".
If someone could please put me out of my misery it would be greatly appreciated.

If you need to confirm that "1.10" is in fact one of the entries that should be shown, simply add the following after the for loop:

Code:

print "|"f[1]"|"

This will in fact print each version several times, but the point is that at some point f[1] does contain "1.10".
An alternate test is:

Code:

if(f[1] ~ "1.10")
    print f[1]" does exist"

I think the issue is the dot somehow being treated as "any character", but prior to hitting this entry the list looks like:

Code:

"1.0\n1.1"

I have also tested against this entry on its own with:

Code:

awk 'BEGIN{a="1.0\n1.1";if(a !~ "1.10")print "it is not there"}'

This will print the line at the end - it is not there

To test the script, run it as:

Code:

./script.awk libgpg-error.txt

grail · 05-31-2011, 02:04 AM

Ok ... I still have the issue but have managed to make it a lot simpler <phew>:
Input file:

Code:

1.0
1.0
1.1
1.1
1.10
1.10
1.2
1.2

Use the following:

Code:

awk '{if(a !~ $0)a = ((a)?a"\n":"")$0}END{ print a}' input_file

This will return single entries for everything except "1.10".
Further testing has shown that if you remove both "1.1" entries then "1.10" will now be shown.

So the query now seems to boil down to how awk is interpreting:

Code:

a = "1.0\n1.1"; if(a !~ "1.10") ...

grail · 05-31-2011, 04:32 AM

Rightio ... after a bit more investigation it appears that because the thing being compared to is a number it seems to throw the computed regex in both directions?? (this is a guess of course)

What I mean is:

Code:

awk 'BEGIN{a="1.1";b=1.10;if(a ~ b)print "yep"}'

Instead of testing whether a contains the pattern b (a 1, followed by any character, then 10), it seems to have reversed this and checked whether b contains a 1, followed by any character, then a 1, which it does, so 'yep' gets printed.

Changing a to be the number and b to be the string didn't seem to work, however, unless you also swap the values, so the following does work:

Code:

awk 'BEGIN{a=1.10;b="1.1";if(a ~ b)print "yep"}'

In a way I would probably have expected this one.

So I guess my question now is - does anyone know if awk's regex operator (~) is bi-directional?

Colour me confused
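
One reading of the two one-liners above that does not need a bi-directional operator: a numeric constant on the right of ~ is converted to a string first, and that string is what gets used as the regex. A minimal sketch, assuming a POSIX-style awk with the default CONVFMT of "%.6g" (the names b, s and out are only for illustration):

```shell
out=$(awk 'BEGIN {
    b = 1.10                 # unquoted, so this is a numeric constant
    s = b ""                 # concatenation converts it to a string via CONVFMT
    print "[" s "]"          # the trailing zero is already gone
    if ("1.1" ~ b) print "matched with regex " s
}')
echo "$out"
```

Here the left-hand side is tested against the pattern 1.1, so the match still goes in the usual direction.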

crts · 05-31-2011, 06:11 AM

Hi grail,

as you have correctly determined, the problem is that $0 is evaluated as a number, i.e. 1.10 automatically becomes 1.1 before the RegEx is evaluated.
The solution I can offer is to force $0 to be interpreted as a string:

Code:

gawk '{if(a !~ $0""){a = ((a)?a"\n":"")$0""}}END{ print a}' file  # This is still NOT 100% ok

If you try the above example with the following data:

Code:

1.0
1.0
1.1
1.1
1.1000
1.1000
1.10
1.10
1.2
1.2

it still won't work as expected, because when "1.10" is checked 'a' already contains '1.1000', so the regex matches and "1.10" is never added. By appending "$" instead of just "", we can cope with that, too.

Code:

gawk '{if(a !~ $0"$"){a = ((a)?a"\n":"")$0""}}END{ print a}' file

The second $0"" (the one in the assignment) can probably also be just $0, but I am not 100% sure about that.

Hope this helps.
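
Note that in awk "$" anchors at the end of the whole string, not at each embedded newline, so the test above only looks at the last entry of the list and relies on equal entries being adjacent in the input. As a sketch of a variant that avoids regex matching altogether (an assumption of this example, not something proposed above), index() can look for the entry as a literal, newline-delimited substring; v and out are illustrative names:

```shell
out=$(printf '1.0\n1.0\n1.1\n1.1\n1.1000\n1.1000\n1.10\n1.10\n1.2\n1.2\n' |
awk '{
    v = $0 ""                                   # force the string value
    # literal substring search: the dot is not a wildcard here
    if (index("\n" a "\n", "\n" v "\n") == 0)
        a = (a == "" ? v : a "\n" v)
}
END { print a }')
echo "$out"
```

Wrapping both the list and the entry in newlines makes "1.10" distinct from "1.1000" without any anchors.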

grail · 05-31-2011, 08:17 AM

Well I am glad to see some direction to go in

Thanks as always

I had gone down the road of using sprintf to print a string but the results are not always accurate.

I am curious though at what point does it decide the variable is a number and lose the trailing zeros??

I say this as I have tested further and all of these still print the number as displayed in the file:

Code:

awk '{b = $0;print b}' f1 # even tried with $1 but no diff

awk '{b = $1;printf "%s\n", b}' f1

awk '1' f1

Yet as soon as it is used in the regex it defaults to '1.1'

Well thanks again for your help ... just another trap for young<cough> players

Guttorm · 05-31-2011, 09:16 AM

Hi

Not sure if I understand correctly, but I think awk simply considers everything starting with a digit and not in quotes to be a number. And that means regex expressions too.

Code:

echo "1.1" | awk '{if ($1 ~ 1.10) print $1}'
echo "1.1" | awk '{if ($1 ~ "1.10") print $1}'

The first one matches, not the second.

You can also put the regex between / characters so it doesn't start with a digit.

Code:

echo "1.1" | awk '{if ($1 ~ /1.10/) print $1}'

For variables, this happens at assignment, so these are different:

Code:

echo "1.1" | awk 'BEGIN{a="1.10"} {if ($1 ~ a) print $1}'
echo "1.1" | awk 'BEGIN{a=1.10} {if ($1 ~ a) print $1}'

grail · 05-31-2011, 09:24 AM

The issue here is that the item being compared against, the 'a' in your last example, is either set from a field or line (i.e. $0 or $1, and so on) or, in my original example, is an array element filled in by the match function. So whilst I follow your reasoning, it does not work in this instance, as there is no point in the script at which the value could be assigned as a string.
Even if you do:

Code:

a = "\""$0"\""

This now makes the regex look for a number surrounded by quotes.

So far crts' option of appending the quotes with the end-of-string anchor ($) seems to do the trick for me.

crts · 05-31-2011, 09:35 AM

Quote:

Originally Posted by grail

I am curious though at what point does it decide the variable is a number and lose the trailing zeros??

Well, that is indeed a very good question. Let us have a look at the following example:

Code:

$ cat file
1.1000
1.10
$ gawk '{a=$0;print a}' file
1.1000
1.10
$ gawk '{a=($0 + 0.03);print a}' file
1.13
1.13

So apparently the conversion is context dependent. If there is a mathematical operation involved it converts it to a number. Since a RegEx is not a mathematical expression the previously observed behavior could be classified as a bug - IMHO.
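
The context dependence can also be seen with comparisons. A small sketch, assuming the POSIX rule that an input field which looks like a number is compared numerically against a numeric constant, while concatenating "" forces a string comparison (num, str and out are illustrative names):

```shell
out=$(printf '1.1000\n1.10\n' | awk '{
    num = ($1 == 1.1)            # numeric comparison: 1 when the values are equal
    str = (($1 "") == "1.1")     # string comparison against the text as read
    print $1, num, str
}')
echo "$out"
```

Both records compare equal to 1.1 as numbers, neither compares equal to "1.1" as a string, and print still shows the text exactly as it appeared in the input.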

grail · 05-31-2011, 10:11 AM

hmmm ... I guess that would boil down to how the regex engine is viewing the input. I do also note that the manual does warn against using computed regexes.

Quote:

Given that you can use both regexp and string constants to describe regular expressions, which should you use? The answer is “regexp constants,” for several reasons:

1. String constants are more complicated to write and more difficult to read. Using regexp constants makes your programs less error-prone. Not understanding the difference between the two kinds of constants is a common source of errors.
2. It is more efficient to use regexp constants. awk can note that you have supplied a regexp and store it internally in a form that makes pattern matching more efficient. When using a string constant, awk must first convert the string into this internal form and then perform the pattern matching.
3. Using regexp constants is better form; it shows clearly that you intend a regexp match.

Although I wonder if the double pass used to perform a computed regex may be part of the issue (part in red)?

Guttorm · 05-31-2011, 10:58 AM

I don't think it's the regex that makes it a number. If so, wouldn't this print?

Code:

echo "1.10" | awk '{if ("1.1" ~ $1) print "awk is buggy."}'

I am no awk expert, but I know that in JavaScript, for example, a regexp constant can be compiled at "compile time", whereas string expressions are compiled at run time. I think that's what the manual means. It can make a huge difference if it's in a loop.

crts · 05-31-2011, 12:07 PM

I agree with Guttorm. I do not think that the RegEx mechanism ever sees the "original" input. The observed behavior suggests that awk parses and evaluates $0. It then sees that $0 is an operand ( in this case the right-hand operand of the '!~' operator ) and converts it automatically to a number before it passes it to the RegEx engine. Since the manual states that the RegEx engine can only cope with String- and RegEx-constants it is possible that there is an intermediate step to convert the number to a String before it is finally passed for RegEx evaluation. So the process might look like:
1) the String 1.100 is read from file
2) it is determined that $0 is an operand and therefore 1.100 is converted to the number 1.1
3) numbers cannot be dealt with by the RegEx engine -> the number 1.1 is converted to the String 1.1
4) the String is passed to the RegEx engine, converted internally to the corresponding RegEx-constant and finally evaluated.

Steps 1) - 4) are solely deduced from the observed behavior. If this is not how it works then corrections/clarification are welcome - as usual
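
Steps 2) and 3) can at least be reproduced by hand. The sketch below forces the conversions explicitly with + 0 and ""; it does not show that awk performs them implicitly for a field on the right of the operator, which is the open question here (s1, n, s2 and out are illustrative names):

```shell
out=$(echo "1.100" | awk '{
    s1 = $1 ""     # step 1: the text as read from the input
    n  = $1 + 0    # step 2: arithmetic forces a number
    s2 = n ""      # step 3: back to a string via CONVFMT
    print s1, s2   # step 4 would then compile s2 as the regex
}')
echo "$out"
```

If the deduced process is right, s2 is the text that would reach the RegEx engine.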

grail · 06-01-2011, 04:58 AM

Well using modifications of Guttorm's example, I am not sure we have hit the nail on the head yet:

Code:

$ echo "1.10" | awk '{if ("1.1" ~ $1) print "awk is buggy."}'
$ echo -e "1.0\n1.10\n1.1" | awk '{if ("1.1" ~ $1) print "awk is buggy."}'
$ echo -e "1.1\n1.10" | awk '{if ("1.1" ~ $1) print "awk is buggy."}'
awk is buggy.
awk is buggy.

So the single entry, and the combined input that does contain the expected line, print nothing, yet the last one says that both lines match???

I agree there is some kind of process afoot that takes the number, truncates it, and then converts it back to a string for comparison.
But why, after a successful test (i.e. 1.1 is the first record in the last example), is it suddenly converting the number, when it does not do so if 1.10 is on its own?

I also get the same results from the above if I place them in a file.

crts · 06-01-2011, 11:16 AM

Quote:

Originally Posted by grail

So the single entry, and the combined input that does contain the expected line, print nothing, yet the last one says that both lines match???
....
But why, after a successful test (i.e. 1.1 is the first record in the last example), is it suddenly converting the number, when it does not do so if 1.10 is on its own?

Does 'awk' have a 'set -x' equivalent like bash? If not then our chances to determine what exactly is going on are rather slim.

After your last examples my guess would be that somewhere along the conversion process there is a "stale" flag which is maybe checked to determine what kind of conversion has to be performed.
Or maybe the RegEx buffers are not thoroughly purged.

Well, further speculation might bring up more sentences that start with maybe.
Like this one:
Maybe we should report it as a bug. If using a variable as the right-hand operand is that erratic and arbitrary, it should at least print a warning.

Anyway, appending double-quotes ($1"") seems to handle the above cases correctly.
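
For the original goal of listing each version once, here is a sketch that sidesteps the regex operator entirely. It is an alternative, not something tested in this thread: use the field as an array subscript, since subscripts are always strings. The array name seen and the variable out are illustrative, and the "" is kept only to follow the workaround above.

```shell
out=$(printf '1.0\n1.0\n1.1\n1.1\n1.10\n1.10\n1.2\n1.2\n' |
    awk '!seen[$1 ""]++')    # true only the first time a given text is seen
echo "$out"
```

Each distinct text is printed once, in order of first appearance, so "1.1" and "1.10" stay separate.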