awk computed regex not working as expected
So those of you that know me will agree that when it comes to awk I don't usually ask a lot of questions ... however this one has me stumped.
I am guessing I have missed something obvious, but for the life of me (and I have tested at great length) I cannot find it :(
So the scenario is this: the following awk code should identify all versions of libgpg-error within the attached file (see below) and show each version only once: Code:
#!/usr/bin/awk -f
Code:
1.0
If someone could please put me out of my misery it would be greatly appreciated :)
If you want to confirm that "1.10" is in fact one of the entries that should be shown, simply add the following after the for loop: Code:
print "|"f[1]"|"
An alternate test is: Code:
if(f[1] ~ "1.10")
Code:
"1.0\n1.1"
Code:
awk 'BEGIN{a="1.0\n1.1";if(a !~ "1.10")print "it is not there"}'
To test the script, run it as: Code:
./script.awk libgpg-error.txt
Ok ... I still have the issue but have managed to make it a lot simpler <phew>:
Input file: Code:
1.0
Code:
awk '{if(a !~ $0)a = ((a)?a"\n":"")$0}END{ print a}' input_file
Further testing has shown that if you remove both "1.1" entries then "1.10" is shown. So the question now seems to boil down to how awk is interpreting: Code:
a = "1.0\n1.1"; if(a !~ "1.10") ...
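For anyone who wants to keep the "build up one newline-separated string" structure but avoid regex interpretation altogether, here is a minimal sketch using index() instead of !~. The sample input is made up to stand in for the truncated input file above, and this sidesteps the puzzle rather than explaining it:

```shell
# Hypothetical sample input standing in for the truncated input file.
out=$(printf '1.0\n1.1\n1.1\n1.10\n' | awk '
{
    # Wrap the list and the candidate in newlines so that 1.1 is never
    # found inside 1.10; index() does a literal search, not a regex match.
    if (index("\n" a "\n", "\n" $0 "\n") == 0)
        a = (a == "" ? "" : a "\n") $0
}
END { print a }')
printf '%s\n' "$out"
```

With that input it should list 1.0, 1.1 and 1.10, one per line.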
Rightio ... after a bit more investigation it appears that, because the thing being compared to is a number, the computed regex gets thrown in both directions?? (this is a guess of course)
What I mean is: Code:
awk 'BEGIN{a="1.1";b=1.10;if(a ~ b)print "yep"}'
As I read it, this asks whether a contains a 1 followed by any character and then a 1, which it does, so 'yep' gets printed. Changing a to be the number and b to be the string didn't seem to work, however :( Unless you also change the values, so the following does work: Code:
awk 'BEGIN{a=1.10;b="1.1";if(a ~ b)print "yep"}'
So I guess my question now is: does anyone know if awk's regex operator (~) is bi-directional? Colour me confused
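One way to see what is going on with the numeric constants in these BEGIN examples: both operands of ~ end up as strings, and a number such as 1.10 turns into the string 1.1 when it is converted (with the default CONVFMT). A small sketch, where the variable names n and s are only for illustration. Note that it covers numeric constants only, not values read from a file:

```shell
out=$(awk 'BEGIN {
    b = 1.10                          # numeric constant
    print "b as a string: " (b "")    # concatenation forces the string form
    a = "1.1"
    print "a ~ b: " ((a ~ b) ? "match" : "no match")
    # Swap the types: the number on the left, the string as the regex.
    n = 1.10; s = "1.10"
    print "n ~ s: " ((n ~ s) ? "match" : "no match")
}')
printf '%s\n' "$out"
```

So the operator is not really bi-directional; the number simply loses its trailing zero on whichever side it appears.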
Hi grail,
as you have correctly determined, the problem is that $0 is evaluated as a number, i.e. 1.10 automatically becomes 1.1 before the RegEx is evaluated. The solution I can offer is to force $0 to be interpreted as a string: Code:
gawk '{if(a !~ $0""){a = ((a)?a"\n":"")$0""}}END{ print a}' file # This is still NOT 100% ok
Code:
1.0
Code:
gawk '{if(a !~ $0"$"){a = ((a)?a"\n":"")$0""}}END{ print a}' file
Hope this helps.
Well I am glad to see some direction to go in :) Thanks as always :)
I had gone down the road of using sprintf to print a string, but the results are not always accurate. I am curious, though: at what point does it decide the variable is a number and lose the trailing zeros?? I say this because I have tested further and all of these still print the number as displayed in the file: Code:
awk '{b = $0;print b}' f1 # even tried with $1 but no diff
Well thanks again for your help ... just another trap for young<cough> players
Hi
Not sure if I understand correctly, but I think awk simply considers everything that starts with a digit and is not in quotes to be a number. And that applies to regexes too. Code:
echo "1.1" | awk '{if ($1 ~ 1.10) print $1}'
You can also put the regex between / characters so it doesn't start with a digit. Code:
echo "1.1" | awk '{if ($1 ~ /1.10/) print $1}'
Code:
echo "1.1" | awk 'BEGIN{a="1.10"} {if ($1 ~ a) print $1}'
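A related trap when a version string is used as a regex, whether between slashes or from a variable: the dots match any character. Below is a rough sketch of one way to neutralise them in a value taken from a field. The variable names re and anchored are made up for the example, and the $1 "" form follows the string-forcing trick suggested earlier in the thread:

```shell
out=$(printf '1.10\n' | awk '{
    # Unescaped, the dots in the field act as wildcards.
    print "1x10 vs raw field: " (("1x10" ~ ($1 "")) ? "match" : "no match")
    # Turn each dot into a bracket expression, which matches only a dot.
    re = $1 ""
    gsub(/\./, "[.]", re)
    anchored = "^" re "$"
    print "1x10 vs escaped: " (("1x10" ~ anchored) ? "match" : "no match")
    print "1.10 vs escaped: " (("1.10" ~ anchored) ? "match" : "no match")
}')
printf '%s\n' "$out"
```

The ^ and $ anchors only make sense here because the value is matched against a single version string, not against a newline-joined list.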
The issue here is that the item being compared to, the 'a' in your last example, is either set from a field or line (i.e. $0 or $1, and so on) or, in my original example, is an array element that was filled in by the match function. So whilst I follow your reasoning, it does not work in this instance, as there is no point in the script at which to set the value to a string type. Even if you do: Code:
a = "\""$0"\""
So far crts' option of appending quotes together with the end character ($) seems to do the trick for me :)
Quote:
Code:
$ cat file |
hmmm ... I guess that would boil down to how the regex engine is viewing the input. I do also note that the manual does warn against using computed regexes. Quote:
|
I don't think it's the regex that makes it a number. If so, wouldn't this print?
Code:
echo "1.10" | awk '{if ("1.1" ~ $1) print "awk is buggy."}' |
I agree with Guttorm. I do not think that the RegEx mechanism ever sees the "original" input. The observed behavior suggests that awk parses and evaluates $0. It then sees that $0 is an operand ( in this case the right-hand operand of the '!~' operator ) and converts it automatically to a number before it passes it to the RegEx engine. Since the manual states that the RegEx engine can only cope with String- and RegEx-constants it is possible that there is an intermediate step to convert the number to a String before it is finally passed for RegEx evaluation. So the process might look like:
1) the String 1.100 is read from file
2) it is determined that $0 is an operand and therefore 1.100 is converted to the number 1.1
3) numbers cannot be dealt with by the RegEx engine -> the number 1.1 is converted to the String 1.1
4) the String is passed to the RegEx engine, converted internally to the corresponding RegEx-constant and finally evaluated.
Steps 1) - 4) are solely deduced from the observed behavior. If this is not how it works then corrections/clarifications are welcome - as usual :)
Well using modifications of Guttorm's example, I am not sure we have hit the nail on the head yet:
Code:
$ echo "1.10" | awk '{if ("1.1" ~ $1) print "awk is buggy."}'
I agree there is some kind of process afoot that takes the number, truncates it and then converts it back to a string for comparison. But why, after a successful test (i.e. when 1.1 is the first record, as in the last example), does it suddenly convert the number, whereas when 1.10 is on its own it does not? I also get the same results from the above if I place them in a file.
Quote:
After your last examples my guess would be that somewhere along the conversion process there is a "stale" flag which is maybe checked to determine what kind of conversion has to be performed. Or maybe the RegEx buffers are not thoroughly purged. Well, further speculation might bring up more sentences that start with maybe. Like this one: Maybe we should report it as a bug. If using a variable as the right-hand operand is that erratic and arbitrary, it should at least print a warning.
Anyway, appending double-quotes ($1"") seems to handle the above cases correctly.
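If the goal is only "show each version once", another option is to skip the regex comparison entirely and key an array on the string form of the line. This is just a sketch with made-up sample input, and the array name seen is arbitrary; the appended "" is the same string-forcing trick as above:

```shell
# Hypothetical sample input; the real file is the attachment from the first post.
out=$(printf '1.0\n1.1\n1.1\n1.10\n' | awk '!seen[$0 ""]++')
printf '%s\n' "$out"
```

Array subscripts are compared as exact strings, so 1.1 and 1.10 stay distinct and no metacharacters come into play.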