[SOLVED] awk computed regex not working as expected
Based on the input file, I would expect to see the following:
Code:
1.0
1.1
1.10
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
However, the actual output shows all of them except the "1.10" entry.
If someone could please put me out of my misery, it would be greatly appreciated.
If you need to confirm that "1.10" is in fact one of the entries that should be shown, simply add the following after the for loop:
Code:
print "|"f[1]"|"
This will in fact print each version several times, but the point is that at some point f[1] does contain "1.10".
An alternate test is:
Code:
if(f[1] ~ "1.10")
print f[1]" does exist"
I think the issue is the dot somehow being treated as "any character", but prior to hitting this entry the list looks like:
Code:
"1.0\n1.1"
I have also tested against this entry on its own with:
Code:
awk 'BEGIN{a="1.0\n1.1";if(a !~ "1.10")print "it is not there"}'
This prints the message at the end: "it is not there".
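For what it's worth, here is a minimal sketch of the difference between a computed regex and a literal substring test, in case the dot is part of the confusion. It assumes a POSIX-style awk; the sample value and the shell variable `out` are made up for illustration. In a regex the dot matches any single character, whereas index() compares the text literally.

```shell
out=$(awk 'BEGIN {
    s = "1x10"                     # hypothetical value, contains no dot
    if (s ~ "1.10")                # computed regex: "." also matches the "x"
        print "regex: match"
    if (index(s, "1.10") == 0)     # literal search: the text "1.10" is not in s
        print "index: no match"
}')
printf '%s\n' "$out"
```

If the dot should only ever match a literal dot, index() (or an escaped dot) may be the safer test.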
To test the script, run it as:
Code:
./script.awk libgpg-error.txt
This will return single entries for everything except "1.10".
Further testing has shown that if you remove both "1.1" entries then "1.10" will now be shown.
So the query now seems to break down to how is awk interpreting:
Rightio ... after a bit more investigation it appears that because the thing being compared to is a number it seems to throw the computed regex in both directions?? (this is a guess of course)
What I mean is:
Code:
awk 'BEGIN{a="1.1";b=1.10;if(a ~ b)print "yep"}'
Instead of testing whether a matches the pattern in b (a 1, followed by any character, then 10), it seems to have reversed this and checked whether b contains a 1 followed by any character and then a 1, which it does, so 'yep' gets printed.
Changing a to be the number and b to be the string didn't seem to work, however, unless you also swap the values, so the following does work:
Code:
awk 'BEGIN{a=1.10;b="1.1";if(a ~ b)print "yep"}'
In a way I would probably have expected this one.
So I guess my question now is - does anyone know if awk's regex operator (~) is bi-directional?
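A minimal sketch of what might be going on with the numeric constant, assuming an awk that follows POSIX number-to-string conversion (CONVFMT, default "%.6g"). The variable names `n`, `s` and `out` are only for illustration. The idea is that an unquoted 1.10 is stored as the number 1.1, so by the time it is used as a regex it is already the text "1.1", with no reversal of the operands needed to explain the match.

```shell
out=$(awk 'BEGIN {
    n = 1.10          # numeric constant: stored as the number 1.1
    s = "1.10"        # string constant: the text is kept as typed
    print (n "")      # number converted to a string
    print (s "")      # string left alone
    if ("1.1" ~ n)  print "number used as regex matches 1.1"
    if ("1.1" !~ s) print "string used as regex does not match 1.1"
}')
printf '%s\n' "$out"
```

This only covers constants written in the program text; values read from the input file are a separate question.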
As you have correctly determined, the problem is that $0 is evaluated as a number, i.e. 1.10 automatically becomes 1.1 before the RegEx is evaluated.
The solution I can offer is to force $0 to be interpreted as a string:
Code:
gawk '{if(a !~ $0""){a = ((a)?a"\n":"")$0""}}END{ print a}' file # This is still NOT 100% ok
If you try the above example with the following data:
Code:
1.0
1.0
1.1
1.1
1.1000
1.1000
1.10
1.10
1.2
1.2
It still won't work as expected, because when "1.10" is checked 'a' already contains '1.1000', so the match succeeds and "1.10" is skipped. By simply appending "$" instead of just "" we can cope with that, too.
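As an aside, a different technique sidesteps the regex entirely: keep an associative array keyed on the exact text of each line. This is not the approach used above, just a sketch of the common `!seen[$0]++` idiom, with `seen` and `out` as made-up names. Array subscripts are compared as strings, so "1.1", "1.10" and "1.1000" stay distinct and the dot is never treated as a wildcard.

```shell
# Feed the sample data from above; a line is printed only the
# first time its exact text appears (seen is a hypothetical array name).
out=$(printf '1.0\n1.0\n1.1\n1.1\n1.1000\n1.1000\n1.10\n1.10\n1.2\n1.2\n' |
    awk '!seen[$0]++')
printf '%s\n' "$out"
```

Unlike the "$" variant, this should not depend on duplicates being next to each other in the input.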
Not sure if I understand correctly, but I think awk simply considers everything starting with a digit and not in quotes to be a number. And that means regex expressions too.
The issue here is that the item being compared against (the 'a' in your last example) is set either from a field or line (i.e. $0, $1, and so on) or, in my original example, from an array element filled in by the match function. So whilst I follow your reasoning, it does not work in this instance, as there is no point in the script at which the value can be set as a string type.
Even if you do:
Code:
a = "\""$0"\""
This now makes the regex look for a number surrounded by quotes.
So far crts' option of appending the quotes together with the end anchor ($) seems to do the trick for me.
So apparently the conversion is context dependent. If there is a mathematical operation involved it converts it to a number. Since a RegEx is not a mathematical expression the previously observed behavior could be classified as a bug - IMHO.
Hmmm ... I guess that would boil down to how the regex engine is viewing the input. I do also note that the manual does warn against using computed regexes.
Quote:
Given that you can use both regexp and string constants to describe regular expressions, which should you use? The answer is “regexp constants,” for several reasons:
1. String constants are more complicated to write and more difficult to read. Using regexp constants makes your programs less error-prone. Not understanding the difference between the two kinds of constants is a common source of errors.
2. It is more efficient to use regexp constants. awk can note that you have supplied a regexp and store it internally in a form that makes pattern matching more efficient. When using a string constant, awk must first convert the string into this internal form and then perform the pattern matching.
3. Using regexp constants is better form; it shows clearly that you intend a regexp match.
Although I wonder if the double pass used to perform a computed regex (see point 2) may be part of the issue?
I'm no awk expert, but I know that in JavaScript, for example, a regexp constant can be compiled at "compile time", whereas string expressions are compiled at run time. I think that's what the manual means. It can make a huge difference if it's in a loop.
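To illustrate point 1 of the quote (string constants being more awkward to write), here is a small sketch assuming a POSIX-style awk. A literal dot needs one backslash in a regexp constant but two in a string constant, because the string is scanned once as a string and then again as a regex. The sample values and the variable `out` are made up.

```shell
out=$(awk 'BEGIN {
    # regexp constant: one backslash escapes the dot
    if ("1.1" ~ /1\.1/)   print "constant: 1.1 matches"
    if ("1x1" !~ /1\.1/)  print "constant: 1x1 rejected"
    # string constant: the backslash itself has to be escaped
    if ("1.1" ~ "1\\.1")  print "string: 1.1 matches"
    if ("1x1" !~ "1\\.1") print "string: 1x1 rejected"
}')
printf '%s\n' "$out"
```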
I agree with Guttorm. I do not think that the RegEx mechanism ever sees the "original" input. The observed behavior suggests that awk parses and evaluates $0. It then sees that $0 is an operand ( in this case the right-hand operand of the '!~' operator ) and converts it automatically to a number before it passes it to the RegEx engine. Since the manual states that the RegEx engine can only cope with String- and RegEx-constants it is possible that there is an intermediate step to convert the number to a String before it is finally passed for RegEx evaluation. So the process might look like:
1) the String 1.100 is read from file
2) it is determined that $0 is an operand and therefore 1.100 is converted to the number 1.1
3) numbers cannot be dealt with by the RegEx engine -> the number 1.1 is converted to the String 1.1
4) the String is passed to the RegEx engine, converted internally to the corresponding RegEx-constant and finally evaluated.
Steps 1) - 4) are solely deduced from the observed behavior. If this is not how it works then corrections/clarification are welcome - as usual
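One way to look at steps 2) and 3) in isolation is to force each conversion by hand. The sketch below assumes an awk that follows POSIX (default OFMT "%.6g"); `out` is just an illustrative shell variable. It does not explain the odd results in this thread, it only shows what the field looks like as read, as a number, and as a string.

```shell
out=$(echo "1.100" | awk '{
    print $1         # field as read from the input
    print $1 + 0     # arithmetic forces a number
    print $1 ""      # concatenation forces a string
}')
printf '%s\n' "$out"
```

On such an awk the field text survives unless arithmetic is applied to it, which would suggest the truncation has to happen somewhere else.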
Well using modifications of Guttorm's example, I am not sure we have hit the nail on the head yet:
Code:
$ echo "1.10" | awk '{if ("1.1" ~ $1) print "awk is buggy."}'
$ echo -e "1.0\n1.10\n1.1" | awk '{if ("1.1" ~ $1) print "awk is buggy."}'
$ echo -e "1.1\n1.10" | awk '{if ("1.1" ~ $1) print "awk is buggy."}'
awk is buggy.
awk is buggy.
So the individual entry, and the combined input that does contain the expected line, print nothing, yet the last one reports a match for both lines???
I agree there is some kind of process afoot that processes the number and truncates it then converts it back to a string for comparison.
But why, after a successful test (i.e. 1.1 is the first record in the last example), is it suddenly converting the number, when it does not do so for 1.10 on its own?
I also get the same results from the above if I place them in a file.
Quote:
So the individual entry, and the combined input that does contain the expected line, print nothing, yet the last one reports a match for both lines???
....
But why, after a successful test (i.e. 1.1 is the first record in the last example), is it suddenly converting the number, when it does not do so for 1.10 on its own?
Does 'awk' have a 'set -x' equivalent like bash? If not then our chances to determine what exactly is going on are rather slim.
After your last examples my guess would be that somewhere along the conversion process there is a "stale" flag which is maybe checked to determine what kind of conversion has to be performed.
Or maybe the RegEx buffers are not thoroughly purged.
Well, further speculation might bring up more sentences that start with maybe.
Like this one:
Maybe we should report it as a bug. If using a variable as the right-hand operand is that erratic and arbitrary, it should at least print a warning.
Anyway, appending an empty string ($1"") seems to handle the above cases correctly.
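For reference, a sketch of the last test case with the empty string appended. The output shown by running it is what I would expect from any awk that treats the field as the text it read; `out` is only an illustrative shell variable.

```shell
# Same input as the third test above, but the field is forced to a
# string before it is used as the regex.
out=$(printf '1.1\n1.10\n' |
    awk '{ if ("1.1" ~ ($1 "")) print "matched with regex " $1 }')
printf '%s\n' "$out"
```

Only the first record should match here, since the regex "1.10" needs four characters and "1.1" has three.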