LinuxQuestions.org
Old 05-31-2011, 01:45 AM   #1
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,478

awk computed regex not working as expected


So those of you that know me will agree that when it comes to awk I don't usually ask a lot of questions ... however this one has me stumped.

I am guessing I have missed something obvious, but for the life of me (and I have tested at great length) I cannot find it.

So the scenario is this:

The following awk code should identify all versions of libgpg-error within the attached file (see below) and only show one for each version:
Code:
#!/usr/bin/awk -f

BEGIN{
    page="libgpg-error"
    release=""
    remove="latest|diff|sig"
    IGNORECASE = 1
}

$0 !~ remove && match($0,page"-("release"[0-9][[:alnum:]_.-]+)[.]t(ar[.])?([[:alnum:]]+)",f){
    for(i = 0; f[i] != ""; i++)

	if(list !~ f[1])
		list = ((list)?list"\n":"")f[1]
}

END{ print list }
Based on the input file, I would expect to see the following:
Code:
1.0
1.1
1.10
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
However, the actual output shows all of them except 1.10 (the entry in red).
If someone could please put me out of my misery it would be greatly appreciated
If you need to know that "1.10" is in fact one of the entries that should be shown, simply add the following after the for loop:
Code:
print "|"f[1]"|"
This will in fact print each version several times, but the point is that at some point f[1] does contain "1.10".
An alternate test is:
Code:
if(f[1] ~ "1.10")
    print f[1]" does exist"
I think the issue is the dot somehow being treated as "any character", but prior to hitting this entry the list looks like:
Code:
"1.0\n1.1"
I have also tested against this entry on its own with:
Code:
awk 'BEGIN{a="1.0\n1.1";if(a !~ "1.10")print "it is not there"}'
This prints the message at the end ("it is not there"), as expected.

To test the script, run it as:
Code:
./script.awk libgpg-error.txt
Attached Files
File Type: txt libgpg-error.txt (7.2 KB, 5 views)
 
Old 05-31-2011, 02:04 AM   #2
grail
Ok ... I still have the issue but have managed to make it a lot simpler <phew>:
Input file:
Code:
1.0
1.0
1.1
1.1
1.10
1.10
1.2
1.2
Use the following:
Code:
awk '{if(a !~ $0)a = ((a)?a"\n":"")$0}END{ print a}' input_file
This will return single entries for everything except "1.10".
Further testing has shown that if you remove both "1.1" entries then "1.10" will now be shown.

So the query now seems to break down to how is awk interpreting:
Code:
a = "1.0\n1.1"; if(a !~ "1.10") ...
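For reference, a minimal sketch of the same de-duplication done with an array instead of a regex test. This is not from the thread: the `dedupe` wrapper and the `seen` array are names made up for illustration. Array subscripts are strings, and a field's string value is its original text, so 1.1 and 1.10 stay distinct and the dot is never treated as a wildcard.

```shell
# Hypothetical helper (not from the thread): print each distinct line once,
# in order of first appearance. seen[$0] is keyed on the exact input text.
dedupe() {
    awk '!seen[$0]++' "$@"
}

# Same data as the input file above.
printf '%s\n' 1.0 1.0 1.1 1.1 1.10 1.10 1.2 1.2 | dedupe
```

Unlike the string-building version, this keeps no newline-joined list, so there is nothing for a computed regex to misread.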
 
Old 05-31-2011, 04:32 AM   #3
grail
Rightio ... after a bit more investigation it appears that because the thing being compared to is a number it seems to throw the computed regex in both directions?? (this is a guess of course)

What I mean is:
Code:
awk 'BEGIN{a="1.1";b=1.10;if(a ~ b)print "yep"}'
Instead of testing whether a contains the pattern in b (a 1, followed by any character, followed by 10), it seems to have reversed this and checked whether b contains a 1 followed by any character and then a 1, which it does, so 'yep' gets printed.

Swapping them, so that a is the number and b is the string, didn't seem to work unless you also swap the values, so the following does work:
Code:
awk 'BEGIN{a=1.10;b="1.1";if(a ~ b)print "yep"}'
In a way I would probably have expected this one.

So I guess my question now is - does anyone know if awk's regex operator (~) is bi-directional?

Colour me confused
 
Old 05-31-2011, 06:11 AM   #4
crts
Senior Member
 
Registered: Jan 2010
Posts: 1,604

Hi grail,

As you have correctly determined, the problem is that $0 is evaluated as a number, i.e. 1.10 automatically becomes 1.1 before the RegEx is evaluated.
The solution I can offer is to force $0 to be interpreted as a string:
Code:
gawk '{if(a !~ $0""){a = ((a)?a"\n":"")$0""}}END{ print a}' file  # This is still NOT 100% ok
If you try the above example with the following data:
Code:
1.0
1.0
1.1
1.1
1.1000
1.1000
1.10
1.10
1.2
1.2
it still won't work as expected because when "1.10" is checked 'a' already contains '1.1000', so the expression evaluates to true. By simply adding "$" instead of just "" we can cope with that, too.
Code:
gawk '{if(a !~ $0"$"){a = ((a)?a"\n":"")$0""}}END{ print a}' file
The second $0"" (the one in the assignment) can probably also be just $0; I am not 100% sure about that.

Hope this helps.
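A caveat on the `$` version, offered as a sketch rather than a correction: as far as I know, `$` in an awk regex matches only at the very end of the string, not before each embedded newline, so that test only looks at the last entry of `a`, and the dot still matches any character. If duplicates are not adjacent in the input, a literal comparison with index() may be safer. The `uniq_list` wrapper and the variable `list` below are illustrative names, not from the thread.

```shell
# Hypothetical helper: build a newline-separated list of distinct lines
# using a literal substring test instead of a computed regex.
uniq_list() {
    awk '
    {
        # Wrap both the list and the candidate in newlines so that only
        # whole entries match; index() compares literal text, no regex.
        if (index("\n" list "\n", "\n" $0 "\n") == 0)
            list = (list == "" ? "" : list "\n") $0
    }
    END { print list }' "$@"
}

# Unsorted input with non-adjacent duplicates.
printf '%s\n' 1.0 1.1 1.1000 1.10 1.1 1.2 1.0 | uniq_list
```

The newline wrapping is what keeps 1.10 from being treated as already present just because 1.1000 is in the list.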

Last edited by crts; 05-31-2011 at 06:12 AM.
 
Old 05-31-2011, 08:17 AM   #5
grail
Well, I am glad to see some direction to go in. Thanks as always.

I had gone down the road of using sprintf to print a string but the results are not always accurate.

I am curious though at what point does it decide the variable is a number and lose the trailing zeros??

I say this as I have tested further and all of these still print the number as displayed in the file:
Code:
awk '{b = $0;print b}' f1 # even tried with $1 but no diff

awk '{b = $1;printf "%s\n", b}' f1

awk '1' f1
Yet as soon as it is used in the regex it defaults to '1.1'

Well thanks again for your help ... just another trap for young<cough> players
 
Old 05-31-2011, 09:16 AM   #6
Guttorm
Senior Member
 
Registered: Dec 2003
Location: Trondheim, Norway
Distribution: Debian and Ubuntu
Posts: 1,134

Hi

Not sure if I understand correctly, but I think awk simply considers everything starting with a digit and not in quotes to be a number. And that means regex expressions too.

Code:
echo "1.1" | awk '{if ($1 ~ 1.10) print $1}'
echo "1.1" | awk '{if ($1 ~ "1.10") print $1}'
The first one matches, not the second.

You can also put the regex between / characters so it doesn't start with a digit.
Code:
echo "1.1" | awk '{if ($1 ~ /1.10/) print $1}'
For variables, this happens at assignment, so these are different:
Code:
echo "1.1" | awk 'BEGIN{a="1.10"} {if ($1 ~ a) print $1}'
echo "1.1" | awk 'BEGIN{a=1.10} {if ($1 ~ a) print $1}'
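To see why the unquoted constant behaves differently, here is a small sketch (mine, not from the post). A numeric constant is stored as a number, and when awk needs its text it formats it with CONVFMT (default "%.6g"), which drops the trailing zero. A string constant keeps its characters. `show_constants` is a made-up name.

```shell
# Numeric constant vs. string constant, both forced to strings with "".
# The brackets just make the resulting text visible.
show_constants() {
    awk 'BEGIN {
        n = 1.10      # number: its text is produced via CONVFMT
        s = "1.10"    # string: kept character for character
        print "[" n "" "]", "[" s "" "]"
    }'
}

show_constants    # prints: [1.1] [1.10]
```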
 
Old 05-31-2011, 09:24 AM   #7
grail
The issue here is that the item being compared against (the 'a' in your last example) is set either from a field or line (i.e. $0, $1, and so on) or, in my original example, from an array element filled in by the match function. So whilst I follow your reasoning, it does not work in this instance, as there is no point in the script at which the value can be set as a string type.
Even if you do:
Code:
a = "\""$0"\""
This now makes the regex look for a number surrounded by quotes.

So far crts' option of appending the quotes with the end-of-string character ($) seems to do the trick for me.
 
Old 05-31-2011, 09:35 AM   #8
crts
Quote:
Originally Posted by grail
I am curious though at what point does it decide the variable is a number and lose the trailing zeros??
Well, that is indeed a very good question. Let us have a look at the following example:
Code:
$ cat file
1.1000
1.10
$ gawk '{a=$0;print a}' file
1.1000
1.10
$ gawk '{a=($0 + 0.03);print a}' file
1.13
1.13
So apparently the conversion is context dependent. If there is a mathematical operation involved it converts it to a number. Since a RegEx is not a mathematical expression the previously observed behavior could be classified as a bug - IMHO.
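The context dependence can be seen with comparisons as well. A hedged sketch, relying only on the comparison rules described for POSIX awk: two fields that both look like numbers are compared numerically, while appending "" forces a string comparison. `compare_fields` is a made-up name.

```shell
compare_fields() {
    awk '{
        num = ($1 == $2)          # both fields look numeric: numeric comparison
        str = ($1 "" == $2 "")    # forced to strings: character comparison
        print num, str
    }'
}

echo "1.10 1.1" | compare_fields    # prints: 1 0
```

The field text itself is never altered; only the kind of comparison changes with the context.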
 
Old 05-31-2011, 10:11 AM   #9
grail
Hmmm ... I guess that would boil down to how the regex engine is viewing the input. I do also note that the manual does warn against using computed regexes.

Quote:
Given that you can use both regexp and string constants to describe regular expressions, which should you use? The answer is “regexp constants,” for several reasons:

1. String constants are more complicated to write and more difficult to read. Using regexp constants makes your programs less error-prone. Not understanding the difference between the two kinds of constants is a common source of errors.
2. It is more efficient to use regexp constants. awk can note that you have supplied a regexp and store it internally in a form that makes pattern matching more efficient. When using a string constant, awk must first convert the string into this internal form and then perform the pattern matching.
3. Using regexp constants is better form; it shows clearly that you intend a regexp match.
Although I wonder if the double pass used to perform a computed regex may be part of the issue (part in red)?
 
Old 05-31-2011, 10:58 AM   #10
Guttorm
I don't think it's the regex that makes it a number. If so, wouldn't this print?

Code:
echo "1.10" | awk '{if ("1.1" ~ $1) print "awk is buggy."}'
I am no awk expert, but I know that in Javascript, for example, a regexp constant can be compiled at "compile time", whereas string expressions are compiled at run time. I think that's what the manual means. It can make a huge difference if it's in a loop.

Last edited by Guttorm; 05-31-2011 at 11:37 AM.
 
Old 05-31-2011, 12:07 PM   #11
crts
I agree with Guttorm. I do not think that the RegEx mechanism ever sees the "original" input. The observed behavior suggests that awk parses and evaluates $0. It then sees that $0 is an operand ( in this case the right-hand operand of the '!~' operator ) and converts it automatically to a number before it passes it to the RegEx engine. Since the manual states that the RegEx engine can only cope with String- and RegEx-constants it is possible that there is an intermediate step to convert the number to a String before it is finally passed for RegEx evaluation. So the process might look like:
1) the String 1.100 is read from file
2) it is determined that $0 is an operand and therefore 1.100 is converted to the number 1.1
3) numbers cannot be dealt with by the RegEx engine -> the number 1.1 is converted to the String 1.1
4) the String is passed to the RegEx engine, converted internally to the corresponding RegEx-constant and finally evaluated.

Steps 1) - 4) are solely deduced from the observed behavior. If this is not how it works then corrections/clarification are welcome - as usual
 
Old 06-01-2011, 04:58 AM   #12
grail
Well using modifications of Guttorm's example, I am not sure we have hit the nail on the head yet:
Code:
$ echo "1.10" | awk '{if ("1.1" ~ $1) print "awk is buggy."}'
$ echo -e "1.0\n1.10\n1.1" | awk '{if ("1.1" ~ $1) print "awk is buggy."}'
$ echo -e "1.1\n1.10" | awk '{if ("1.1" ~ $1) print "awk is buggy."}'
awk is buggy.
awk is buggy.
So a single record, and a combined input that does contain the expected line, print nothing, yet the last command reports a match for both records???

I agree there is some kind of process afoot that takes the number, truncates it, and then converts it back to a string for comparison.
But why, after a successful test (1.1 is the first record in the last example), does it suddenly start converting the number, when it does not do so for 1.10 on its own?

I also get the same results from the above if I place them in a file.
 
Old 06-01-2011, 11:16 AM   #13
crts
Quote:
Originally Posted by grail
So individual entry and a combined entry which does have the expected line does not work yet the last says that both have what we need???
....
But why after a successful test, ie 1.1 is first record in last example, and then suddenly now it is converting the number but when 1.10 is on its own it does not?
Does 'awk' have a 'set -x' equivalent like bash? If not then our chances to determine what exactly is going on are rather slim.

After your last examples my guess would be that somewhere along the conversion process there is a "stale" flag which is maybe checked to determine what kind of conversion has to be performed.
Or maybe the RegEx buffers are not thoroughly purged.

Well, further speculation might bring up more sentences that start with maybe.
Like this one:
Maybe we should report it as a bug. If using a variable as the right-hand operand is that erratic and arbitrary, it should at least print a warning.

Anyway, appending an empty string ($1"") seems to handle the above cases correctly.
 
  

