LinuxQuestions.org
Old 06-26-2012, 06:55 PM   #1
schneidz
LQ Guru
 
Registered: May 2005
Location: boston, usa
Distribution: fedora-35
Posts: 5,313

weird awk output while trying to sum a field


hi, i am getting this weird awk output. i am expecting those fields to add up to 0. can anyone figure out what is going on ?

Code:
schneidz@lq> uname -a -m -p
AIX hostname 1 6 00F6yyyyyy00 powerpc
schneidz@lq> list-clp-cas-svc.ksh clm12345 vendor-835-12-06-22_16:26:55.250.12962
CLP*54321*2*257*45.53**15*clm12345*13*1
CAS*OA*23*178.54
CAS*PR*2*-124.81
SVC*HC<51798*55*99.26*0402*1
CAS*CO*94*-202
CAS*OA*23*32.93
CAS*PR*1*100**2*24.81
SVC*HC<99214<25*110*0*0510*1
CAS*CO*45*110
SVC*HC<51741*92*0*0920*1
CAS*CO*45*92

schneidz@lq> list-clp-cas-svc.ksh clm12345 vendor-835-12-06-22_16:26:55.250.12962 | awk -F \* '/^CAS.PR/ {a = a + $4 + $7} END {print a}'                             
-3.55271e-15
 
Old 06-26-2012, 09:46 PM   #2
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Quote:
Originally Posted by schneidz
hi, i am getting this weird awk output. i am expecting those fields to add up to 0. can anyone figure out what is going on ?
It is a precision issue.

For a double-precision floating-point number, the closest representation to -124.81 is something like -124.810000000000002273736754, and the closest representation to 24.81 is something like 24.809999999999998721023076. This is because the machine representation of double-precision floating-point numbers is based on powers of two, not ten, so most decimal fractions cannot be stored exactly. Integers are exact up to 2^53 = 9007199254740992 in magnitude. Read more in the Wikipedia article on IEEE-754, the standard that just about all computers using floating-point numbers rely on.

If you sum up the machine representations (and add the hundred), you get something like -0.00000000000000355271367, which can also be written as -3.55271367e-15 in scientific notation.
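To see the stored values directly, one can ask printf for more digits than the default output shows. This is only a sketch: it assumes an awk whose printf hands the work to a C library that prints full decimal expansions, and the helper name show_stored is made up for illustration.

```shell
# show_stored: hypothetical helper; prints each argument next to the value awk stores for it.
show_stored() {
    printf '%s\n' "$@" | awk '{ printf("%s is stored as %.20f\n", $1, $1) }'
}

show_stored -124.81 100 24.81

# Sum in the same order as the data in the first post, then print 17 significant digits.
printf '%s\n' -124.81 100 24.81 | awk '{ s += $1 } END { printf("%.17g\n", s) }'
```

The 100 should come out with nothing but zeros after the decimal point, while the other two carry a small tail. The last command prints the leftover of those tails, a value near -3.55e-15.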

The solution is simple. Instead of relying on the default floating-point output format, use an explicit one:
Code:
awk -F \* '/^CAS.PR/ {a = a + $4 + $7} END {printf("%.2f\n", a)}'
This one uses two decimal digits on the right side of the decimal point, and always uses the normal (non-scientific) notation.
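For a quick side-by-side check, the two CAS*PR lines from the first post can be fed to both versions of the command. This is a sketch only; the helper sample_lines is a made-up stand-in for list-clp-cas-svc.ksh, which is not available here.

```shell
# sample_lines: hypothetical stand-in that emits the two CAS*PR lines from the first post.
sample_lines() {
    printf '%s\n' 'CAS*PR*2*-124.81' 'CAS*PR*1*100**2*24.81'
}

# Default output: print formats the sum with OFMT (%.6g), so the residue shows.
sample_lines | awk -F \* '/^CAS.PR/ { a = a + $4 + $7 } END { print a }'

# Explicit format: two decimals, never scientific notation.
sample_lines | awk -F \* '/^CAS.PR/ { a = a + $4 + $7 } END { printf("%.2f\n", a) }'
```

With a C-style printf, a tiny negative residue keeps its sign, so the second command may show -0.00 rather than 0.00. The value is still zero to two decimals.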

The GNU Awk User's Manual describes printf pretty well. It also mentions which bits are unique to it, so with a bit of care you can use it as a reference even if you are using AIX awk.

(In case you are wondering, I always use parentheses around the parameters to printf in awk to remind myself and others that it is the traditional printf, and not just awk print. printf in awk works just about exactly the same way as the printf in C. Most awks also have sprintf, which allows you to "save" the formatted string into a variable.)
 
Old 06-26-2012, 10:07 PM   #3
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

It's interesting though that this only shows up at zero (in this case). If you run the code snippet without $7 it resolves "accurately". The printf solution is a good one though.
 
Old 06-26-2012, 10:39 PM   #4
Nominal Animal
Quote:
Originally Posted by syg00
It's interesting though that this only shows up at zero (in this case). If you run the code snippet without $7 it resolves "accurately". The printf solution is a good one though.
What do you mean? Using gawk-3.1.8,
Code:
$ printf '%s\n' -124.81 100 24.81 | awk '{ s += $1 } END { print s }'
-3.55271e-15
$ awk 'BEGIN { print OFMT }'
%.6g
I suspect that is default for most awk variants. For those who are unfamiliar with printf patterns, %.6g means "the floating point value using six significant digits, using the scientific notation when necessary".
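The effect of OFMT can be tried out by setting it from the command line. A small sketch, where the helper name sum_with_ofmt is made up for illustration and the input values are the ones used above:

```shell
# sum_with_ofmt: hypothetical helper; sums the first field of stdin and
# prints the total with print, using the OFMT given as the first argument.
sum_with_ofmt() {
    awk -v OFMT="$1" '{ s += $1 } END { print s }'
}

printf '%s\n' -124.81 100 24.81 | sum_with_ofmt '%.6g'    # the usual default
printf '%s\n' -124.81 100 24.81 | sum_with_ofmt '%.17g'   # far more digits than the default
printf '%s\n' 0.5 0.25 | sum_with_ofmt '%.6g'             # values that are exact in binary
```

OFMT only changes how print renders non-integer numbers. It does not change the stored value, and printf ignores it.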

Here, the result is zero, up to the given precision (actually, up to about 15 to 17 significant digits, as one can expect from double-precision IEEE-754 floating-point numbers). The issue is that the default floating-point pattern does not know the precision of the inputs, and just uses six significant digits. It's like saying "but there is this smudge to the right side of the number, so the actual value really is a tiny bit bigger".

I think %g is stupid. If you look at my awk snippets, I tend to use %.2f or similar. There, the 2 means two decimal digits on the right side of the decimal point. That way the "smudges" don't pollute my results -- but I need to know in advance how many decimal digits I want in my results.

Of course, if you change the order of the summation, the result will change, as the loss of precision due to cancellation changes. Remember, integer values (-2^53 .. 2^53) are exact, but the other two values are not. (Just because they're exact in decimal does not make them exact in the binary IEEE-754 representation.) In other words,
Code:
$ printf '%s\n' -124.81 24.81 100 | awk '{ s += $1 } END { print s }'
0
$ printf '%s\n' 100 -124.81 24.81 | awk '{ s += $1 } END { print s }'
-3.55271e-15
$ printf '%s\n' 24.81 100 -124.81 | awk '{ s += $1 } END { print s }'
0
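Printing the partial sums with more digits shows where the residue appears or vanishes. A sketch only; running_sums is a made-up helper name, and the digits shown depend on the C library's printf.

```shell
# running_sums: hypothetical helper; prints every partial sum with 17 significant digits.
running_sums() {
    printf '%s\n' "$@" | awk '{ s += $1; printf("after %8s: %.17g\n", $1, s) }'
}

echo "order: -124.81 100 24.81"
running_sums -124.81 100 24.81

echo "order: -124.81 24.81 100"
running_sums -124.81 24.81 100
```

In the second order, -124.81 + 24.81 is rounded to exactly -100, because the residue is smaller than half the spacing between doubles near 100. It is therefore gone before the 100 is added, and the final sum is an exact 0.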
 
Old 06-26-2012, 11:01 PM   #5
syg00
Quote:
Originally Posted by Nominal Animal
What do you mean?
This is what I mean ...
Code:
awk -F \* '/^CAS.PR/ {a = a + $4} END {print a}' aix.txt 
-24.81
That's -24.81, not -24.81+/- a "smudge".

It's always interesting the way floating point messes with things. Nice explanations BTW.
 
Old 06-26-2012, 11:29 PM   #6
Nominal Animal
Quote:
Originally Posted by syg00
This is what I mean ...
Code:
awk -F \* '/^CAS.PR/ {a = a + $4} END {print a}' aix.txt 
-24.81
That's -24.81, not -24.81+/- a "smudge".
But it is -24.81 plus a smudge!

The actual value used for representing -24.81 using IEEE-754 double precision floating point numbers is exactly -24.809999999999998721023075631819665431976318359375 = -6983394172191375 × 2^-48.

The smudge just gets hidden because the default OFMT says to print the six significant digits, which here are -24.8100. %g does not print trailing decimal zeros, so it gets output as -24.81.

When the result gets close to zero, the smudge is all that is left over, and that's why it gets printed. There is just no way for poor awk to know which part of the result is actual result, and which part is just rounding smudges.

There are no magic bullets for this, either. There is no way to "always use a good output format". The needed output format depends not only on the precision and range of the input variables, but also on the computation done on them (especially since the computation is limited in precision, and cancellation and loss of precision can and do occur). It is just one of those things that we humans have to know about, and take care of ourselves.
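One way of taking care of it, when every input is known to be an amount with at most two decimals, is to sum whole cents instead of fractions. This is an alternative sketch, not something proposed in the thread; the helper sum_cents and the awk function cents are made-up names, and the approach assumes totals stay well below 2^53 cents.

```shell
# sum_cents: hypothetical helper; assumes every amount has at most two decimals.
sum_cents() {
    awk -F \* '
        # Round a decimal amount to a whole number of cents.
        function cents(x) { return sprintf("%.0f", x * 100) + 0 }
        /^CAS.PR/ { total += cents($4) + cents($7) }
        END { printf("%.2f\n", total / 100) }'
}

printf '%s\n' 'CAS*PR*2*-124.81' 'CAS*PR*1*100**2*24.81' | sum_cents
```

Each amount is rounded to an integer before it is added, so the additions themselves are exact and no residue builds up. The knowledge about the inputs (two decimals) is still supplied by the person writing the script, which is the point made above.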

Last edited by Nominal Animal; 06-26-2012 at 11:58 PM.
 
  

