[SOLVED] Once again... awk.. awk... awk

shivaa · 12-08-2012, 08:53 AM

I have a file file1, containing 50000 entries (numerical floating point numbers only).

And I am doing:
1. Count of the total number of lines
2. Count of lines containing values less than 1 (i.e. val1)
3. Count of lines containing values greater than 1 (i.e. val2)
4. Percentage of both val1 and val2.

And I did (Note: All below code is part of a script which is generating file1):

Code:

#sum
sum=$(awk 'END{print NR}' file1)

#calculating val1
val1=$(awk '$1<1 { n++ } END{ print n }' file1)
prctg1=$(echo | awk "{print $val1*100/$sum}")

#calculating val2
val2=$(awk '$1>1 { n++ } END{ print n }' file1)
prctg2=$(echo | awk "{print $val2*100/$sum}")

echo "$sum\t$prctg1\t$prctg2"

It's fine up to this point. But I want to combine the val1 and prctg1 commands into a single awk one-liner. I tried, but I am probably making a syntax mistake and have no clue where. Any suggestions on how I can combine them?

BTW, LQ has always been so helpful to me. In fact, I am still learning awk, so I have applied what I've learned so far, but I am hoping for your help once again.

druuna · 12-08-2012, 09:25 AM

So we meet again

Is there a specific reason why you use multiple awk statements? All your requirements can be done with one awk statement:

Code:

awk 'BEGIN{val1=0;val2=0}/^[\.0]\./{val1++}/^[1-9]/{val2++}END{ print "sum: ",NR, "val1: ", val1, "("val1*100/NR"%)", "val2: ", val2, "("val2*100/NR"%)"}' infile

or in a bit more readable form:

Code:

awk 'BEGIN{
  val1 = 0 ;
  val2 = 0
}
/^[\.0]\./ { val1++ } # less than 1
/^[1-9]/   { val2++ } # larger than 1
END{ 
  print "sum: ",NR, "val1: ", val1, "("val1*100/NR"%)", "val2: ", val2, "("val2*100/NR"%)"
}' infile

Output looks like this:

Code:

sum:  13 val1:  7 (53.8462%) val2:  6 (46.1538%)

grail · 12-08-2012, 09:40 AM

As druuna has ably answered the important question, let me just add a correction for some wasted code:

Code:

# below is a useless use of echo
prctg2=$(echo | awk "{print $val2*100/$sum}")

prctg2=$(awk 'BEGIN{print $val2*100/$sum}' val2=$val2 sum=$sum)
# the above negates the problem of letting the shell interfere with any of the data

druuna · 12-09-2012, 04:44 AM

@shivaa: I noticed you read grail's and my reply. If this is solved can you put up the [SOLVED] tag...

BTW: If you ever do need the calculated values outside of awk you can do the following:

Code:

#!/bin/bash
# option 1
echo "------------------------------------"
echo -e "One way (using process substitution)\n"

while read SUM VAL1 PCT1 VAL2 PCT2
do
  # do your stuff here
  echo "Sum         : $SUM"
  echo "Val1        : $VAL1"
  echo "Percentage1 : $PCT1"
  echo "Val2        : $VAL2"
  echo "Percentage2 : $PCT2"
done < <(awk 'BEGIN{val1=0;val2=0}/^[\.0]\./{val1++}/^[1-9]/{val2++}END{ print NR, val1, val1*100/NR, val2, val2*100/NR}' infile)

# option 2
echo -e "\n-----------------------------"
echo -e "An alternative (using a pipe)\n"

awk 'BEGIN{val1=0;val2=0}/^[\.0]\./{val1++}/^[1-9]/{val2++}END{ print NR, val1, val1*100/NR, val2, val2*100/NR}' infile | \
while read SUM VAL1 PCT1 VAL2 PCT2
do
  # do your stuff here
  echo "Sum         : $SUM"
  echo "Val1        : $VAL1"
  echo "Percentage1 : $PCT1"
  echo "Val2        : $VAL2"
  echo "Percentage2 : $PCT2"
done
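One caveat worth noting for option 2: in bash the loop on the right-hand side of a pipe normally runs in a subshell, so variables set inside it are gone once the loop ends. If the values are needed later in the script, a here-document is one portable alternative. The sketch below uses made-up sample values in place of infile and numeric comparisons instead of the regexps above; both are assumptions for illustration only.

```shell
# Made-up sample values standing in for infile (assumption for illustration).
stats=$(printf '%s\n' 0.2 0.7 1.5 3.0 | awk '
  $1 <  1 { val1++ }                   # values below 1
  $1 >= 1 { val2++ }                   # values of 1 and above
  END { print NR, val1 + 0, val2 + 0 } # + 0 prints 0 instead of an empty string
')

# The here-document feeds read in the current shell, so the variables survive.
read -r SUM VAL1 VAL2 <<EOF
$stats
EOF

echo "Sum: $SUM  Val1: $VAL1  Val2: $VAL2"
```

After the read, SUM, VAL1 and VAL2 remain available to the rest of the script.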

shivaa · 12-09-2012, 08:36 AM

Thanks @druuna and @grail. I have not actually tested it yet, which is why I kept this post unsolved.

shivaa · 12-11-2012, 05:46 AM

Code:

# below is a useless use of echo
prctg2=$(echo | awk "{print $val2*100/$sum}")

prctg2=$(awk 'BEGIN{print $val2*100/$sum}' val2=$val2 sum=$sum)
# the above negates the problem of letting the shell interfere with any of the data

Hi Grail, following what you said above, after running the two commands below:

Code:

val2=$(awk '/^[1-9]/ {val2++} END{ print val2}' file1)
sum=$(awk 'END{print NR}' file1)

When I invoke:

Code:

prctg2=$(awk 'BEGIN{print $val2*100/$sum}' val2=$val2 sum=$sum)

It gives me errors such as awk: division by zero or nawk: illegal field $(). I tried plain /usr/bin/awk as well as /usr/xpg4/bin/awk. Also, could you explain the use of val2=$val2 sum=$sum after the print action?

shivaa · 12-11-2012, 06:40 AM

Code:

awk 'BEGIN{
  val1 = 0 ;
  val2 = 0
}
/^[\.0]\./ { val1++ } # less than 1
/^[1-9]/   { val2++ } # larger than 1
END{ 
  print "sum: ",NR, "val1: ", val1, "("val1*100/NR"%)", "val2: ", val2, "("val2*100/NR"%)"
}' infile

Hi Druuna, your solution is working fine. But I still find myself confused about matching based on patterns, so:

1. Can I make the following changes instead of using patterns (assuming that infile contains only floating-point numbers)?

Code:

$1 < 1 { val1++ } # less than 1
$1 > 1 { val2++ } # larger than 1

2. (Please do not mind my asking.) Does /^[\.0]\./ mean all values starting with .0? And what does \./ mean here: all values that are .0.?

druuna · 12-11-2012, 07:21 AM

Quote:

Originally Posted by shivaa

1. Can I make the following changes instead of using patterns (assuming that infile contains only floating-point numbers)?

Code:

$1 < 1 { val1++ } # less than 1
$1 > 1 { val2++ } # larger than 1

Have you tried? You do need to make one of the entries look like >= or <= otherwise 1.0000 won't be detected.
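A minimal sketch of that point, using made-up sample values piped in instead of infile (the data and the variable names below and atleast are assumptions for illustration). With >= on the second rule, 1.0000 is counted, and a value written without a leading zero such as .5 is still compared numerically:

```shell
# Made-up sample values; note the 1.0000 and the .5 without a leading zero.
counts=$(printf '%s\n' 0.25 .5 1.0000 1.75 12.3 | awk '
  $1 <  1 { below++ }     # numeric comparison: catches 0.25 and .5
  $1 >= 1 { atleast++ }   # >= so that 1.0000 is counted too
  END {
    # total, count below 1, count of 1 and above, then both percentages
    printf "%d %d %d %.1f %.1f\n", NR, below, atleast, below*100/NR, atleast*100/NR
  }')

echo "$counts"   # prints: 5 2 3 40.0 60.0
```

Had the second rule used a plain >, the line 1.0000 would fall into neither group and the two percentages would not add up to 100.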

Quote:

Originally Posted by shivaa

2. (Please do not mind my asking.) Does /^[\.0]\./ mean all values starting with .0? And what does \./ mean here: all values that are .0.?

Code:

^[\.0]\.

Values that start with a dot OR a 0 (zero) followed by a dot.

I do believe I made a mistake in the original regexp, but it works for your data because all the entries seem to start with a leading zero (0.01 rather than .01). It can be rewritten as:

Code:

^0\.

Code:

^[1-9]

Values that start with a digit from 1 to 9.
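To see what each pattern actually matches, here is a small sketch on made-up values. The pattern ^0?\. (an optional zero followed by a dot) is not from the posts above; it is only one possible way to accept both 0.01 and .01. The bracket expression is written as [.0] because a dot needs no backslash inside brackets.

```shell
# Made-up values: with and without a leading zero, plus two values >= 1.
result=$(printf '%s\n' 0.01 .01 1.5 10.2 | awk '
  /^[.0]\./ { a++ }   # a dot or a zero, then a dot: matches 0.01 only
  /^0?\./   { b++ }   # optional zero, then a dot: matches 0.01 and .01
  /^[1-9]/  { c++ }   # first character is 1 to 9: matches 1.5 and 10.2
  END { print a + 0, b + 0, c + 0 }')

echo "$result"   # prints: 1 2 2
```

The first counter shows the mistake mentioned above: .01 starts with a dot, but the character after it is a 0 rather than a second dot, so the original pattern skips it.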

shivaa · 12-11-2012, 08:10 AM

Quote:

^[\.0]\. Values that start with a dot OR a 0 (zero) followed by a dot.

Oops... From the beginning I had been reading such patterns as .0, whereas it actually means values beginning with either a "." or a "0", not with .0.

For instance (please correct me if I am wrong):
^[abc] means all values beginning with either a, b, or c. It does not mean all values beginning with abc! I hope this clears up all my previous doubts as well.

Likewise, if I want to match three ranges ($1 <= 0.01; 0.01 < $1 < 0.1; $1 >= 0.1), I can also use patterns built from such regexps! Will surely try it.
Many thanks druuna... I am short of words! You've done a great job!!
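For ranges like these, numeric comparisons may be easier to get right than regexps. A rough sketch with made-up sample values (the data and the counter names low, mid and high are assumptions for illustration):

```shell
# Made-up values, including both boundary values 0.01 and 0.1.
ranges=$(printf '%s\n' 0.005 0.01 0.05 0.1 0.5 | awk '
  $1 <= 0.01            { low++ }    # first range, boundary included
  $1 > 0.01 && $1 < 0.1 { mid++ }    # strictly between the boundaries
  $1 >= 0.1             { high++ }   # third range, boundary included
  END { print low + 0, mid + 0, high + 0 }')

echo "$ranges"   # prints: 2 1 2
```

Adding 0 in the END block makes an empty range print as 0 rather than as an empty string.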

---------------------------

Hi Grail, waiting for your response now (please refer to my reply above).

druuna · 12-11-2012, 08:26 AM

Quote:

Originally Posted by shivaa

For instance (please correct me if I am wrong):
^[abc] means all values beginning with either a, b, or c. It does not mean all values beginning with abc! I hope this clears up all my previous doubts as well.

That is correct.

You might want to revisit this site: Regex Tutorial, Examples and Reference, especially the section on Character Classes or Character Sets.

And this from the wiki page:

Quote:

[ ]
A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: [abcx-z] matches "a", "b", "c", "x", "y", or "z", as does [a-cx-z].

The - character is treated as a literal character if it is the last or the first (after the ^) character within the brackets: [abc-], [-abc]. Note that backslash escapes are not allowed. The ] character can be included in a bracket expression if it is the first (after the ^) character: []abc].

grail · 12-11-2012, 09:18 AM

Quote:

It gives me errors such as awk: division by zero or nawk: illegal field $(). I tried plain /usr/bin/awk as well as /usr/xpg4/bin/awk. Also, could you explain the use of val2=$val2 sum=$sum after the print action?

I cannot vouch for nawk. I am using gawk, so maybe nawk does not like variables being set after the program text. You could simply try using the -v option to set them.

Setting the variables after the quoted code is just a preference of mine when there are several variables, instead of using -v several times.

ntubski · 12-11-2012, 11:16 AM

Quote:

Originally Posted by grail

I cannot vouch for nawk. I am using gawk, so maybe nawk does not like variables being set after the program text. You could simply try using the -v option to set them.

You'll need to use -v for gawk as well: the plain var=val form performs the assignment only after the BEGIN rule has been run.

Quote:

6.1.3.2 Assigning Variables on the Command Line

When the assignment is preceded with the -v option ... the variable is set at the very beginning, even before the BEGIN rules execute. ... Otherwise, the variable assignment is performed ... after the processing of the preceding input file argument.
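Putting that together, here is a sketch of the percentage calculation with -v (the values 6 and 13 are taken from the sample output earlier in the thread). Note also that inside the awk program the variables are written without a leading $. In awk, $val2 means "field number val2", which with an unset variable in BEGIN is the empty $0, and that likely explains the division by zero error reported above.

```shell
val2=6   # sample values, as if computed earlier in the script
sum=13

# -v assigns before BEGIN runs, so both variables are visible there.
prctg2=$(awk -v val2="$val2" -v sum="$sum" 'BEGIN { print val2 * 100 / sum }')
echo "$prctg2"   # prints: 46.1538

# For comparison: an assignment placed after the program text is not
# yet in effect while BEGIN runs, so x is still empty here.
late=$(awk 'BEGIN { print "[" x "]" }' x=5)
echo "$late"     # prints: []
```

The trailing var=val form is still useful for rules that process input files, since the assignment happens just before the file that follows it is read.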

grail · 12-11-2012, 06:50 PM

Thanks ntubski ... I was not aware of this variation

shivaa · 12-31-2012, 04:56 AM

Many thanks @druuna & @grail!
Ciao!