awk command to take average?

johnpaulodonnell · 02-06-2007, 10:03 AM

Hi.

I have a seismic data file which contains information recorded at a variable number of stations for a large number of earthquakes. The number of stations varies because each earthquake may or may not be visible on the seismic trace of a given station. The format of the file is (first two earthquakes):

eq stn rayp
---------------------------
1 D04 7.0773587
1 D06 7.0944495
1 DSB 7.2048063
1 VAL 7.012265
2 D03 5.181161
2 D07 5.2040725
2 D09 5.199767
2 D10 5.199868
2 D14 5.220343
2 D21 5.249284
2 D23 5.240766
2 DSB 5.3214226
2 VAL 5.1128297

I need to calculate an average rayp for each earthquake. E.g. for eq 1, I want to replace the 7.xxxxx values in $3 with a single average value for that eq, and repeat this for all eqs.

In a bash script for loop I could specify:

for i in 1,.......,n
do
....... awk /^"$i"/ '{print $3}' > temp.file
....... perform averaging and so on

but I don't think that /^"$i"/ will match in the way that I want it to: when $i = 1, awk will match the lines beginning 1, 10, 11, ..., 19, 100, 101, ..., 199, 1000, etc., which is not what I need. Does anyone know how I could loop over the eq#'s as an index?

Thanks.

wjevans_7d1@yahoo.co · 02-06-2007, 01:20 PM

There are several things which need to be changed in that script. Here's your original:

Code:

for i in 1,.......,n
do
....... awk /^"$i"/ '{print $3}' > temp.file
....... perform averaging and so on

The first is that the awk command itself needs to be cleaned up, even before one suggests the change you want. Try this at the command line:

Code:

awk /^"abc"/ '{print $3}'

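When you do, you'll notice that awk complains that it can't open a file. The shell strips the double quotes, so awk receives /^abc/ as its entire program and takes the {print $3} as the name of the file on which to perform its actions. You don't want that.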

awk requires that its desired action all be part of its first argument. Try this, where ^D is <Ctrl> D (press the Control key, press and release the D key, and release the Control key):

Code:

awk '/^"abc"/ {print $3}'
abc xxx one
"abc" xxx two
^D

(Typed at a terminal, ^D signals end of input.)

At least it runs. But you'll notice that awk prints the third field of the second data line (two) and nothing for the first. Why? Because what's between the slashes is what's known as a regular expression. For more information on regular expressions, see:

http://en.wikipedia.org/wiki/Regular_expression

In awk, the slashes act as a kind of quotation marks around the regular expression. When you put real quotation marks within those slashes, the quotation marks themselves are taken literally. You don't want that. So rip 'em out. Try this:

Code:

awk '/^abc/ {print $3}'
abc xxx one
"abc" xxx two
^D

You'll notice that it's now the first data line that produces output (one). That's what you want.

Now put this in a script ...

Code:

#!/bin/sh

for earthquake_number in 1 2
do
  awk '/^$earthquake_number/ {print $3}' > temp.file
  echo === results for earthquake $earthquake_number
  cat temp.file
done

Then put this in a data file. It's exactly the data you gave in your example.

Code:

eq stn rayp
---------------------------
1 D04 7.0773587
1 D06 7.0944495
1 DSB 7.2048063
1 VAL 7.012265
2 D03 5.181161
2 D07 5.2040725
2 D09 5.199767
2 D10 5.199868
2 D14 5.220343
2 D21 5.249284
2 D23 5.240766
2 DSB 5.3214226
2 VAL 5.1128297

Before going to run this, notice that I replaced your variable i with the variable earthquake_number. You don't need to do that, but single-letter variables are a bad habit. Here's why: Eventually you'll be writing a script or program that is larger than what you're doing here. And you'll have some variable i within it. And at some point you'll want to look at all occurrences of that variable. So you tell your text editor to search for all instances of i. It will stop at everything containing i, such as "if", or longer variable names which contain i. If you adopt the rule that every variable name is more than one letter long and isn't contained within some other variable name or bash command such as "if", you'll make that search easier.

Just don't make the variable names excessively long, or you'll risk misspelling them. If you use the variable fred in your script and misspell it as fread at one point, bash won't complain; it will just treat it as a new variable, resulting in a difficult-to-find bug. So just be careful.
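If you'd rather have the shell catch that kind of slip for you, one option is set -u, which makes expanding an unset variable an error. Here is a minimal sketch; fred and fread are the names from the example above, and outcome is a name made up for this demo:

```shell
fred=1

# Run the risky expansion in a subshell so that set -u stays contained.
# With set -u, expanding the unset (misspelled) variable fread is an
# error, and the subshell exits with a non-zero status.
if ( set -u; echo "value: $fread" ) 2>/dev/null
then
  outcome="not caught"
else
  outcome="caught"
fi

echo "misspelling $outcome"
```

In a real script you would more likely put set -u near the top and let the script stop at the first misspelled name.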

But I digress. Harrumph. Let's make your script executable and run it:

Code:

chmod 700 script.sh
./script.sh < data.txt

You'll notice that it didn't find any of the data. Why? Because the $earthquake_number in the awk command is surrounded by single quotation marks. This isn't an awk issue; it's a bash issue. (I assume that the shell you're running is bash.) When bash sees something between single quotation marks, it passes it along without any modification. When that something is between double quotation marks, bash does its usual preprocessing wherever it sees a dollar sign ($).
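A quick way to see the difference for yourself is a small sketch like this one. The value 2 and the variable names single and double are only there for the demonstration:

```shell
earthquake_number=2

# Single quotes: the shell passes the text through untouched.
single='/^$earthquake_number/'

# Double quotes: the shell substitutes the variable's value.
double="/^$earthquake_number/"

echo "$single"
echo "$double"
```

The first echo shows the literal text $earthquake_number, which is what awk was being handed; the second shows the pattern with the number filled in.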

So let's substitute double quotation marks on that line:

Code:

#!/bin/sh

for earthquake_number in 1 2
do
  awk "/^$earthquake_number/ {print $3}" > temp.file
  echo === results for earthquake $earthquake_number
  cat temp.file
done

Now run it. You'll notice two things wrong. One thing is that instead of containing only the third field of a line, output shows the whole line. Why? Because of those double quotes we just inserted. The shell replaces the $3 with the third command line argument. Since there were no command line arguments (let alone three of them), it replaces the $3 with exactly nothing, so

Code:

print $3

is passed along to awk as

Code:

print

The way to tell bash to interpret some of the dollar signs within double quotes as usual, but to pass others exactly as it sees them (as though they were inside single quotes), is to put a backslash before the dollar signs that should be left alone. Do that with the $3 in this script, so it looks like this:

Code:

#!/bin/sh

for earthquake_number in 1 2
do
  awk "/^$earthquake_number/ {print \$3}" > temp.file
  echo === results for earthquake $earthquake_number
  cat temp.file
done

and then run it.

Now you get just the third field in each line, which is what you want. But why don't you get any results for the second case?

The answer is that input redirection (that "<" thing) hands the script a single input stream, and that stream can be read just once. The first awk reads it all the way to the end, so the second time through the loop, there's nothing left to read. How do we fix this?
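To see this effect in isolation, here is a small sketch in which two cat commands share one redirected input. The file names first.txt and second.txt and the temporary directory are made up for the demonstration:

```shell
workdir=$(mktemp -d)
printf '1 D04 7.0773587\n2 D03 5.181161\n' > "$workdir/data.txt"

# Both cat commands inherit the same standard input.
# The first reads to end of file; the second finds nothing left.
{
  cat > "$workdir/first.txt"
  cat > "$workdir/second.txt"
} < "$workdir/data.txt"

first_lines=$(awk 'END {print NR}' "$workdir/first.txt")
second_lines=$(awk 'END {print NR}' "$workdir/second.txt")
echo "first reader got $first_lines lines, second got $second_lines"
```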

The answer is to put the redirection inside the loop, so it's seen each time through the loop. Remember where $3 meant the third command-line argument, even though we didn't have any command-line arguments? Well, we're going to have one command line argument now: the input file name. Change the awk line in the script so that the script now looks like this:

Code:

#!/bin/sh

for earthquake_number in 1 2
do
  awk "/^$earthquake_number/ {print \$3}" < "$1" > temp.file
  echo === results for earthquake $earthquake_number
  cat temp.file
done

and run it. Not like this:

Code:

./script.sh < data.txt    # incorrect

but like this, since we're not doing redirection on the command line, but instead giving the script a command-line argument (note the missing "<"):

Code:

./script.sh data.txt    # correct

Now you at least get what you originally thought you'd get, although I haven't answered your original question yet. Bear with me. We'll get there, but not just yet.

Let's first address a glaring problem: In the "for earthquake_number" line, we're mentioning every earthquake number. If there were 15 earthquakes, that line would be:

Code:

for earthquake_number in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

There are several ways to make this increasingly elegant. We'll tackle just the first one now, and then get to your real question (I know you've been holding your breath), and then come back for ways to make it even better.

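Plain sh doesn't have a builtin way to construct a loop in which a counter is incremented to a limit. (bash does have a C-style for loop, but building our own is a good excuse to meet shell arithmetic.) So we cheat. bash does have arithmetic. Try this at the command line: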

Code:

echo 1+2
echo $((1+2))

Yep. That's double parentheses, and don't put spaces between them, either. In other words, don't do this, which makes bash try to run a command named 1+2:

Code:

echo $( (1+2) )       # incorrect
echo $( ( 1+2 ) )     # incorrect

You'll notice that in the line with $((...)) you actually did arithmetic! Wow! bash does arithmetic!

Ok, now we'll modify the for loop in the script, so we have this:

Code:

#!/bin/bash
# [[ ... ]] is a bash feature, so ask for bash rather than plain sh.

earthquake_number=0

while [[ $earthquake_number -lt 15 ]]
do
  earthquake_number=$(($earthquake_number+1))

  awk "/^$earthquake_number/ {print \$3}" < "$1" > temp.file
  echo === results for $earthquake_number
  cat temp.file
done

And if you have a standard 24x80 or 25x80 window, as God intended, you'd better run that script like this ...

Code:

./script.sh data.txt | less

... so the output doesn't fly off the screen. There you go. Everything for the first 15 earthquakes.

Now for the moment you've been waiting for. Let's put a snake in the grass. Let's add a new line to the end of your data, so it now looks like this:

Code:

eq stn rayp
---------------------------
1 D04 7.0773587
1 D06 7.0944495
1 DSB 7.2048063
1 VAL 7.012265
2 D03 5.181161
2 D07 5.2040725
2 D09 5.199767
2 D10 5.199868
2 D14 5.220343
2 D21 5.249284
2 D23 5.240766
2 DSB 5.3214226
2 VAL 5.1128297
10 D04 6.123456

When you run it, you see exactly the problem you posed in the first place. The data for earthquake 10 is included in that for earthquake 1 (and again processed for earthquake 10).

So let's add something to that regular expression in the awk line. Let's put a space between the $earthquake_number and the terminating slash, and see what happens. The script now looks like this:

Code:

#!/bin/bash

earthquake_number=0

while [[ $earthquake_number -lt 15 ]]
do
  earthquake_number=$(($earthquake_number+1))

  awk "/^$earthquake_number / {print \$3}" < "$1" > temp.file
  echo === results for $earthquake_number
  cat temp.file
done

Run it again. You'll get exactly what you want.
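If the trailing space feels fragile (it relies on a single space following the number), an alternative sketch is to hand the number to awk with its -v option and compare the first field exactly. The awk variable name n, the temporary directory, and the shortened copy of the data are just for illustration:

```shell
workdir=$(mktemp -d)
cat > "$workdir/data.txt" <<'EOF'
eq stn rayp
---------------------------
1 D04 7.0773587
1 VAL 7.012265
2 D03 5.181161
10 D04 6.123456
EOF

earthquake_number=1

# -v copies the shell value into the awk variable n (a name chosen here).
# The program stays in single quotes, so $1 and $3 need no backslashes,
# and $1 == n compares the whole first field, so 10 never matches 1.
awk -v n="$earthquake_number" '$1 == n {print $3}' "$workdir/data.txt" > "$workdir/temp.file"
cat "$workdir/temp.file"
```

This also sidesteps the single-versus-double-quote juggling, at the cost of one extra option on the awk command line.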

So you can go now if you want. But stick around. I'll show you a couple of ways to make your script more elegant.

First, you probably don't want to edit your script to show the correct number of earthquakes every time you run it. So let the number of earthquakes be the second command-line argument when you invoke the script. The script no longer contains a "15". It looks like this:

Code:

#!/bin/bash

earthquake_number=0

while [[ $earthquake_number -lt $2 ]]
do
  earthquake_number=$(($earthquake_number+1))

  awk "/^$earthquake_number / {print \$3}" < "$1" > temp.file
  echo === results for $earthquake_number
  cat temp.file
done

You run it like this:

Code:

./script.sh data.txt 15 | less

Wouldn't it be even better if you didn't even have to worry about the earthquake count when you ran the script? You can handle this in one of three ways.

1. Assume that all the earthquakes are numbered starting at 1, with no gaps.
2. Let there be gaps, and skip over them.
3. Look for gaps, and exit with an error message if you find one.

In any of these ways, you no longer have to include the highest earthquake number as part of the command line, so you can run the command like this:

Code:

./script.sh data.txt | less

I'll now look at each of these ways of figuring out when to stop.

Alternative 1: assume that all the earthquakes are numbered starting at 1, with no gaps.

First, let's change that "while" statement so it looks as though we have an infinite loop:

Code:

while true

If we run that, the program will never end. It will start with earthquake 1, move to earthquake 2, and continue thus forever. So we look for an earthquake number which is not represented in the input file. We know we've found a gap when the output file contains no data. The check looks like this:

Code:

  if [[ ! -s temp.file ]]
  then
    break
  fi

The "-s" delivers "true" if temp.file contains at least one byte. The "!" means "not", so the code means: If temp.file is empty, break out of this loop.
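Here is a small sketch of that test on its own, using one empty and one non-empty file. The file names are made up, and it uses the portable [ ... ] form of the test so that it runs in plain sh as well as bash:

```shell
workdir=$(mktemp -d)
: > "$workdir/empty.file"                 # create a zero-byte file
echo 7.0773587 > "$workdir/full.file"     # a file with one line in it

for file in "$workdir/empty.file" "$workdir/full.file"
do
  # -s is true only if the file exists and holds at least one byte.
  if [ ! -s "$file" ]
  then
    echo "$(basename "$file"): empty"
  else
    echo "$(basename "$file"): has data"
  fi
done
```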

So the script now looks like this:

Code:

#!/bin/bash

earthquake_number=0

while true
do
  earthquake_number=$(($earthquake_number+1))

  awk "/^$earthquake_number / {print \$3}" < "$1" > temp.file

  if [[ ! -s temp.file ]]
  then
    break
  fi

  echo === results for $earthquake_number
  cat temp.file
done

Run it against the modified data (the data that includes something for earthquake 10), and you'll notice that it stops after earthquake 2, because there is nothing for earthquake 3. After running the program, you can do

Code:

cat temp.file

and see that the temporary data file is empty.

Alternative 2: let there be gaps, and skip over them.

To do this, we need to find the maximum earthquake number. Try this at the command line:

Code:

sort data.txt

You'll note that it sorts the input data, but it places earthquake 10 between earthquake 1 and earthquake 2. We want a numeric sort with the highest earthquake number at the end. So try this:

Code:

sort -n data.txt

-n means numerical sort, instead of sort by string value.

You'll notice that the highest-numbered earthquake appears at the end. Furthermore, that pesky "eq stn rayp" line is no longer at the end, but has been moved near the beginning, because a numeric sort treats a line that doesn't start with a number as zero.

Now we want just the final line, so do this:

Code:

sort -n data.txt | tail -1

But we want just the first field of that, so do this:

Code:

sort -n data.txt | tail -1 | awk '{print $1}'

Presto! All you have now is the highest earthquake number! Now comes the real magic:

Code:

final_earthquake=$(sort -n data.txt | tail -1 | awk '{print $1}')
echo $final_earthquake

What happened here?

Remember the $((...)) construct that we used for arithmetic? Well, $(...), with single parentheses instead of double, does something entirely different. It executes what's between the parentheses as a separate shell command, takes the output, and plugs that in where the $(...) was!

So if the output of

Code:

sort -n data.txt | tail -1 | awk '{print $1}'

is 10, then what we did with final_earthquake is as though we had typed:

Code:

final_earthquake=10
echo $final_earthquake
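As an aside, the same maximum can be found without sorting at all. This is only a sketch: it assumes the file always starts with exactly two header lines, as the sample data does, and the temporary directory and shortened data are for illustration:

```shell
workdir=$(mktemp -d)
cat > "$workdir/data.txt" <<'EOF'
eq stn rayp
---------------------------
1 D04 7.0773587
2 D03 5.181161
10 D04 6.123456
EOF

# NR > 2 skips the two header lines; $1 + 0 forces a numeric comparison.
# max starts out unset, which awk treats as 0 in a numeric comparison.
final_earthquake=$(awk 'NR > 2 && $1 + 0 > max {max = $1 + 0} END {print max}' "$workdir/data.txt")
echo "$final_earthquake"
```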

So let's change our script in three ways:

1. Insert that computation of the final earthquake number.
2. Use that final earthquake number as our termination test.
3. When we find an empty temp.file, just skip over that and continue, rather than exiting.

Our script now looks like this:

Code:

#!/bin/bash

# Read the file named on the command line, not a hard-coded data.txt.
final_earthquake=$(sort -n "$1" | tail -1 | awk '{print $1}')

earthquake_number=0

while [[ $earthquake_number -lt $final_earthquake ]]
do
  earthquake_number=$(($earthquake_number+1))

  awk "/^$earthquake_number / {print \$3}" < "$1" > temp.file

  if [[ ! -s temp.file ]]
  then
    continue
  fi

  echo === results for $earthquake_number
  cat temp.file
done

When you run it, you'll notice that it outputs results for earthquakes 1, 2, and 10. Just what you want, right?

Except that you're probably running in an environment where if earthquake numbers are missing from the file, you want to come to a screeching halt, because there's something wrong with your data collection algorithm. This brings us to ...

Alternative 3: look for gaps, and exit with an error message if you find one.

Just change what you do when you find an empty temp.file.

Code:

  if [[ ! -s temp.file ]]
  then
    echo "missing earthquake $earthquake_number!!!" >&2
    exit 1
  fi

So the script looks like this:

Code:

#!/bin/bash

# Read the file named on the command line, not a hard-coded data.txt.
final_earthquake=$(sort -n "$1" | tail -1 | awk '{print $1}')

earthquake_number=0

while [[ $earthquake_number -lt $final_earthquake ]]
do
  earthquake_number=$(($earthquake_number+1))

  awk "/^$earthquake_number / {print \$3}" < "$1" > temp.file

  if [[ ! -s temp.file ]]
  then
    echo "missing earthquake $earthquake_number!!!" >&2
    exit 1
  fi

  echo === results for $earthquake_number
  cat temp.file
done

There ya go.
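None of the scripts above gets as far as the "perform averaging" step from the original question. One way it might look, assuming temp.file holds one rayp value per line as it does in those scripts, is sketched here. The variable name average and the temporary directory are made up for the demo:

```shell
workdir=$(mktemp -d)

# Stand-in for temp.file: the rayp values of earthquake 1 from the sample data.
printf '%s\n' 7.0773587 7.0944495 7.2048063 7.012265 > "$workdir/temp.file"

# Add up field 1 of every line, then divide by the line count (NR).
# The NR > 0 guard avoids dividing by zero on an empty file.
average=$(awk '{sum += $1} END {if (NR > 0) printf "%.7f\n", sum / NR}' "$workdir/temp.file")
echo "average rayp: $average"
```

Inside the loop, the awk line would simply take the place of the cat temp.file.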

Hope this helps.

colucix · 02-06-2007, 05:49 PM

Here is my awk solution:

Code:

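# Read the two header lines up front so they are not treated as data.
BEGIN { getline head1 ; getline head2 }
{
  eq[$1] += $3      # running sum of rayp for this earthquake
  cc[$1] += 1       # number of stations seen for this earthquake
  line += 1
  one[line] = $1    # remember eq and stn of every line, in input order
  two[line] = $2
}
END {
  for (x in eq) mean[x] = eq[x] / cc[x]
  print head1
  print head2
  # Reprint every input line with rayp replaced by its earthquake's mean.
  for (i = 1; i <= line; i++)
      printf "%3d %s %11.7f\n", one[i], two[i], mean[one[i]]
}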

This awk code accumulates the sum of rayp for each eq, stores fields 1 and 2 from each line, and finally computes the averages and prints them out (together with the earthquake number and station code from each line of the original input file).
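One way to try a program like this is to save it in a file and run it with awk -f. The sketch below does that with a condensed variant of the program and a shortened copy of the sample data; the file names average.awk and averaged.txt and the temporary directory are hypothetical:

```shell
workdir=$(mktemp -d)

cat > "$workdir/data.txt" <<'EOF'
eq stn rayp
---------------------------
1 D04 7.0773587
1 D06 7.0944495
1 DSB 7.2048063
1 VAL 7.012265
2 D03 5.181161
2 VAL 5.1128297
EOF

# average.awk is a hypothetical file name; this is a condensed variant
# of the program above, not a verbatim copy.
cat > "$workdir/average.awk" <<'EOF'
BEGIN { getline head1 ; getline head2 }
{
  eq[$1] += $3 ; cc[$1]++
  one[++line] = $1 ; two[line] = $2
}
END {
  print head1 ; print head2
  for (i = 1; i <= line; i++)
    printf "%3d %s %11.7f\n", one[i], two[i], eq[one[i]] / cc[one[i]]
}
EOF

# -f tells awk to read its program from a file.
awk -f "$workdir/average.awk" "$workdir/data.txt" > "$workdir/averaged.txt"
cat "$workdir/averaged.txt"
```

Every station line of an earthquake comes out with the same averaged value in the third column, with the two header lines passed through unchanged.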

johnpaulodonnell · 02-07-2007, 04:11 AM

Many, many thanks for all that. Really appreciate it!