shell command using awk fields inside awk

one71 · 06-25-2008, 03:29 AM

Hi,

I am not an awk specialist. In short, I want to write a script (bash/awk) that reads the squid access log, reduces it to the unique destination "hosts", and then prints, for each host:
how many times the destination host was requested, its % of the requests, the destination host name, and the destination host IP.
All of this without writing to tmp files.

Say the input is:

Code:

1196377810.470    405 10.4.1.119 TCP_MISS/200 410 POST http://shttp.msg.yahoo.com/notify/ - DEFAULT_PARENT/localhost text/plain
1196377805.218   6260 10.1.50.237 TCP_MISS/502 1419 GET http://einstein.aei.mpg.de/download/3e7/h1_0646.35_S5R2 - ANY_PARENT/localhost text/html
1196377808.611   2651 10.1.50.237 TCP_MISS/502 1429 GET http://einstein.astro.gla.ac.uk/download/28d/l1_0646.40_S5R2 - ANY_PARENT/localhost text/html
1196377808.666  25343 10.1.24.144 TCP_MISS/200 226 GET http://41.246.102.212/din.aspx? - DIRECT/41.246.102.212 application/octet-stream

I have written code that outputs:

Code:

1        25%     shttp.msg.yahoo.com
1        25%     einstein.astro.gla.ac.uk
1        25%     einstein.aei.mpg.de
1        25%     41.246.102.212

The code is:

Code:

#! /bin/bash
FILEIN=./access1.log
FILEOUT=./out.log

let WCTOT=`wc -l $FILEIN |awk '{print $1}'`

awk '{print $7}'  $FILEIN | \
sed -e 's/http:\/\///g' -e 's/https:\/\///g' -e 's/\/.*//' | \
sort | \
uniq -c | \
sort -rn | \
awk '{print $1, "\t", ($1*100)/"'$WCTOT'" "%", "\t", $2}' > $FILEOUT

As you can see, I am "just" missing the host name -> host IP resolution, but I have no idea how to get it!

I mean that "somewhere" in the last line of the code (inside the awk) I should put the command:

Code:

host -t a $2 | grep address | awk '{print $4}'

and print its result as the last column.

Conceptually something like:

Code:

awk '{ABC=`host -t a $2 | grep address | awk '{print $4}'`; {print $1, "\t", ($1*100)/"'$WCTOT'" "%", "\t", $2, $ABC}' > $FILEOUT

although I know that the syntax is wrong!

Do you know a way to solve this problem?

Thanks

radoulov · 06-25-2008, 10:56 AM

If Perl is acceptable (the output is not sorted; what do you want to sort by?):

Code:

perl -MSocket -lane'
  $x{$1}++ if m|http://(.*?)/|;
END {
    printf "%02.f%% %s %s\n",
  $x{$_}/$. * 100,
    $_ ,
  inet_ntoa(scalar gethostbyname($_)) 
    for keys %x
}' file

If you want to sort by number of requests:

Code:

perl -MSocket -lane'
  $x{$1}++ if m|http://(.*?)/|;
END {
    printf "%02.f%% %s %s\n",
  $x{$_}/$. * 100,
    $_ ,
  inet_ntoa(scalar gethostbyname($_)) 
    for reverse sort {$x{$a}<=>$x{$b}} keys  %x 
}' file

one71 · 06-26-2008, 12:29 AM

Hi,

thanks for your code.

Yes, I want to sort the hosts from the most requested to the least requested.
For practical use Perl could be OK, but I would really appreciate it if someone could give me a hint on how to do the same with bash/awk. The code that I pasted might not be elegant, but it performs extremely well (80 MB in 2 sec) and I would really like to "extend it".
I have tried your code: for very small files (4 lines/4 different hosts requested) it works, but already on a file with 100 lines it breaks shortly after starting with:
Code:
```
Bad arg length for Socket::inet_ntoa, length is 0, should be 4 at -e line 8, <> line 449197.
END failed--call queue aborted, <> line 449197.
```
(and one column is missing:
- column 1= how many times destination host was requested
- column 2= % of the requests
- column 3= destination host name
- column 4= destination host IP
)

Any idea?

radoulov · 06-26-2008, 02:52 AM

I understand, try this:

Code:

perl -MSocket -lne'
  $x{$1}++ if m|http://([^/]+?)/|;
END {
  for (reverse sort {$x{$a}<=>$x{$b}} keys  %x) {
    $ip = gethostbyname($_); 
    defined $ip and $ip = inet_ntoa($ip) or $ip = "N/A(invalid host?)"; 
    printf "%d\t%02.f%%\t%-30s\t%s\n",
      $x{$_},
        $x{$_}/$. * 100,
          $_ ,
            $ip 
  }
}' file

Yes, gethostbyname is slow ...
I suppose your code is fast because you don't do the host resolution.

If this still errors, could you attach a bigger sample of your log?

Mr. C. · 06-26-2008, 02:59 AM

Using awk to do this is really ugly, and FAR more expensive. You'll be calling host, and having to parse the output from within awk, but that's not easy, as the output

The Perl script is more efficient and has only a minor error; don't throw the baby out with the bathwater.
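
For reference, the awk mechanism under discussion is `command | getline`, which runs a shell command and reads its output from inside awk. A minimal sketch, assuming a stub `echo` command in place of `host` so that it runs without DNS; the `lookup_column` function and the `LOOKUP` variable are made-up names for illustration only:

```shell
#!/bin/sh
# Sketch: run one external command per input line from inside awk.
# LOOKUP is a hypothetical knob holding the command prefix to run.
# A real run might set LOOKUP="host -t a"; the default is a stub.
lookup_column() {
  awk -v lookup="${LOOKUP:-echo stub-address-for}" '{
    cmd = lookup " " $NF            # build the command from the last field
    result = "N/A"
    if ((cmd | getline line) > 0)   # read the first line of its output
      result = line
    close(cmd)                      # close the pipe so awk does not run out of them
    print $0 "\t" result
  }'
}

printf '3 example.org\n1 example.net\n' | lookup_column
```

Each input line costs one extra process, which is where the expense comes from. The field value is also passed to a shell unquoted, so this is only reasonable on input you trust.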

What is the input line where the script is exiting?

If you don't understand, or don't want, the Perl scripts above, try just this last script to do your hostname -> IP translation. Add it to the end of your pipeline:

Code:

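 ... your stuff here ... | 
  perl -MSocket -lane '
    # Input fields: $F[0] = count, $F[1] = percentage, $F[2] = host
    my $host = $F[2];
    my $ip = $host;                          # already an IP address? keep it
    if ($host !~ /^\d+\.\d+\.\d+\.\d+$/) {
        if (! exists $name2ip{$host}) {      # resolve each name only once
            my $packed = gethostbyname($host);
            $name2ip{$host} = defined $packed ? inet_ntoa($packed) : "N/A";
        }
        $ip = $name2ip{$host};
    }
    print join "\t", @F, $ip;                # count, percentage, host, IP
  '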

radoulov · 06-26-2008, 03:54 PM

With AWK and sort (not elegant, as already stated):

Code:

awk -F'http://' '{
  sub(/\/.*/, "", $2)
  _[$2]++
  }
END {
  total = NR                      # save the count: "cmd | getline" can change NR
  for (h in _) {
    ips = ""
    if (h) {
      cmd = "host " h
      while ((cmd | getline line) > 0)
        if (line ~ /address/) {
          n = split(line, t, " ")
          ips = ips ? ips "," t[n] : t[n]
        }
      close(cmd)                  # must match the command string exactly
    }
    printf "%d\t%.2f\t%-30s\t%s\n",
      _[h], _[h]/total*100, h ? h : "invalid host", ips ? ips : "N/A"
  }
}' file|sort -nr

You can even do the numeric sort inside awk, but it's too much work for nothing.
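
To give an idea of what sorting inside awk involves, here is a minimal sketch in plain POSIX awk (no gawk extensions). It counts the first field of each line and orders the keys with a small insertion sort; the `count_sorted` name and the one-word-per-line input are assumptions made for the example, not part of the script above:

```shell
#!/bin/sh
# Sketch: count the first field of each line and print the counts in
# descending order without calling an external sort.
count_sorted() {
  awk '{ count[$1]++ }
  END {
    n = 0
    for (k in count) key[++n] = k   # copy the keys into an indexed array
    # insertion sort of key[1..n] by descending count
    for (i = 2; i <= n; i++) {
      k = key[i]
      for (j = i - 1; j >= 1 && count[key[j]] < count[k]; j--)
        key[j + 1] = key[j]
      key[j + 1] = k
    }
    for (i = 1; i <= n; i++)
      printf "%d\t%s\n", count[key[i]], key[i]
  }'
}

printf 'b\na\nb\nc\nb\na\n' | count_sorted
```

The order of keys with equal counts is unspecified, because awk does not define the traversal order of `for (k in count)`. Piping to `sort -nr` is shorter and probably faster on large inputs.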

Mr. C. · 06-26-2008, 04:11 PM

That's a total of 10 processes + 1 for each host call, and 10 passes through the data, including 3 sorts on the data. Very nasty.

Radoulov - nice work!
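
On the process count: the counting and percentage steps, leaving out the host lookups, can be done in one awk pass plus a single sort. A minimal sketch, assuming the URL is field 7 as in the sample log at the top of the thread; `host_tally` is a made-up name:

```shell
#!/bin/sh
# Sketch: tally destination hosts from squid access-log lines on stdin.
# Assumes the URL is in field 7, as in the sample log in this thread.
host_tally() {
  awk '{
    host = $7
    sub(/^[a-z]+:\/\//, "", host)   # drop a leading scheme such as http://
    sub(/\/.*/, "", host)           # drop the path
    count[host]++
  }
  END {
    # NR is still the number of log lines here, so no separate wc is needed
    for (h in count)
      printf "%d\t%.0f%%\t%s\n", count[h], count[h] * 100 / NR, h
  }' | sort -rn
}

# Usage: host_tally < access1.log
```

This reads the data once and starts two processes in total. The IP column would still need one lookup per unique host on top of that.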