LinuxQuestions.org
Old 06-25-2008, 03:29 AM   #1
one71
LQ Newbie
 
Registered: May 2008
Posts: 6

Rep: Reputation: 0
shell command using awk fields inside awk


Hi,

I am not an awk specialist. In short, I want to write a script (bash/awk) which reads the squid access log, reduces it to the unique destination hosts, and then prints for each one:
how many times the destination host was requested, the % of all requests, the destination host name, and the destination host IP.
All of this without writing to temporary files.

Say the input is:

Code:
1196377810.470    405 10.4.1.119 TCP_MISS/200 410 POST http://shttp.msg.yahoo.com/notify/ - DEFAULT_PARENT/localhost text/plain
1196377805.218   6260 10.1.50.237 TCP_MISS/502 1419 GET http://einstein.aei.mpg.de/download/3e7/h1_0646.35_S5R2 - ANY_PARENT/localhost text/html
1196377808.611   2651 10.1.50.237 TCP_MISS/502 1429 GET http://einstein.astro.gla.ac.uk/download/28d/l1_0646.40_S5R2 - ANY_PARENT/localhost text/html
1196377808.666  25343 10.1.24.144 TCP_MISS/200 226 GET http://41.246.102.212/din.aspx? - DIRECT/41.246.102.212 application/octet-stream
I have written some code which outputs:
Code:
1        25%     shttp.msg.yahoo.com
1        25%     einstein.astro.gla.ac.uk
1        25%     einstein.aei.mpg.de
1        25%     41.246.102.212
The code is:

Code:
#! /bin/bash
FILEIN=./access1.log
FILEOUT=./out.log

let WCTOT=`wc -l $FILEIN |awk '{print $1}'`

awk '{print $7}'  $FILEIN | \
sed -e 's/http:\/\///g' -e 's/https:\/\///g' -e 's/\/.*//' | \
sort | \
uniq -c | \
sort -rn | \
awk '{print $1, "\t", ($1*100)/"'$WCTOT'" "%", "\t", $2}' > $FILEOUT
As you can see, I am "just" missing the host name -> host IP resolution, but I have no idea how to get it!

I mean, somewhere in the last line of the code (inside the awk) I would need to run the command:
Code:
host -t a $2 | grep address | awk '{print $4}'
and print its result as the last column.

Conceptually something like:

Code:
awk '{ABC=`host -t a $2 | grep address | awk '{print $4}'`; {print $1, "\t", ($1*100)/"'$WCTOT'" "%", "\t", $2, $ABC}' > $FILEOUT
although I know that the syntax is wrong!

Do you know a way to solve this problem?

Thanks
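
For reference, awk can run an external command and read its output with the `command | getline var` form, which is the closest match to the idea above. The following is only a minimal sketch under some assumptions: the function name `add_ip_column` and the `LOOKUP` override are made up for illustration, the input is expected to be the "count percent host" lines produced by the script above, and host names taken from a log are passed to a shell, so this should only be run on trusted input.

```shell
#!/bin/bash
# Sketch: append an IP column to lines of the form "count percent host".
# LOOKUP is a hypothetical override so the function can be tried without DNS;
# it defaults to the "host -t a" call from the post above.
add_ip_column() {
  awk -v lookup="${LOOKUP:-host -t a}" '{
    cmd = lookup " " $NF              # $NF is the host name (last field)
    ip = "N/A"                        # stays N/A when nothing resolves
    while ((cmd | getline line) > 0)  # read the command output line by line
      if (line ~ /address/) {
        n = split(line, part, " ")
        ip = part[n]                  # last word; the last address wins if there are several
      }
    close(cmd)                        # release the pipe before the next host
    print $0 "\t" ip
  }'
}
```

It would go at the very end of the pipeline, for example `... | sort -rn | awk '{...}' | add_ip_column > $FILEOUT`. Each line costs one extra process, so it is best applied after `uniq -c`, when every host appears only once.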
 
Old 06-25-2008, 10:56 AM   #2
radoulov
Member
 
Registered: Apr 2007
Location: Milano, Italia/Варна, България
Distribution: Ubuntu, Open SUSE
Posts: 212

Rep: Reputation: 35
If Perl is acceptable (the output is not sorted; what do you want to sort by?):
Code:
perl -MSocket -lane'
  $x{$1}++ if m|http://(.*?)/|;
END {
    printf "%02.f%% %s %s\n",
  $x{$_}/$. * 100,
    $_ ,
  inet_ntoa(scalar gethostbyname($_)) 
    for keys  %x 
}' file
If you want to sort by number of requests:

Code:
perl -MSocket -lane'
  $x{$1}++ if m|http://(.*?)/|;
END {
    printf "%02.f%% %s %s\n",
  $x{$_}/$. * 100,
    $_ ,
  inet_ntoa(scalar gethostbyname($_)) 
    for reverse sort {$x{$a}<=>$x{$b}} keys  %x 
}' file

Last edited by radoulov; 06-25-2008 at 02:03 PM.
 
Old 06-26-2008, 12:29 AM   #3
one71
LQ Newbie
 
Registered: May 2008
Posts: 6

Original Poster
Rep: Reputation: 0
Hi,

thanks for your code.
  1. Yes, I want to sort the hosts from the one with the most requests to the one with the fewest.
  2. For practical use Perl could be OK, but I would really appreciate it if someone gave me a hint on how to do the same with bash/awk. The code that I pasted might not be elegant, but it is extremely fast (80MB in 2 sec) and I would really like to "extend it".
  3. I have tried your code: for very small files (4 lines/4 different hosts requested) it works, but already on a file with 100 lines it breaks shortly after starting with:

    Code:
    Bad arg length for Socket::inet_ntoa, length is 0, should be 4 at -e line 8, <> line 449197.
    END failed--call queue aborted, <> line 449197.
    (and it misses one column:
    • column 1= how many times destination host was requested
    • column 2= % of the requests
    • column 3= destination host name
    • column 4= destination host IP
    )

any idea?
 
Old 06-26-2008, 02:52 AM   #4
radoulov
Member
 
Registered: Apr 2007
Location: Milano, Italia/Варна, България
Distribution: Ubuntu, Open SUSE
Posts: 212

Rep: Reputation: 35
I understand, try this:

Code:
perl -MSocket -lne'
  $x{$1}++ if m|http://([^/]+?)/|;
END {
  for (reverse sort {$x{$a}<=>$x{$b}} keys  %x) {
    $ip = gethostbyname($_); 
    defined $ip and $ip = inet_ntoa($ip) or $ip = "N/A(invalid host?)"; 
    printf "%d\t%02.f%%\t%-30s\t%s\n",
      $x{$_},
        $x{$_}/$. * 100,
          $_ ,
            $ip 
  }
}' file
Yes, gethostbyname is slow ...
I suppose your code is fast because you don't do the host resolution.

If this still fails, could you attach a bigger sample of your log?

Last edited by radoulov; 06-26-2008 at 03:04 AM.
 
Old 06-26-2008, 02:59 AM   #5
Mr. C.
Senior Member
 
Registered: Jun 2008
Posts: 2,529

Rep: Reputation: 59
Using awk to do this is really ugly, and FAR more expensive. You'll be calling host, and having to parse the output from within awk, but that's not easy, as the output

The perl script is more efficient, and has a minor error - don't toss the baby out with the bathwater.

What is the input line where the script is exiting?

If you don't understand, or don't want, the perl scripts above, try just this small script to do your hostname -> IP translation. Add it to the end of your pipeline:
Code:
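 ... your stuff here ... | 
  perl -MSocket -lane '
    # $F[2] is the host column; look it up only when it is not already an IPv4 address
    if ($F[2] !~ /^\d+\.\d+\.\d+\.\d+$/) {
        if (! exists $name2ip{$F[2]}) {
            my $packed = gethostbyname($F[2]);   # undef when the name does not resolve
            $name2ip{$F[2]} = defined $packed ? inet_ntoa($packed) : "N/A";
        }
        push @F, $name2ip{$F[2]};                # cached, so each name is resolved once
    } else {
        push @F, $F[2];
    }
    print join "\t", @F;                         # count, percent, host, IP
  '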
 
Old 06-26-2008, 03:54 PM   #6
radoulov
Member
 
Registered: Apr 2007
Location: Milano, Italia/Варна, България
Distribution: Ubuntu, Open SUSE
Posts: 212

Rep: Reputation: 35
With AWK and sort (not elegant, as already stated):

Code:
awk -F'http://' '{
  sub(/\/.*/, "", $2)
  _[$2]++
  }
END {
  total = NR                     # save it: "cmd | getline" would also bump NR
  for (h in _) {
    ips = ""
    if (h) {
      cmd = "host " h
      while ((cmd | getline line) > 0)
        if (line ~ /address/) {
          n = split(line, t, " ")
          ips = ips ? ips "," t[n] : t[n]
        }
      close(cmd)                 # must be the exact command string that was opened
    }
    printf "%d\t%.2f\t%-30s\t%s\n",
      _[h], _[h]/total*100, h ? h : "invalid host", ips ? ips : "N/A"
  }
}' file|sort -nr
You can even do the numeric sort inside awk, but it's too much work for nothing.
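
To give an idea of what sorting inside awk involves, here is a rough sketch using a plain insertion sort, which should work in any POSIX awk. The function name `sort_counts_in_awk` is hypothetical, and it assumes one host name per input line rather than the full log format.

```shell
#!/bin/bash
# Sketch: count hosts and print them by descending count, without calling sort.
sort_counts_in_awk() {
  awk '{ count[$1]++ }
  END {
    n = 0
    for (h in count) host[++n] = h      # collect the keys into an indexed array
    # insertion sort on host[], highest count first
    for (i = 2; i <= n; i++) {
      key = host[i]
      for (j = i - 1; j >= 1 && count[host[j]] < count[key]; j--)
        host[j + 1] = host[j]
      host[j + 1] = key
    }
    for (i = 1; i <= n; i++) print count[host[i]], host[i]
  }'
}
```

Hosts with equal counts come out in whatever order awk iterates the array. With gawk 4.0 or later, setting `PROCINFO["sorted_in"] = "@val_num_desc"` before the `for (h in count)` loop should give the same ordering without the hand-written sort, but that is gawk-only.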

Last edited by radoulov; 06-26-2008 at 04:15 PM.
 
Old 06-26-2008, 04:11 PM   #7
Mr. C.
Senior Member
 
Registered: Jun 2008
Posts: 2,529

Rep: Reputation: 59
That's a total of 10 processes + 1 for each host call, and 10 passes through the data, including 3 sorts on the data. Very nasty.

Radoulov - nice work!
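
On the process count: as a sketch only, the counting and percentage steps can probably be folded into one awk pass followed by a single sort. The function name `host_stats` is made up for illustration, and the code assumes the URL is always field 7 with a `scheme://host/path` shape, as in the sample log lines at the top of the thread. It does no host resolution.

```shell
#!/bin/bash
# Sketch: count destination hosts and compute percentages in one awk pass.
# Usage: host_stats access.log
host_stats() {
  awk '{
    url = $7
    sub(/^[a-z]+:\/\//, "", url)   # drop the scheme (http://, https://)
    sub(/\/.*/, "", url)           # drop the path, keeping only the host
    count[url]++
  }
  END {
    # NR is still the total number of log lines here (no getline was used)
    for (h in count)
      printf "%d\t%.0f%%\t%s\n", count[h], count[h] * 100 / NR, h
  }' "$1" | sort -rn
}
```

That is two processes and one pass over the log, and the sort only sees one line per unique host.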
 
  

