I have a pretty ugly bash/awk script that does not do exactly what I'd like it to do.
First off, the program needs to parse 25274 files, with about 30,000 lines each.
The basic structure of the program is:
Code:
for file in ../../*_blah; do
    # determine what file to read
    file=....;
    zcat $file | awk '{ do lots of stuff }'
done
So here are the problems:
- The files I want to read from are compressed
- I cannot temporarily decompress all the files before runtime (500GB+ of compressed data)
- I have arrays of information in each awk run that I want to keep persistent
The last one is very important. I am keeping counts of when I see variables, like counts[$2]++;, and I want to keep this count information available for the next time awk runs. Instead, when it hits the top of the loop, awk of course forgets all its variables.
So what I need is something like this:
Code:
awk '{
    for (all_my_compressed_files) {
        # determine file to read
        # read in compressed file
        do lots of stuff;
    }
}'
I essentially need to remove the individual piping of each file to awk through zcat. That would give me only one instance of awk, so its variables would stay persistent across all the files.
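One way this could work is to concatenate everything into a single pipe and tag each file's data with a marker line, so one awk process sees all the input and its arrays survive from file to file. A rough sketch of the idea (the #FILE marker is an arbitrary tag, and counts[$2]++ just stands in for the real processing):
Code:
for f in "$LEVEL1_DIR"/*_blinc.gz; do
    echo "#FILE $f"    # marker line naming the file that follows
    zcat "$f"          # decompress on the fly, nothing hits the disk
done |
awk '
    $1 == "#FILE" { fname = $2; next }   # remember which file we are inside
    { counts[$2]++ }                     # arrays now persist across all files
    END { for (k in counts) print k, counts[k] }
'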
So here is my *ugly* code... please optimize anything you see that can be optimized. It has to parse a lot of data, so anything helps.
Code:
#!/bin/bash

LEVEL1_DIR=$1
TMP_OUT=.tmp_blinc_fsd_bytes

for blinc in "$LEVEL1_DIR"/*_blinc.gz; do
    gunzip "$blinc"
    blinc=${blinc%.*}
    tmp=${blinc##*/}; tmp=${tmp%%_*}
    file=$(grep "$tmp" ../files)
    echo "$file"

    zcat "$file" |
    awk -v blinc="$blinc" '
    BEGIN {
        # Read the complete _blinc file at once
        # (the > 0 test stops the loop on I/O errors, which return -1)
        while ((getline < blinc) > 0) {
            codeof[$1 "." $2] = $3;
        }
        close(blinc);
    }

    # Process input line
    $1 != "StartTime" && NF == 11 {
        leftaddr = $4;
        rightaddr = $6;
        if (leftaddr in codeof) {
            code = codeof[leftaddr];
            port = leftaddr;
        } else if (rightaddr in codeof) {
            code = codeof[rightaddr];
            port = rightaddr;
        } else {
            next;   # neither address known: skip rather than reuse a stale code/port
        }
        # Extract the port: strip everything up to the last "."
        while (loc = index(port, ".")) {
            port = substr(port, loc + 1);
        }

        # Store the counts
        bytes[code "," $9 + $10]++;
        packets[code "," $7 + $8]++;
        time[code "," $2 - $1]++;
        ports[code "," port]++;
    }

    # Write one frequency file per code; note substr() is 1-based,
    # so the code is item[1 .. loc-1], not item[0 .. loc-1]
    function dump(arr, prefix,    item, loc, code, value) {
        for (item in arr) {
            loc = index(item, ",");
            code = substr(item, 1, loc - 1);
            value = substr(item, loc + 1);
            printf("%.0f %.0f\n", arr[item], value) > (prefix code);
        }
    }

    END {
        dump(bytes,   "fsd_bytes_");
        dump(packets, "fsd_packets_");
        dump(time,    "fsd_time_");
        dump(ports,   "fsd_ports_");
    }
    '
    gzip "$blinc"
done
As you can see, awk re-initializes its variables each time through the loop, and I'm piping the compressed files in with zcat. This is what I need to get rid of: awk needs to read the list of files itself, and somehow read their data uncompressed.
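For what it's worth, awk itself can read a command's output with the "command" | getline form, so a single awk process could open a zcat pipe for each file in turn while its arrays persist the whole run. A minimal sketch, assuming the .gz paths are listed one per line in a hypothetical file named filelist (counts[fld[2]]++ again stands in for the real work):
Code:
awk '
BEGIN {
    # filelist is assumed to hold one .gz path per line
    while ((getline fname < "filelist") > 0) {
        cmd = "zcat " fname;              # read this file uncompressed
        while ((cmd | getline line) > 0) {
            n = split(line, fld);         # fld[1..n] play the role of $1..$NF
            counts[fld[2]]++;             # arrays persist across every file
        }
        close(cmd);                       # essential, or you run out of fds
    }
    for (k in counts) print k, counts[k];
}'
The list itself could be produced beforehand with something like ls "$LEVEL1_DIR"/*_blinc.gz > filelist.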
If you're doing that many files with that many records each, I strongly recommend you use a non-interpreted lang; e.g. Perl is perfect for this.
You can get Perl to zcat each file as you go, so you don't need to worry about disk space.
Quote:
I strongly recommend you use a non-interpreted lang eg Perl is perfect for this.
Unless I'm mistaken, isn't Perl for the most part an interpreted language? I know it can be compiled, but I've yet to see a need to.
Here's an outline of a Perl script that may help you, hedpe.
Code:
#!/usr/bin/perl -w
use strict;

# Be sure to run this in the current directory.
# This gathers all the files in the current directory.
# Change * to filter for certain file names,
# e.g. glob("*_blinc.gz") for the files in this thread.
my @filelist = glob("*");

foreach my $item (@filelist) {
    open(my $fh, "<", $item) or die "Cannot open $item: $!";
    # For compressed files, you could read through zcat instead:
    #   open(my $fh, "-|", "zcat $item") or die "Cannot zcat $item: $!";
    # This will loop until it hits end-of-file.
    while (<$fh>) {
        # This is where all your processing and parsing
        # will take place.
        # I don't know AWK, so I'm afraid I can't really
        # translate the code for you. =(
    }
    close($fh);
}
No, a lot of people make that mistake ...
Actually, what happens is that the perl binary, e.g. /usr/bin/perl, takes your Perl script, compiles it in memory, and then runs the compiled version.
For a full explanation, see http://www.perl.com/doc/FMTEYEWTK/comp-vs-interp.html, but basically you end up with something that runs nearly as fast as a compiled C program.
My off-the-cuff guess is 85-90%.
It's also one process, as opposed to the large number you'd get from the OP's design.
The language itself is like C, but without needing to know pointers (although you can have refs, which are similar), and string issues (i.e. knowing how long to make them, etc.) are taken care of for you.
See my note here: http://www.linuxquestions.org/questi...d.php?t=477970