Old 08-28-2006, 02:18 AM   #1
hedpe
Member
 
Registered: Jan 2005
Location: Pittsburgh
Distribution: Ubuntu
Posts: 378

Rep: Reputation: 30
need help fixing and optimizing AWK script


Hey guys,

I have a pretty ugly bash/awk script that doesn't quite do what I'd like it to do.

First off, the program needs to parse 25,274 files, with about 30,000 lines each.

The basic structure of the program is:
Code:
for file in ../../*_blah; do
    # determine what file to read
    file=....;
    zcat "$file" | awk '{ do lots of stuff }'
done
So here are the problems:
- The files I want to read from are compressed
- I cannot temporarily decompress all the files before runtime (500GB+ of compressed data)
- I have arrays of information in each awk run, that I want to keep persistent

The last one is very important: I am keeping counts of the values I see, like counts[$2]++, and I want to keep this count information available for the next time awk runs. Instead, when the loop returns to the top, awk of course forgets all its variables.

So what I need is something like this:
Code:
awk '{
    for(all_my_compressed_files) {
        # determine file to read
        # read in compressed file
        do lots of stuff;
    }
}'
I essentially need to stop piping each file individually to awk through zcat. That would let me run only one instance of awk, and the variables would be persistent.

Here is a typical line of input:
Code:
1104969276 1104969276  udp     0.2.132.134.54446   ->     97.153.58.99.21501 1        0         65           0           INT
So here is my *ugly* code... please optimize anything you see that can be optimized. It has to parse a lot of data, so anything helps.
Code:
#!/bin/bash
LEVEL1_DIR=$1
TMP_OUT=.tmp_blinc_fsd_bytes

for blinc in "$LEVEL1_DIR"/*_blinc.gz; do

  gunzip "$blinc"
  blinc=${blinc%.*}
  tmp=${blinc##*/}; tmp=${tmp%%_*}
  file=$(grep "$tmp" ../files)
  echo "$file"

  zcat "$file" |
      awk -v blinc="$blinc" '
          # Write "count value" lines to <prefix><code>, one file per code
          function dump(arr, prefix,    item, loc, code, value) {
              for (item in arr) {
                  loc = index(item, ",");
                  code = substr(item, 1, loc-1);   # awk strings start at 1, not 0
                  value = substr(item, loc+1);
                  printf("%.0f %.0f\n", arr[item], value) > (prefix code);
              }
          }

          BEGIN {
              # Read the complete _blinc file at once
              # (the > 0 test stops the loop on a read error as well as at EOF)
              while ((getline < blinc) > 0) {
                  codeof[$1 "." $2] = $3;
              }
              close(blinc);
          }

          # Process input line
          $1 != "StartTime" && NF == 11 {
              if ($4 in codeof) {
                  addr = $4;
              } else if ($6 in codeof) {
                  addr = $6;
              } else {
                  next;   # neither endpoint is known; do not reuse stale code/port
              }
              code = codeof[addr];

              # Extract the port (everything after the last dot)
              port = addr;
              sub(/.*\./, "", port);

              # Need to store counts
              bytes[code "," ($9+$10)]++;
              packets[code "," ($7+$8)]++;
              time[code "," ($2-$1)]++;
              ports[code "," port]++;
          }

          END {
              dump(bytes, "fsd_bytes_");
              dump(packets, "fsd_packets_");
              dump(time, "fsd_time_");
              dump(ports, "fsd_ports_");
          }
      '

  gzip "$blinc"

done
As you can see, awk re-initializes its variables each time through the loop, and I'm piping the compressed files in with zcat. This is what I need to get rid of: awk needs to somehow read the list of files itself and read their data decompressed.

I'd greatly appreciate any help.

Thanks!
George
 
Old 08-28-2006, 03:51 AM   #2
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,311

Rep: Reputation: 2040
If you're doing that many files with that many recs each, I strongly recommend you use a non-interpreted lang eg Perl is perfect for this.
You can get Perl to zcat each file as you go, so you don't need to worry about disk space.
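The decompress-as-you-go idea is not specific to Perl. As a minimal sketch (the function name count_second_field is made up for illustration, and counts[$2]++ stands in for the real processing), a shell loop can stream every file into a single awk process, so the arrays survive from one file to the next and nothing is unpacked on disk:

```shell
#!/bin/bash
# Sketch: decompress each file to stdout and feed the combined stream to
# ONE awk process. count_second_field is a hypothetical name.
count_second_field() {
    local f
    for f in "$@"; do
        gzip -dc "$f"        # equivalent to zcat; writes to stdout only
    done |
    awk '
        { counts[$2]++ }     # one awk instance, so counts persist across files
        END { for (k in counts) print k, counts[k] }
    ' | sort
}
```

Called as count_second_field ../../*_blah, it prints each distinct second field with its total over all files. Because awk sees one undifferentiated stream, this form does not cover a per-file lookup table like the _blinc files in the original script.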
 
Old 08-28-2006, 09:36 AM   #3
hedpe
Member
 
Registered: Jan 2005
Location: Pittsburgh
Distribution: Ubuntu
Posts: 378

Original Poster
Rep: Reputation: 30
Now, if I only knew Perl *said like "if I only had a brain"*

It shouldn't be too hard, so I'll try to get started on it... If anyone feels like helping me port any of it, feel free to share lines/functions.

Last edited by hedpe; 08-28-2006 at 09:49 AM.
 
Old 08-28-2006, 01:24 PM   #4
indienick
Senior Member
 
Registered: Dec 2005
Location: London, ON, Canada
Distribution: Arch, Ubuntu, Slackware, OpenBSD, FreeBSD
Posts: 1,853

Rep: Reputation: 65
Quote:
Originally Posted by chrism01
I strongly recommend you use a non-interpreted lang eg Perl is perfect for this.
Unless I'm mistaken, isn't Perl for the most part an interpreted language? I know it can be compiled, but I've yet to see a need to.

Here's an outline of a Perl script that may help you, hedpe.
Code:
#!/usr/bin/perl
use strict;
use warnings;
# Be sure to run this in the directory that holds the data files.

# This gathers all the files in the current directory.
# Change * to filter for certain file names.
my @filelist = glob("*");

foreach my $item (@filelist) {
   next unless -f $item;   # skip directories and other non-files
   open(my $fh, '<', $item) or die "Cannot open $item: $!";
   # This will loop until it reaches the end of the file.
   while (my $line = <$fh>) {
      # This is where all your processing and parsing
      # will take place.
      # I don't know AWK, so I'm afraid I can't really
      # translate the code for you. =(
   }
   close($fh);
}

Last edited by indienick; 08-28-2006 at 01:25 PM.
 
Old 08-28-2006, 11:52 PM   #5
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,311

Rep: Reputation: 2040
No, a lot of people make that mistake ...
Actually, what happens is that the perl program, e.g. /usr/bin/perl, takes your Perl script, compiles it in memory, and then runs the compiled version.
For a full explanation, see http://www.perl.com/doc/FMTEYEWTK/comp-vs-interp.html, but basically you end up with something that runs nearly as fast as a compiled C program.
My off-the-cuff guess is 85-90% of C's speed.
It's also one process, as opposed to the large number you'd get from the OP's design.
The language itself is like C, but without needing to know pointers (although you can have references, which are similar), and string issues (i.e. knowing how long to make them, etc.) are taken care of for you.
See my note here: http://www.linuxquestions.org/questi...d.php?t=477970
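For comparison, a single process is also reachable without leaving awk. The sketch below assumes a list of "lookup-file data-file" pairs on standard input; that list format and the name process_pairs are hypothetical, and the file names must not contain whitespace or shell metacharacters. awk decompresses each file itself with command | getline, rebuilds the per-file lookup table, and keeps the counts across all pairs (only the byte counts are shown, for brevity):

```shell
#!/bin/bash
# Sketch: one awk process, fed "lookup.gz data.gz" pairs on stdin.
# process_pairs and the pair-list format are assumptions for illustration.
process_pairs() {
    awk '
    {
        lookup = $1; data = $2
        split("", codeof)                    # portable way to empty the array

        # Load this pair'"'"'s lookup table straight from the compressed file
        cmd = "gzip -dc \"" lookup "\""
        while ((cmd | getline line) > 0) {
            split(line, f, " ")
            codeof[f[1] "." f[2]] = f[3]
        }
        close(cmd)

        # Stream the compressed data file; nothing is written to disk
        cmd = "gzip -dc \"" data "\""
        while ((cmd | getline line) > 0) {
            n = split(line, f, " ")
            if (n != 11 || f[1] == "StartTime") continue
            if (f[4] in codeof)      code = codeof[f[4]]
            else if (f[6] in codeof) code = codeof[f[6]]
            else continue
            bytes[code "," (f[9] + f[10])]++   # persists across all pairs
        }
        close(cmd)
    }
    END { for (k in bytes) print k, bytes[k] }
    ' | sort
}
```

The pair list could be produced by the same basename-and-grep logic the original loop uses, printing the two names instead of launching awk. Each close(cmd) matters here: with tens of thousands of files, leaving the pipes open would exhaust the available file descriptors.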
 
Old 08-29-2006, 09:30 AM   #6
indienick
Senior Member
 
Registered: Dec 2005
Location: London, ON, Canada
Distribution: Arch, Ubuntu, Slackware, OpenBSD, FreeBSD
Posts: 1,853

Rep: Reputation: 65
Ohhh ok. Thanks for that chrism01.
My dad's Perl book has lied to me then.
 
  

