I have a pretty ugly bash/awk script that does not do exactly what I'd like it to do.
First off, the program needs to parse 25274 files, with about 30,000 lines each.
The basic structure of the program is:
Code:
for file in ../../*_blah; do
    # determine what file to read
    file=....;
    zcat $file | awk '{ do lots of stuff }'
done
So here are the problems:
- The files I want to read from are compressed
- I cannot temporarily decompress all the files before runtime (500GB+ of compressed data)
- I have arrays of information in each awk run that I want to keep persistent
The last one is very important. I am keeping counts of when I see variables, like counts[$2]++;, and I want to keep this count information available for the next time awk runs. Instead, when it hits the top of the loop, awk of course forgets all its variables.
So what I need is something like this:
Code:
awk '{
    for (all_my_compressed_files) {
        # determine file to read
        # read in compressed file
        do lots of stuff;
    }
}'
I essentially need to remove the individual piping of each file to awk through zcat. That would give me only one instance of awk, so its variables would stay persistent across all the files.
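One way this could work is to concatenate everything into a single pipe and tag each file's data with a marker line, so one awk process sees all the input and its arrays survive from file to file. A rough sketch of the idea (the #FILE marker is an arbitrary tag, and counts[$2]++ just stands in for the real processing):
Code:
for f in "$LEVEL1_DIR"/*_blinc.gz; do
    echo "#FILE $f"    # marker line naming the file that follows
    zcat "$f"          # decompress on the fly, nothing hits the disk
done |
awk '
    $1 == "#FILE" { fname = $2; next }   # remember which file we are inside
    { counts[$2]++ }                     # arrays now persist across all files
    END { for (k in counts) print k, counts[k] }
'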
So here is my *ugly* code... please optimize anything you see that can be optimized. It has to parse a lot of data, so anything helps.
Code:
#!/bin/bash

LEVEL1_DIR=$1
TMP_OUT=.tmp_blinc_fsd_bytes

for blinc in "$LEVEL1_DIR"/*_blinc.gz; do
    gunzip "$blinc"
    blinc=${blinc%.*}
    tmp=${blinc##*/}; tmp=${tmp%%_*}
    file=$(grep "$tmp" ../files)
    echo "$file"

    zcat "$file" |
    awk -v blinc="$blinc" '
    BEGIN {
        # Read the complete _blinc file at once
        # (the > 0 test stops the loop on I/O errors, which return -1)
        while ((getline < blinc) > 0) {
            codeof[$1 "." $2] = $3;
        }
        close(blinc);
    }

    # Process input line
    $1 != "StartTime" && NF == 11 {
        leftaddr = $4;
        rightaddr = $6;
        if (leftaddr in codeof) {
            code = codeof[leftaddr];
            port = leftaddr;
        } else if (rightaddr in codeof) {
            code = codeof[rightaddr];
            port = rightaddr;
        } else {
            next;   # neither address known: skip rather than reuse a stale code/port
        }
        # Extract the port: strip everything up to the last "."
        while (loc = index(port, ".")) {
            port = substr(port, loc + 1);
        }

        # Store the counts
        bytes[code "," $9 + $10]++;
        packets[code "," $7 + $8]++;
        time[code "," $2 - $1]++;
        ports[code "," port]++;
    }

    # Write one frequency file per code; note substr() is 1-based,
    # so the code is item[1 .. loc-1], not item[0 .. loc-1]
    function dump(arr, prefix,    item, loc, code, value) {
        for (item in arr) {
            loc = index(item, ",");
            code = substr(item, 1, loc - 1);
            value = substr(item, loc + 1);
            printf("%.0f %.0f\n", arr[item], value) > (prefix code);
        }
    }

    END {
        dump(bytes,   "fsd_bytes_");
        dump(packets, "fsd_packets_");
        dump(time,    "fsd_time_");
        dump(ports,   "fsd_ports_");
    }
    '
    gzip "$blinc"
done
As you can see, awk re-initializes its variables each time through the loop, and I'm piping the compressed files in with zcat. This is what I need to get rid of: awk needs to read the list of files itself, and somehow read their data uncompressed.
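For what it's worth, awk itself can read a command's output with the "command" | getline form, so a single awk process could open a zcat pipe for each file in turn while its arrays persist the whole run. A minimal sketch, assuming the .gz paths are listed one per line in a hypothetical file named filelist (counts[fld[2]]++ again stands in for the real work):
Code:
awk '
BEGIN {
    # filelist is assumed to hold one .gz path per line
    while ((getline fname < "filelist") > 0) {
        cmd = "zcat " fname;              # read this file uncompressed
        while ((cmd | getline line) > 0) {
            n = split(line, fld);         # fld[1..n] play the role of $1..$NF
            counts[fld[2]]++;             # arrays persist across every file
        }
        close(cmd);                       # essential, or you run out of fds
    }
    for (k in counts) print k, counts[k];
}'
The list itself could be produced beforehand with something like ls "$LEVEL1_DIR"/*_blinc.gz > filelist.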
If you're doing that many files with that many records each, I strongly recommend you use a non-interpreted lang; e.g. Perl is perfect for this.
You can get Perl to zcat each file as you go, so you don't need to worry about disk space.
Quote:
I strongly recommend you use a non-interpreted lang eg Perl is perfect for this.
Unless I'm mistaken, isn't Perl for the most part an interpreted language? I know it can be compiled, but I've yet to see a need to.
Here's an outline of a Perl script that may help you, hedpe.
Code:
#!/usr/bin/perl -w
use strict;

# Be sure to run this in the current directory.
# This gathers all the files in the current directory.
# Change * to filter for certain file names,
# e.g. glob("*_blinc.gz") for the files in this thread.
my @filelist = glob("*");

foreach my $item (@filelist) {
    open(my $fh, "<", $item) or die "Cannot open $item: $!";
    # For compressed files, you could read through zcat instead:
    #   open(my $fh, "-|", "zcat $item") or die "Cannot zcat $item: $!";
    # This will loop until it hits end-of-file.
    while (<$fh>) {
        # This is where all your processing and parsing
        # will take place.
        # I don't know AWK, so I'm afraid I can't really
        # translate the code for you. =(
    }
    close($fh);
}
No, a lot of people make that mistake ...
Actually, what happens is that the perl binary, e.g. /usr/bin/perl, takes your Perl script, compiles it in memory, and then runs the compiled version.
For a full explanation, see http://www.perl.com/doc/FMTEYEWTK/comp-vs-interp.html, but basically you end up with something that runs nearly as fast as a compiled C program.
My off-the-cuff guess is 85-90%.
It's also one process, as opposed to the large number you'd get from the OP's design.
The language itself is like C, but without needing to know pointers (although you can have refs, which are similar), and string issues (i.e. knowing how long to make them, etc.) are taken care of for you.
See my note here: http://www.linuxquestions.org/questi...d.php?t=477970