LinuxQuestions.org - awk speed optimisation

Page 1 of 2

Show 50 post(s) from this thread on one page

- Programming (https://www.linuxquestions.org/questions/programming-9/)

- - awk speed optimisation (https://www.linuxquestions.org/questions/programming-9/awk-speed-optimisation-769773/)

Rusty2727

11-17-2009 05:00 PM

awk speed optimisation

Hi,

I have a simple mawk script which needs speed optimisation.

Its input is a very large ascii file (10Gb or so), which is making things very slow at the moment.

The problem is as follows:

set min=a value
set max=a larger value

The guts of it:

mawk '{if (NR >= '"$min"' && NR <= '"$max"') {print $0}}' largeinput >! smalloutput

I am using mawk as it is supposed to be ultra quick - even quicker than C so people suggest. It has already cut the runtime down by about half compared with gawk.

The variables min and max are actually varying within a while loop, which I havent included. I have isolated the slow part of the routine and it is essentially this single operation.

I was wondering if anybody could think of how to optimise this mawk one liner for speed ???:)

ghostdog74

11-17-2009 05:28 PM

show your entire code. show your input data, describe your output.

cyent

11-17-2009 05:45 PM

Hmm.

How about change...
mawk '{if (NR >= '"$min"' && NR <= '"$max"') {print $0}}'

to something like..
mawk 'NR >= $min && NR <= $max'

The usual awk thing is "pattern action"

If pattern matches, do action.

If there is no pattern, awk does the action everytime.

If there is a pattern and no action, the default action is print the record.

You have no pattern, so you do the action every time.

The sad thing is your action is "if inrange print record"

So what I'm proposing is drop the action (the default is what you want) and make the expression your pattern.

My .signature really does apply to this case!

H_TeXMeX_H

11-18-2009 06:12 AM

If you want speed, why not write it in C ? It must be in awk ?

ghostdog74

11-18-2009 06:27 AM

Quote:

Originally Posted by H_TeXMeX_H (Post 3761179)

If you want speed, why not write it in C ? It must be in awk ?

mawk is already a "C" program. It is "reputed" to be faster than C, or so it seems. the problem should most probably be the other parts of his code (that while loop he mentioned?)

H_TeXMeX_H

11-18-2009 07:09 AM

Yeah, I guess we need to see the code.

Rusty2727

11-18-2009 07:38 AM

awk mawk C

Thanks for all your replies they are much appreciated.

The input file is just a two column ascii list e.g.

818 4.35787
819 4.46768
820 4.58789

total number of rows is 6348780. Total file size is 88544513 (OK I over exagerated file size initially it still slow though)

I tried cyent's suggestion which works fine but doesn't seem to provide a significant performance uplift. I kind of get what he is saying though.

The variables min and max are actually defined by looking in another list which tells me the specific row numbers required.

I am pretty sure that the performance is nothing to do with the while loop. This mawk routine is the slow down point as I have put echo statements around each procedure in the loop and you can clearly see the computer hanging while it executes the stated mawk.

Plus you can see the runtime is large on this mawk routine (Takes approx 3 seconds) if you just run this single procedure on the command line.

mawk 'NR >= '"$min"' && NR <= '"$max"'' largeinput >! smalloutput

I have even tried putting actual values in the mawk statement to avoid the internal variables i.e.

mawk 'NR >= 90 && NR <= 180 largeinput >! smalloutput

This does not help either.

An alternative is to sort the input list with the following:

sort <largeinput -n +0 +1 >! sortedlargeinput

Then each time I just need to take the first 90 rows so could use someting like this.

mawk 'NR <= 90' sortedlargeinput >! small output

But what gets me is this isn't any faster either despite the fact that mawk should just terminate the procedure once it has got the first 90 rows which cant take it that long.

I therefore presume that the slowdown is due to the fact that the input must be loaded into memory. Once this is done (which takes the majority of time) the actual details given in the mawk routine dont take very long ?????

So is mawk any faster than C ???? Any suggestions on how to remedy this would be much appreciated.

Thanks for your input so far.

cyent

11-18-2009 02:10 PM

These days the memory hierarchy dominates. Registers / L1 Cache / L2 Cache / Ram / Disk each level can be several orders of magnitude slower.

ie. Whatever you are doing, whatever your algorithm / language / .... the speed is dominated by things like whether the data is in a RAM buffer or on disk.

When benchmarking programs on linux systems I find cache effects
dominate.

ie. On current systems the difference in speed between RAM and disk is
so vast... that 2 orders of magnitude differences in algorithms can be
swallowed entirely by whether the data is in a ram disk buffer or on
disk.

So linux has a way of flushing clean caches....

echo 3 > /proc/sys/vm/drop_caches

For example... cat the cat program to /dev/null twice to make the
cache "hot", measure the time on a "hot" cache.

Then sync and drop the caches and do it on a cold cache.

cat /bin/cat > /dev/null;time cat /bin/cat > /dev/null;sync;sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches';time cat /bin/cat > /dev/null

real 0m0.002s
user 0m0.000s
sys 0m0.004s

real 0m0.172s
user 0m0.000s
sys 0m0.008s

86x slower!

Oh yes, don't forget the other side... The job is not done until it's actually flushed to disk. So add a "sync" or two at the end of your commands when you are benchmarking.

So try that.

sync then drop the caches then just "cat" the entire file to another file and sync. Compare the "wall clock" time for that versus mawk. I suspect you will find the real bottleneck is just shuffling things off your drive.

The other thing you can do is when NR goes above your upper bound, exit!

catkin

11-19-2009 01:14 AM

:twocents: how about ... the awk family works on pattern match followed by action. The algorithm does not need pattern matching. So ... might it be quicker to do it all in the /BEGIN/ section, using a getline loop?

H_TeXMeX_H

11-19-2009 05:30 AM

NOTE: See below for the right program, this one calculates min and max from a list, not what you want it seems.

Well, if you can use C, I wrote this program a while ago to get minimum and maximum of a list (someone else here asked for it):

Code:

// calculates min and max for input file

// not much input checking is done, so use this program wisely, don't throw it crap or it will give you crap



#include <stdio.h>

#include <stdlib.h>



int main(int argc, char *argv[])

{

        FILE * file;

        int num, max, min;

        

        // make sure we have only one input parameter, this includes the name of the program run, thus 2

        if ( 2 != argc )

        {

                fputs ("ERROR: Need exactly 1 argument !\n",stderr);

                printf("Usage:\t minmax file\n");

                exit(1);

        }

        

        // open file for reading

        file = fopen ( argv[1] , "r" );

        if ( file == NULL ) { fputs ("ERROR: failed to open file !\n",stderr); exit (1); }

        

        // get num from file

        fscanf (file, "%d", &num);

        

        // set starting values

        min = max = num;

        

        // read numbers until EOF, setting min and max as we go

        while ( !feof(file) )

        {

                if ( num < min )

                {

                        min = num;

                }

                else if ( num > max )

                {

                        max = num;

                }

                fscanf (file, "%d", &num);

        }

        

        // print min and max

        printf("min=\t%d\n", min);

        printf("max=\t%d\n", max);

        

        // close file

        fclose (file);

        

        return 0;

}

This only works with one column, but can be modified to work with more, or you can use another utility to fix the file beforehand. I doubt even mawk could beat C, I say run a few benchmarks.

Rusty2727

11-19-2009 07:29 AM

Thanks for C solution TexMeX

Also thanks Cyent - although you lost me a bit here in the tech talk but I think I know what you are getting at.

cyent

11-19-2009 02:27 PM

Alas.. that C program doesn't do what the mawk does.

The C computes the min and max of a list of numbers.

The Mawk writes out all lines with line number greater and equal to min and less than and equal to max.

PTrenholme

11-19-2009 04:33 PM

If you're extracting rows from the same file for different row sets, I think you'd be much better off to load the data into a data base (e.g., sqlite), create an index on the row number, and select the rows you want.

cyent

11-19-2009 06:11 PM

Quote:

Originally Posted by PTrenholme (Post 3763193)

Hmm. for data mining gigabytes of data NOSQL is one of my favourites and fits neatly into what he is doing already.

H_TeXMeX_H

11-20-2009 04:20 AM

Quote:

Originally Posted by cyent (Post 3763065)

oops, sorry, I was tired, here's the correct one :)

PHP Code:




// prints out the lines between and inculding the bounds given

// maximum length of a line
#define STR_MAX 100

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    FILE * file;
    char line[STR_MAX];
    unsigned int min, max;
    unsigned int i = 1;

    // check arguments
    if ( 4 != argc )
    {
        fputs ("ERROR: Need exactly 3 arguments !\n",stderr);
        printf("Usage:\t mm min max file\n");
        exit(1);
    }

    // open file for reading
    file = fopen ( argv[3] , "r" );
    if ( file == NULL ) { fputs ("ERROR: failed to open file !\n",stderr); exit (1); }

    // read min and max from arguments
    sscanf(argv[1], "%u", &min);
    sscanf(argv[2], "%u", &max);

    // print lines until EOF
    while ( fgets (line , STR_MAX , file) && i <= max )
    {
        if ( i >= min )
        {
            printf("%s",line);
        }
        i ++;
    }

    // close file
    fclose (file);

    return 0;
}

From my tests it's about 4x faster than awk, but I don't know about mawk. (make sure to optimize for your processor and run 'strip --strip-unneeded' on it)

Also, take note of the STR_MAX variable, it is set to the maximum length of a line, make sure you set it right.

All times are GMT -5. The time now is 10:57 AM.

Page 1 of 2

Show 50 post(s) from this thread on one page