LinuxQuestions.org
Visit the LQ Articles and Editorials section
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices



Reply
 
Search this Thread
Old 11-17-2009, 06:00 PM   #1
Rusty2727
LQ Newbie
 
Registered: Nov 2009
Posts: 4

Rep: Reputation: 0
Question awk speed optimisation


Hi,

I have a simple mawk script which needs speed optimisation.

Its input is a very large ascii file (10Gb or so), which is making things very slow at the moment.

The problem is as follows:

set min=a value
set max=a larger value

The guts of it:

mawk '{if (NR >= '"$min"' && NR <= '"$max"') {print $0}}' largeinput >! smalloutput

I am using mawk as it is supposed to be ultra quick - even quicker than C so people suggest. It has already cut the runtime down by about half compared with gawk.

The variables min and max are actually varying within a while loop, which I havent included. I have isolated the slow part of the routine and it is essentially this single operation.

I was wondering if anybody could think of how to optimise this mawk one liner for speed ???
 
Old 11-17-2009, 06:28 PM   #2
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 241Reputation: 241Reputation: 241
show your entire code. show your input data, describe your output.
 
Old 11-17-2009, 06:45 PM   #3
cyent
Member
 
Registered: Aug 2001
Location: ChristChurch New Zealand
Distribution: Ubuntu
Posts: 273

Rep: Reputation: 48
Hmm.

How about change...
mawk '{if (NR >= '"$min"' && NR <= '"$max"') {print $0}}'

to something like..
mawk 'NR >= $min && NR <= $max'

The usual awk thing is "pattern action"

If pattern matches, do action.

If there is no pattern, awk does the action everytime.

If there is a pattern and no action, the default action is print the record.

You have no pattern, so you do the action every time.

The sad thing is your action is "if inrange print record"

So what I'm proposing is drop the action (the default is what you want) and make the expression your pattern.

My .signature really does apply to this case!
 
Old 11-18-2009, 07:12 AM   #4
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269
If you want speed, why not write it in C ? It must be in awk ?
 
Old 11-18-2009, 07:27 AM   #5
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 241Reputation: 241Reputation: 241
Quote:
Originally Posted by H_TeXMeX_H View Post
If you want speed, why not write it in C ? It must be in awk ?
mawk is already a "C" program. It is "reputed" to be faster than C, or so it seems. the problem should most probably be the other parts of his code (that while loop he mentioned?)
 
Old 11-18-2009, 08:09 AM   #6
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269
Yeah, I guess we need to see the code.
 
Old 11-18-2009, 08:38 AM   #7
Rusty2727
LQ Newbie
 
Registered: Nov 2009
Posts: 4

Original Poster
Rep: Reputation: 0
awk mawk C

Thanks for all your replies they are much appreciated.

The input file is just a two column ascii list e.g.

818 4.35787
819 4.46768
820 4.58789

total number of rows is 6348780. Total file size is 88544513 (OK I over exagerated file size initially it still slow though)

I tried cyent's suggestion which works fine but doesn't seem to provide a significant performance uplift. I kind of get what he is saying though.

The variables min and max are actually defined by looking in another list which tells me the specific row numbers required.

I am pretty sure that the performance is nothing to do with the while loop. This mawk routine is the slow down point as I have put echo statements around each procedure in the loop and you can clearly see the computer hanging while it executes the stated mawk.

Plus you can see the runtime is large on this mawk routine (Takes approx 3 seconds) if you just run this single procedure on the command line.

mawk 'NR >= '"$min"' && NR <= '"$max"'' largeinput >! smalloutput

I have even tried putting actual values in the mawk statement to avoid the internal variables i.e.

mawk 'NR >= 90 && NR <= 180 largeinput >! smalloutput

This does not help either.

An alternative is to sort the input list with the following:

sort <largeinput -n +0 +1 >! sortedlargeinput

Then each time I just need to take the first 90 rows so could use someting like this.

mawk 'NR <= 90' sortedlargeinput >! small output

But what gets me is this isn't any faster either despite the fact that mawk should just terminate the procedure once it has got the first 90 rows which cant take it that long.

I therefore presume that the slowdown is due to the fact that the input must be loaded into memory. Once this is done (which takes the majority of time) the actual details given in the mawk routine dont take very long ?????

So is mawk any faster than C ???? Any suggestions on how to remedy this would be much appreciated.

Thanks for your input so far.
 
Old 11-18-2009, 03:10 PM   #8
cyent
Member
 
Registered: Aug 2001
Location: ChristChurch New Zealand
Distribution: Ubuntu
Posts: 273

Rep: Reputation: 48
These days the memory hierarchy dominates. Registers / L1 Cache / L2 Cache / Ram / Disk each level can be several orders of magnitude slower.

ie. Whatever you are doing, whatever your algorithm / language / .... the speed is dominated by things like whether the data is in a RAM buffer or on disk.

When benchmarking programs on linux systems I find cache effects
dominate.

ie. On current systems the difference in speed between RAM and disk is
so vast... that 2 orders of magnitude differences in algorithms can be
swallowed entirely by whether the data is in a ram disk buffer or on
disk.

So linux has a way of flushing clean caches....

echo 3 > /proc/sys/vm/drop_caches

For example... cat the cat program to /dev/null twice to make the
cache "hot", measure the time on a "hot" cache.

Then sync and drop the caches and do it on a cold cache.

cat /bin/cat > /dev/null;time cat /bin/cat > /dev/null;sync;sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches';time cat /bin/cat > /dev/null

real 0m0.002s
user 0m0.000s
sys 0m0.004s

real 0m0.172s
user 0m0.000s
sys 0m0.008s

86x slower!

Oh yes, don't forget the other side... The job is not done until it's actually flushed to disk. So add a "sync" or two at the end of your commands when you are benchmarking.

So try that.

sync then drop the caches then just "cat" the entire file to another file and sync. Compare the "wall clock" time for that versus mawk. I suspect you will find the real bottleneck is just shuffling things off your drive.

The other thing you can do is when NR goes above your upper bound, exit!
 
Old 11-19-2009, 02:14 AM   #9
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Servers: Debian Squeeze and Wheezy. Desktop: Slackware64 14.0. Netbook: Slackware 13.37
Posts: 8,563
Blog Entries: 29

Rep: Reputation: 1179Reputation: 1179Reputation: 1179Reputation: 1179Reputation: 1179Reputation: 1179Reputation: 1179Reputation: 1179Reputation: 1179
how about ... the awk family works on pattern match followed by action. The algorithm does not need pattern matching. So ... might it be quicker to do it all in the /BEGIN/ section, using a getline loop?
 
Old 11-19-2009, 06:30 AM   #10
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269
NOTE: See below for the right program, this one calculates min and max from a list, not what you want it seems.

Well, if you can use C, I wrote this program a while ago to get minimum and maximum of a list (someone else here asked for it):

Code:
// calculates min and max for input file
// not much input checking is done, so use this program wisely, don't throw it crap or it will give you crap

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	FILE * file;
	int num, max, min;
	
	// make sure we have only one input parameter, this includes the name of the program run, thus 2
	if ( 2 != argc )
	{
		fputs ("ERROR: Need exactly 1 argument !\n",stderr);
		printf("Usage:\t minmax file\n");
		exit(1);
	}
	
	// open file for reading
	file = fopen ( argv[1] , "r" );
	if ( file == NULL ) { fputs ("ERROR: failed to open file !\n",stderr); exit (1); }
	
	// get num from file
	fscanf (file, "%d", &num);
	
	// set starting values
	min = max = num;
	
	// read numbers until EOF, setting min and max as we go
	while ( !feof(file) )
	{
		if ( num < min )
		{
			min = num;
		}
		else if ( num > max )
		{
			max = num;
		}
		fscanf (file, "%d", &num);
	}
	
	// print min and max
	printf("min=\t%d\n", min);
	printf("max=\t%d\n", max);
	
	// close file
	fclose (file);
	
	return 0;
}
This only works with one column, but can be modified to work with more, or you can use another utility to fix the file beforehand. I doubt even mawk could beat C, I say run a few benchmarks.

Last edited by H_TeXMeX_H; 11-20-2009 at 05:24 AM.
 
Old 11-19-2009, 08:29 AM   #11
Rusty2727
LQ Newbie
 
Registered: Nov 2009
Posts: 4

Original Poster
Rep: Reputation: 0
Thanks for C solution TexMeX

Also thanks Cyent - although you lost me a bit here in the tech talk but I think I know what you are getting at.
 
Old 11-19-2009, 03:27 PM   #12
cyent
Member
 
Registered: Aug 2001
Location: ChristChurch New Zealand
Distribution: Ubuntu
Posts: 273

Rep: Reputation: 48
Alas.. that C program doesn't do what the mawk does.

The C computes the min and max of a list of numbers.

The Mawk writes out all lines with line number greater and equal to min and less than and equal to max.
 
Old 11-19-2009, 05:33 PM   #13
PTrenholme
Senior Member
 
Registered: Dec 2004
Location: Olympia, WA, USA
Distribution: Fedora, (K)Ubuntu
Posts: 4,154

Rep: Reputation: 333Reputation: 333Reputation: 333Reputation: 333
If you're extracting rows from the same file for different row sets, I think you'd be much better off to load the data into a data base (e.g., sqlite), create an index on the row number, and select the rows you want.
 
Old 11-19-2009, 07:11 PM   #14
cyent
Member
 
Registered: Aug 2001
Location: ChristChurch New Zealand
Distribution: Ubuntu
Posts: 273

Rep: Reputation: 48
Quote:
Originally Posted by PTrenholme View Post
If you're extracting rows from the same file for different row sets, I think you'd be much better off to load the data into a data base (e.g., sqlite), create an index on the row number, and select the rows you want.
Hmm. for data mining gigabytes of data NOSQL is one of my favourites and fits neatly into what he is doing already.
 
Old 11-20-2009, 05:20 AM   #15
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269Reputation: 1269
Quote:
Originally Posted by cyent View Post
Alas.. that C program doesn't do what the mawk does.

The C computes the min and max of a list of numbers.

The Mawk writes out all lines with line number greater and equal to min and less than and equal to max.
oops, sorry, I was tired, here's the correct one

PHP Code:
// prints out the lines between and inculding the bounds given

// maximum length of a line
#define STR_MAX 100

#include <stdio.h>
#include <stdlib.h>

int main(int argcchar *argv[])
{
    
FILE file;
    
char line[STR_MAX];
    
unsigned int minmax;
    
unsigned int i 1;

    
// check arguments
    
if ( != argc )
    {
        
fputs ("ERROR: Need exactly 3 arguments !\n",stderr);
        
printf("Usage:\t mm min max file\n");
        exit(
1);
    }

    
// open file for reading
    
file fopen argv[3] , "r" );
    if ( 
file == NULL ) { fputs ("ERROR: failed to open file !\n",stderr); exit (1); }

    
// read min and max from arguments
    
sscanf(argv[1], "%u", &min);
    
sscanf(argv[2], "%u", &max);

    
// print lines until EOF
    
while ( fgets (line STR_MAX file) && <= max )
    {
        if ( 
>= min )
        {
            
printf("%s",line);
        }
        
++;
    }

    
// close file
    
fclose (file);

    return 
0;

From my tests it's about 4x faster than awk, but I don't know about mawk. (make sure to optimize for your processor and run 'strip --strip-unneeded' on it)

Also, take note of the STR_MAX variable, it is set to the maximum length of a line, make sure you set it right.

Last edited by H_TeXMeX_H; 11-20-2009 at 05:34 AM.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
ext3 optimisation STARHARVEST Linux - General 3 03-22-2008 11:24 PM
Performance optimisation on the notebooks SuSE_Lamer Debian 4 05-02-2006 01:57 PM
My own server - bandwidth use optimisation eomerek Linux - Networking 3 02-12-2006 05:20 PM
-march optimisation for kernel BenPope Linux - General 2 09-03-2003 12:36 PM
Compiling (GCC) - Optimisation anonE9 Programming 5 03-15-2003 09:20 PM


All times are GMT -5. The time now is 02:24 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration