ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
I am using mawk as it is supposed to be ultra quick - even quicker than C so people suggest. It has already cut the runtime down by about half compared with gawk.
The variables min and max are actually varying within a while loop, which I havent included. I have isolated the slow part of the routine and it is essentially this single operation.
I was wondering if anybody could think of how to optimise this mawk one liner for speed ???
If you want speed, why not write it in C ? It must be in awk ?
mawk is already a "C" program. It is "reputed" to be faster than C, or so it seems. the problem should most probably be the other parts of his code (that while loop he mentioned?)
Thanks for all your replies they are much appreciated.
The input file is just a two column ascii list e.g.
818 4.35787
819 4.46768
820 4.58789
total number of rows is 6348780. Total file size is 88544513 (OK I over exagerated file size initially it still slow though)
I tried cyent's suggestion which works fine but doesn't seem to provide a significant performance uplift. I kind of get what he is saying though.
The variables min and max are actually defined by looking in another list which tells me the specific row numbers required.
I am pretty sure that the performance is nothing to do with the while loop. This mawk routine is the slow down point as I have put echo statements around each procedure in the loop and you can clearly see the computer hanging while it executes the stated mawk.
Plus you can see the runtime is large on this mawk routine (Takes approx 3 seconds) if you just run this single procedure on the command line.
An alternative is to sort the input list with the following:
sort <largeinput -n +0 +1 >! sortedlargeinput
Then each time I just need to take the first 90 rows so could use someting like this.
mawk 'NR <= 90' sortedlargeinput >! small output
But what gets me is this isn't any faster either despite the fact that mawk should just terminate the procedure once it has got the first 90 rows which cant take it that long.
I therefore presume that the slowdown is due to the fact that the input must be loaded into memory. Once this is done (which takes the majority of time) the actual details given in the mawk routine dont take very long ?????
So is mawk any faster than C ???? Any suggestions on how to remedy this would be much appreciated.
These days the memory hierarchy dominates. Registers / L1 Cache / L2 Cache / Ram / Disk each level can be several orders of magnitude slower.
ie. Whatever you are doing, whatever your algorithm / language / .... the speed is dominated by things like whether the data is in a RAM buffer or on disk.
When benchmarking programs on linux systems I find cache effects
dominate.
ie. On current systems the difference in speed between RAM and disk is
so vast... that 2 orders of magnitude differences in algorithms can be
swallowed entirely by whether the data is in a ram disk buffer or on
disk.
So linux has a way of flushing clean caches....
echo 3 > /proc/sys/vm/drop_caches
For example... cat the cat program to /dev/null twice to make the
cache "hot", measure the time on a "hot" cache.
Then sync and drop the caches and do it on a cold cache.
Oh yes, don't forget the other side... The job is not done until it's actually flushed to disk. So add a "sync" or two at the end of your commands when you are benchmarking.
So try that.
sync then drop the caches then just "cat" the entire file to another file and sync. Compare the "wall clock" time for that versus mawk. I suspect you will find the real bottleneck is just shuffling things off your drive.
The other thing you can do is when NR goes above your upper bound, exit!
how about ... the awk family works on pattern match followed by action. The algorithm does not need pattern matching. So ... might it be quicker to do it all in the /BEGIN/ section, using a getline loop?
NOTE: See below for the right program, this one calculates min and max from a list, not what you want it seems.
Well, if you can use C, I wrote this program a while ago to get minimum and maximum of a list (someone else here asked for it):
Code:
// calculates min and max for input file
// not much input checking is done, so use this program wisely, don't throw it crap or it will give you crap
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
FILE * file;
int num, max, min;
// make sure we have only one input parameter, this includes the name of the program run, thus 2
if ( 2 != argc )
{
fputs ("ERROR: Need exactly 1 argument !\n",stderr);
printf("Usage:\t minmax file\n");
exit(1);
}
// open file for reading
file = fopen ( argv[1] , "r" );
if ( file == NULL ) { fputs ("ERROR: failed to open file !\n",stderr); exit (1); }
// get num from file
fscanf (file, "%d", &num);
// set starting values
min = max = num;
// read numbers until EOF, setting min and max as we go
while ( !feof(file) )
{
if ( num < min )
{
min = num;
}
else if ( num > max )
{
max = num;
}
fscanf (file, "%d", &num);
}
// print min and max
printf("min=\t%d\n", min);
printf("max=\t%d\n", max);
// close file
fclose (file);
return 0;
}
This only works with one column, but can be modified to work with more, or you can use another utility to fix the file beforehand. I doubt even mawk could beat C, I say run a few benchmarks.
Last edited by H_TeXMeX_H; 11-20-2009 at 04:24 AM.
If you're extracting rows from the same file for different row sets, I think you'd be much better off to load the data into a data base (e.g., sqlite), create an index on the row number, and select the rows you want.
If you're extracting rows from the same file for different row sets, I think you'd be much better off to load the data into a data base (e.g., sqlite), create an index on the row number, and select the rows you want.
Hmm. for data mining gigabytes of data NOSQL is one of my favourites and fits neatly into what he is doing already.
Alas.. that C program doesn't do what the mawk does.
The C computes the min and max of a list of numbers.
The Mawk writes out all lines with line number greater and equal to min and less than and equal to max.
oops, sorry, I was tired, here's the correct one
PHP Code:
// prints out the lines between and inculding the bounds given
// maximum length of a line #define STR_MAX 100
#include <stdio.h> #include <stdlib.h>
int main(int argc, char *argv[]) { FILE * file; char line[STR_MAX]; unsigned int min, max; unsigned int i = 1;
// check arguments if ( 4 != argc ) { fputs ("ERROR: Need exactly 3 arguments !\n",stderr); printf("Usage:\t mm min max file\n"); exit(1); }
// open file for reading file = fopen ( argv[3] , "r" ); if ( file == NULL ) { fputs ("ERROR: failed to open file !\n",stderr); exit (1); }
// read min and max from arguments sscanf(argv[1], "%u", &min); sscanf(argv[2], "%u", &max);
// print lines until EOF while ( fgets (line , STR_MAX , file) && i <= max ) { if ( i >= min ) { printf("%s",line); } i ++; }
// close file fclose (file);
return 0; }
From my tests it's about 4x faster than awk, but I don't know about mawk. (make sure to optimize for your processor and run 'strip --strip-unneeded' on it)
Also, take note of the STR_MAX variable, it is set to the maximum length of a line, make sure you set it right.
Last edited by H_TeXMeX_H; 11-20-2009 at 04:34 AM.
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.