awk speed optimisation
Hi,
I have a simple mawk script which needs speed optimisation. Its input is a very large ASCII file (10 GB or so), which is making things very slow at the moment. The problem is as follows:

set min=a value
set max=a larger value

The guts of it:

mawk '{if (NR >= '"$min"' && NR <= '"$max"') {print $0}}' largeinput >! smalloutput

I am using mawk as it is supposed to be ultra quick, even quicker than C, so people suggest. It has already cut the runtime by about half compared with gawk. The variables min and max actually vary within a while loop, which I haven't included. I have isolated the slow part of the routine and it is essentially this single operation. I was wondering if anybody could think of how to optimise this mawk one-liner for speed? :) |
Show your entire code. Show your input data and describe your output.
|
Hmm.
How about changing

mawk '{if (NR >= '"$min"' && NR <= '"$max"') {print $0}}'

to something like

mawk 'NR >= '"$min"' && NR <= '"$max"

The usual awk thing is "pattern action": if the pattern matches, do the action. If there is no pattern, awk does the action every time. If there is a pattern and no action, the default action is to print the record. You have no pattern, so you do the action every time, and the sad thing is that your action is "if in range, print record". So what I'm proposing is to drop the action (the default is what you want) and make the expression your pattern. My .signature really does apply to this case! |
If you want speed, why not write it in C? Does it have to be in awk?
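The pattern-only form can be sketched as follows. This is a minimal illustration, not the poster's exact command: it is written for a POSIX shell rather than the csh-style `>!` redirection used in the thread, it calls plain `awk` (substitute `mawk` if installed), it passes the bounds with `-v` instead of splicing shell variables into the quoted program, and the ten-row sample file is made up to stand in for largeinput.

```shell
# Hypothetical sample data standing in for largeinput: 10 rows of "id value".
tmp=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 10; i++) print 817 + i, i * 0.5 }' > "$tmp/largeinput"

min=3
max=5

# A pattern with no action: awk prints every record for which the pattern is true.
# -v copies the shell values into awk variables, so the program stays in plain single quotes.
result=$(awk -v min="$min" -v max="$max" 'NR >= min && NR <= max' "$tmp/largeinput")
printf '%s\n' "$result"
```

With the sample data this selects rows 3 to 5 of the file. Whether it is measurably faster than the explicit `if` version is something to time on the real input.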
|
Yeah, I guess we need to see the code.
|
awk mawk C
Thanks for all your replies they are much appreciated.
The input file is just a two-column ASCII list, e.g.

818 4.35787
819 4.46768
820 4.58789

The total number of rows is 6348780 and the total file size is 88544513 bytes (OK, I over-exaggerated the file size initially, but it is still slow). I tried cyent's suggestion, which works fine but doesn't seem to provide a significant performance uplift. I kind of get what he is saying though.

The variables min and max are actually defined by looking in another list, which tells me the specific row numbers required. I am pretty sure that the performance has nothing to do with the while loop. This mawk routine is the slowdown point: I have put echo statements around each procedure in the loop and you can clearly see the computer hanging while it executes the stated mawk. You can also see that the runtime of this mawk routine is large (approx. 3 seconds) if you just run this single procedure on the command line:

mawk 'NR >= '"$min"' && NR <= '"$max"'' largeinput >! smalloutput

I have even tried putting actual values in the mawk statement to avoid the internal variables, i.e.

mawk 'NR >= 90 && NR <= 180' largeinput >! smalloutput

This does not help either. An alternative is to sort the input list with the following:

sort <largeinput -n +0 +1 >! sortedlargeinput

Then each time I just need to take the first 90 rows, so I could use something like this:

mawk 'NR <= 90' sortedlargeinput >! smalloutput

But what gets me is that this isn't any faster either, despite the fact that mawk should just terminate the procedure once it has got the first 90 rows, which can't take it that long. I therefore presume that the slowdown is due to the fact that the input must be loaded into memory. Once this is done (which takes the majority of the time), the actual details given in the mawk routine don't take very long?

So is mawk any faster than C? Any suggestions on how to remedy this would be much appreciated. Thanks for your input so far. |
These days the memory hierarchy dominates: registers / L1 cache / L2 cache / RAM / disk. Each level can be orders of magnitude slower than the one before it.

I.e. whatever you are doing, whatever your algorithm / language / ..., the speed is dominated by things like whether the data is in a RAM buffer or on disk. When benchmarking programs on Linux systems I find cache effects dominate. On current systems the difference in speed between RAM and disk is so vast that a two-orders-of-magnitude difference between algorithms can be swallowed entirely by whether the data is in a RAM disk buffer or on disk.

So Linux has a way of flushing clean caches:

echo 3 > /proc/sys/vm/drop_caches

For example, cat the cat program to /dev/null twice to make the cache "hot" and measure the time on the hot cache. Then sync, drop the caches, and do it again on a cold cache:

cat /bin/cat > /dev/null;time cat /bin/cat > /dev/null;sync;sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches';time cat /bin/cat > /dev/null

real 0m0.002s
user 0m0.000s
sys 0m0.004s

real 0m0.172s
user 0m0.000s
sys 0m0.008s

86x slower!

Oh yes, don't forget the other side: the job is not done until it's actually flushed to disk. So add a "sync" or two at the end of your commands when you are benchmarking.

So try that: sync, then drop the caches, then just "cat" the entire file to another file and sync. Compare the "wall clock" time for that versus mawk. I suspect you will find the real bottleneck is just shuffling things off your drive.

The other thing you can do is exit as soon as NR goes above your upper bound! |
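Since the row ranges come from a second list and the same file is re-read for every range, one option is to read the big file only once and write every range in that single pass. The sketch below rests on several assumptions that are not in the post: the range list is a hypothetical file with one "min max" pair per line, the ranges are sorted and do not overlap, each range goes to its own hypothetical file `range_N.txt`, and the shell is POSIX sh with plain `awk` (substitute `mawk` if installed).

```shell
# Hypothetical sample data: 20 rows standing in for largeinput, and two row ranges.
tmp=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 20; i++) print i, i * 2 }' > "$tmp/largeinput"
printf '3 5\n10 11\n' > "$tmp/ranges"

awk -v outdir="$tmp" '
BEGIN { k = 1 }                                  # index of the range currently being filled
NR == FNR { lo[++n] = $1; hi[n] = $2; next }     # first file: load all ranges into memory
{
    # Once the current row is past the end of range k, close its file and move on.
    while (k <= n && FNR > hi[k]) { close(outdir "/range_" k ".txt"); k++ }
    if (k > n) exit                              # no ranges left: stop reading the big file
    if (FNR >= lo[k]) print > (outdir "/range_" k ".txt")
}
' "$tmp/ranges" "$tmp/largeinput"

cat "$tmp/range_1.txt" "$tmp/range_2.txt"
```

The design trade-off is that the input is scanned once in total instead of once per range, at the cost of requiring the ranges up front and in order. Overlapping or unsorted ranges would need a different bookkeeping scheme.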
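The early-exit idea can be sketched like this. It is an illustration under assumptions: POSIX sh rather than the csh syntax used earlier in the thread, plain `awk` (substitute `mawk` if installed), and a generated 100000-row file standing in for largeinput.

```shell
# Hypothetical sample data standing in for largeinput.
tmp=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 100000; i++) print i, i * 2 }' > "$tmp/largeinput"

min=90
max=180

# The first rule stops awk as soon as the row number passes max,
# so the rest of the file is never read or split into fields.
# The second rule is a bare pattern, so matching rows are printed by default.
awk -v min="$min" -v max="$max" 'NR > max { exit } NR >= min' "$tmp/largeinput" > "$tmp/smalloutput"

count=$(awk 'END { print NR }' "$tmp/smalloutput")
first=$(head -n 1 "$tmp/smalloutput")
last=$(tail -n 1 "$tmp/smalloutput")
echo "$count rows, from '$first' to '$last'"
```

The saving depends on where the range sits: for rows near the top of the file almost all of the read is skipped, while for rows near the end awk still has to scan nearly everything.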
My two cents: how about this? The awk family works on a pattern match followed by an action. The algorithm does not need pattern matching. So might it be quicker to do it all in the BEGIN section, using a getline loop?
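A BEGIN-only program with a getline loop might look like the sketch below. It assumes a POSIX shell and plain `awk` (substitute `mawk` if installed); the sample file and the variable names `file`, `line` and `n` are illustrative. Whether this beats the pattern form is not established in the thread and would need timing on the real data.

```shell
# Hypothetical sample data standing in for largeinput.
tmp=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 1000; i++) print i, i * 2 }' > "$tmp/largeinput"

# Everything happens in BEGIN, so awk never enters its normal pattern/action loop.
result=$(awk -v min=90 -v max=92 -v file="$tmp/largeinput" '
BEGIN {
    n = 0
    # getline returns 1 per line read, 0 at end of file, -1 on error.
    while ((getline line < file) > 0) {
        n++
        if (n > max) break          # past the upper bound: stop reading
        if (n >= min) print line    # inside the range: copy the line through
    }
    close(file)
}')
printf '%s\n' "$result"
```

Reading into a variable with `getline line < file` leaves `$0` and the fields alone, which is one reason this form may avoid some per-record work.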
|
NOTE: See below for the right program; this one calculates min and max from a list, which does not seem to be what you want.
Well, if you can use C, I wrote this program a while ago to get minimum and maximum of a list (someone else here asked for it): Code:
// calculates min and max for input file |
Thanks for the C solution, TexMeX.
Also thanks, Cyent. You lost me a bit in the tech talk, but I think I know what you are getting at. |
Alas, that C program doesn't do what the mawk does.
The C program computes the min and max of a list of numbers. The mawk writes out all lines whose line number is greater than or equal to min and less than or equal to max. |
If you're extracting rows from the same file for different row sets, I think you'd be much better off loading the data into a database (e.g., sqlite), creating an index on the row number, and selecting the rows you want.
|
PHP Code:
Also, take note of the STR_MAX variable: it is set to the maximum length of a line, so make sure you set it right. |