Oh, a little more info about the logs:
The log files are cycled out daily. There are ~100 of them, ranging from a few megabytes to over 10GB, all raw text. They all have the same number of columns. What I wanted out of each file was a set of specific columns, with no filtering on the actual content inside those columns (hence the simple awk statement). I've done benchmarks using perl, ruby, php, and python scripts, and all were slower than awk. My C app is about 1 second faster than awk on the same hardware, which really isn't any difference at all. I just wanted to know if there was an alternative to the well-known (f|safe)read() functions specifically for large files, like the kind I was dealing with, which there isn't. Thanks for the interesting feedback. |
McDonalds took the money they were wasting on MS licenses and gave it to me to write code on Linux.
WIN |
Can you modify the app that generates the logs? If so, then the best option might not be to parse the huge logs after the fact, but to have the app generate the simplified logs you want.
If you can't modify the app, you could create a small daemon that watches the log file for changes and parses each line as soon as it is written. This approach doesn't actually save you any total processing time, but it does spread the work out so it doesn't bog the machine down. |
By the way, if the amount of whitespace between columns is constant, 'cut' (yes, 'cut', not 'cat') can be used; internally it should be simpler than 'awk', so it may be faster.
Anyway, that won't solve the IO speed problem. |
Why didn't you write the script in Fortran? |
I knew you would surrender.
It was my intense debating skills. If I can make Microsoft look good I can make anyone look good. ;-) |
Please try to stay on topic. This thread is completely polluted with BS.
On topic: The inner loop of your program that uses strtok is very inefficient; it is really killing your program's performance. I wrote a short program using read() to grab chunks of data and pointers to scan the buffer, and it still would not top the performance I was getting with awk. I think the data copy into the buffer was slowing it down, so I switched to using mmap. With mmap I was able to top awk's performance. Code:
#include <stdio.h> |