[SOLVED] Read large text files (~10GB), parse for columns, output
The log files are cycled out daily. There are ~100 of them, ranging from a few megabytes to over 10GB each, all raw text, and all with the same number of columns. The data I wanted out of each file was a set of specific columns, with no regard to the actual content inside each column (hence the simple awk statement).
I've benchmarked perl, ruby, php, and python scripts; all were slower than awk. My C app is about 1 second faster than awk on the same hardware, which really isn't any difference at all.
I just wanted to know whether there was an alternative to the well-known (f|safe)read() functions specifically for large files like the ones I was dealing with. It turns out there isn't.
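There's no special large-file variant of read()/fread() as such, but you can hint the kernel that you're going to scan the whole file sequentially with posix_fadvise(2), which makes readahead more aggressive, and read in large chunks to cut down on syscalls. A minimal sketch (the function name and buffer size are just illustrative choices, not anything from the thread):

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <unistd.h>

/* Sequentially read an entire file, hinting the kernel first.
   Returns the number of bytes read, or -1 on error. */
long long count_bytes(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Advise the kernel we will read the file start-to-end, so it
       can enlarge its readahead window. (Hint only; safe to ignore
       the return value on filesystems that don't support it.) */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[1 << 16];          /* 64KB chunks: far fewer syscalls */
    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;

    close(fd);
    return total;
}
```

The parsing loop would go where the byte counting is; the point is the fadvise hint plus large reads, which is about as close as POSIX gets to a "large file" read API.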
You wish C# could be as good as Java. Good luck with your Xtra Proprietary OS.
Windows has Java as well. That's why people create in Java, because they know there is no money in Linux so they want it to work in a real operating system.
So you just explained why C# is unnecessary and another way for MS to rip people off.
Can you modify the app that generates the logs? If so, the best option might not be to parse the huge logs after the fact, but to have the app generate the simplified logs you want.
If you can't modify the app, you could create a small daemon that watches the log file for changes and parses each line as soon as it is written. This approach doesn't save any total processing time, but it spreads the work out so it doesn't bog the machine down.
By the way, if the whitespace between columns is constant (e.g. a single space), 'cut' (yes, 'cut', not 'cat') can be used, e.g. cut -d' ' -f1,2,4,6,11-13 logfile. Internally it should be simpler than 'awk', so it may be faster.
Please try to stay on topic. This thread is completely polluted with BS.
On topic:
The inner loop of your program that is using strtok is very inefficient; it is really killing your program's performance. I wrote a short program using read to grab chunks of data and pointers to scan the buffer, and it still would not beat the performance I was getting with awk. I think the copy of the data into the buffer was slowing it down, so I switched to using mmap. With mmap I was able to top awk's performance.
Code:
#include <stdio.h>
#include <ctype.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* How many bytes to map next: one page, or whatever is left of the file. */
static long readsize(off_t offset, off_t filesize, long pagesize)
{
    off_t left = filesize - offset;
    if (left > pagesize)
        return pagesize;
    else
        return (long) left;
}

void process(char *pszFileName)
{
    int fd;
    void *pmap;
    char *pBuf;
    char *pEnd;
    int fieldcount = 0;
    char ch = '\0';         /* '\0', not (char) NULL */
    char lastChar = '\0';
    off_t filesize;         /* off_t, not int: the files can exceed 2GB */
    long readlen;
    off_t offset = 0;
    long pagesize = sysconf(_SC_PAGE_SIZE);

    fd = open(pszFileName, O_RDONLY);
    if (fd < 0)
        return;
    filesize = lseek(fd, 0, SEEK_END);
    readlen = readsize(0, filesize, pagesize);
    while (readlen > 0)
    {
        /* Map the next chunk; offset stays page-aligned as mmap requires. */
        pmap = mmap(0, readlen, PROT_READ, MAP_SHARED, fd, offset);
        if (pmap == MAP_FAILED)
            break;
        pBuf = (char *) pmap;
        pEnd = (char *) pmap + readlen;
        while (pBuf < pEnd)
        {
            ch = *pBuf++;
            if (ch == '\n')
            {
                printf("\n");
                fieldcount = 0;
                lastChar = '\0';
                continue;
            }
            else if (!isspace((unsigned char) ch) && isspace((unsigned char) lastChar))
                fieldcount++;   /* start of a new field */
            else if (isspace((unsigned char) ch) && isspace((unsigned char) lastChar))
                continue;       /* collapse runs of whitespace */
            switch (fieldcount + 1)
            {
                case 1:
                case 2:
                case 4:
                case 6:
                case 11:
                case 12:
                case 13:
                    printf("%c", ch);
                    break;
            }
            lastChar = ch;
        }
        munmap(pmap, readlen);
        offset = offset + readlen;
        readlen = readsize(offset, filesize, pagesize);
    }
    close(fd);
}

int main(int argc, char *argv[])
{
    if (argc == 2)
        process(argv[1]);
    return 0;
}