| vache |
04-06-2009 11:17 AM |
Read large text files (~10GB), parse for columns, output
Hello, world.
The Goal: Read in ASCII text files, parse out specific columns, send to standard out.
I'm currently using a simple awk {print $1, ...} script to accomplish this. It's all fine and good, but the files I'm reading in are massive (10GB is not uncommon in our environment) and I speculate (hope) that a C or C++ application can parse these files faster than awk.
My C-fu is weak at best (featured below is my "101" level C code mushed together after lots of Google searches - ha) and it's actually slower than awk. If it matters, I have access to some very powerful hardware (64-bit quad-core Xeon, 6GB of RAM).
What alternatives are there to fopen/etc. for reading in large files and parsing? Thanks in advance.
Code:
#include <stdio.h>
#include <string.h>

int main( int argc, char *argv[] )
{
    /* No file supplied? */
    if ( argc < 2 )
    {
        fputs( "You must supply a file to parse\n", stderr );
        return 1;
    }

    /* Open the file, read-only */
    FILE *file = fopen( argv[1], "r" );
    if ( file == NULL )
    {
        /* Report why the open failed */
        perror( argv[1] );
        return 1;
    }

    /* Assumed upper bound on line length; fgets splits longer lines */
    char line[4096];
    /* Split on spaces, tabs and the trailing newline */
    char del[] = " \t\n";

    /* While we can read a line from the file... */
    while ( fgets( line, sizeof line, file ) != NULL )
    {
        /* Convert each line in to tokens */
        char *result = strtok( line, del );
        int tkn = 1;

        /* "Foreach" token... */
        while ( result != NULL )
        {
            /* If tkn matches our list, then print */
            /* $1, $2, $3, $4, $6, $11, $12, $13 */
            if (
                tkn == 1 || tkn == 2 || tkn == 3 ||
                tkn == 4 || tkn == 6 || tkn == 11 ||
                tkn == 12 || tkn == 13
            )
            {
                printf( "%s ", result );
            }
            tkn++;
            result = strtok( NULL, del );
        }
        /* End the output line; the newline is no longer part of a token */
        putchar( '\n' );
    }

    fclose( file );
    return 0;
}
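For comparison, here is a rough sketch of the same column filter that scans each line by hand with strspn/strcspn instead of strtok, writes fields with fwrite instead of printf, and stops scanning a line once the last wanted column has been printed. The names wanted, emit_columns and filter_stream are made up for illustration, the column list is copied from the if-test in the code above, and the 64 KB line buffer is an assumed upper bound. I have not measured it against awk, so treat any speed difference as a guess.

```c
#include <stdio.h>
#include <string.h>

/* Columns to keep, 1-based and ascending (copied from the if-test above). */
static const int wanted[] = { 1, 2, 3, 4, 6, 11, 12, 13 };
enum { NWANTED = sizeof wanted / sizeof wanted[0] };

/* Write the wanted columns of one NUL-terminated line to out, separated
   by single spaces and followed by a newline. Missing columns are skipped. */
static void emit_columns( const char *line, FILE *out )
{
    const char *p = line;
    int col = 0;      /* number of the field just scanned */
    int next = 0;     /* index into wanted[] of the next column to print */
    int printed = 0;  /* has anything been written for this line yet? */

    /* Stop as soon as the last wanted column has been written. */
    while ( next < NWANTED )
    {
        p += strspn( p, " \t\r\n" );             /* skip separators */
        if ( *p == '\0' )
            break;                               /* end of line */
        size_t len = strcspn( p, " \t\r\n" );    /* length of this field */
        col++;
        if ( col == wanted[next] )
        {
            if ( printed )
                fputc( ' ', out );
            fwrite( p, 1, len, out );
            printed = 1;
            next++;
        }
        p += len;
    }
    fputc( '\n', out );
}

/* Hypothetical helper: filter every line of in to out.
   Returns 0 on success, -1 if reading failed. */
static int filter_stream( FILE *in, FILE *out )
{
    static char line[65536];  /* assumed upper bound on line length */

    while ( fgets( line, sizeof line, in ) != NULL )
        emit_columns( line, out );
    return ferror( in ) ? -1 : 0;
}
```

To try it, call filter_stream( file, stdout ) from main after the fopen. Calling setvbuf( file, NULL, _IOFBF, 1 << 20 ) right after fopen, before the first read, asks stdio for a 1 MB read buffer, which may or may not help on files this size.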
|