Oh, a little more info about the logs:
The log files are cycled out daily. There are ~100 of them, ranging from a few megabytes to over 10GB, all raw text. They all have the same number of columns. What I wanted out of each file was a set of specific columns, with no filtering on the actual content inside those columns (hence the simple awk statement). I've done benchmarks using perl, ruby, php, and python scripts, and all were slower than awk. My C app is about 1 second faster than awk on the same hardware, which really isn't any difference at all. I just wanted to know if there was an alternative to the well-known (f|safe)read() functions specifically for large files, like the kind I was dealing with, which there isn't. Thanks for the interesting feedback. |
McDonalds took the money they were wasting on MS licenses and gave it to me to write code on Linux.
WIN |
Can you modify the app that generates the logs? If so, then the best option might not be to parse the huge logs after the fact, but to have the app generate the simplified logs you want.
If you can't modify the app, you could create a small daemon that watches the log file for changes and parses each line as soon as it is written. This approach doesn't actually save you any total processing time, but it does spread the work out so it doesn't bog the machine down. |
By the way, if the amount of whitespace between columns is constant, 'cut' (yes, 'cut', not 'cat') can be used; internally it should be simpler than 'awk', so it may be faster.
Anyway, that won't solve the IO speed problem. |
Why didn't you write the script in Fortran? |
I knew you would surrender.
It was my intense debating skills. If I can make Microsoft look good I can make anyone look good. ;-) |
Please try to stay on topic. This thread is completely polluted with BS.
On topic: The inner loop of your program that uses strtok is very inefficient; it is really killing your program's performance. I wrote a short program using read() to grab chunks of data and pointers to scan the buffer, and it still would not top the performance I was getting with awk. I think the data copy into the buffer was slowing it down, so I switched to using mmap. With mmap I was able to top awk's performance. Code:
#include <stdio.h> |