LinuxQuestions.org - [SOLVED] Read large text files (~10GB), parse for columns, output


Can you modify the app that generates the logs? If so, the best option might be not to parse the huge logs after the fact, but to have the app generate the simplified logs you want.

If you can't modify the app, you can possibly create a small daemon that watches the log file for changes and parses a line as soon as it is written. This approach doesn't actually save you any total processing time, but does spread it out so it doesn't bog the machine down.
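
A minimal sketch of the polling half of such a daemon, assuming the log is a regular file that is only ever appended to. The function name next_complete_line is made up for illustration, and polling is just one option (inotify would be another on Linux).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Try to read the next complete line from a log that another process is
 * still appending to (hypothetical helper, not part of any library).
 *
 * Returns 1 and stores the line, without its '\n', in buf when a full line
 * is available. Returns 0 when there is no new data yet or the writer is
 * still in the middle of a line; in that case the file position is rewound
 * so the partial line is read again on the next call.
 */
int next_complete_line(FILE *fp, char *buf, size_t cap)
{
    off_t start;
    size_t len;

    assert(fp != NULL && buf != NULL && cap > 1);

    clearerr(fp);               /* forget the EOF seen on an earlier call */
    start = ftello(fp);         /* off_t, so offsets past 2GB are fine on 64-bit */

    if (fgets(buf, (int)cap, fp) == NULL)
        return 0;               /* nothing new has been written */

    len = strlen(buf);
    if (len == 0 || buf[len - 1] != '\n') {
        /* Partial line: put it back and wait for the rest. */
        fseeko(fp, start, SEEK_SET);
        return 0;
    }

    buf[len - 1] = '\0';        /* strip the newline */
    return 1;
}
```

A daemon would call this in a loop, parse each line it gets back, and sleep for a second or so whenever it returns 0. Note that a line longer than the buffer is never returned, so the buffer has to be sized generously.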

Quote:

Originally Posted by tuxdev (Post 3501307)

Can you modify the app that generates the logs? If so, the best option might be not to parse the huge logs after the fact, but to have the app generate the simplified logs you want.

If you can't modify the app, you can possibly create a small daemon that watches the log file for changes and parses a line as soon as it is written. This approach doesn't actually save you any total processing time, but does spread it out so it doesn't bog the machine down.

Exactly my point.

By the way, if the number of whitespace characters between columns is constant, 'cut' (yes, 'cut', not 'cat') can be used. Internally it should be simpler than 'awk', so it may be faster.

Anyway, that won't solve the IO speed problem.

Why didn't you write the script in Fortran?

I knew you would surrender.

It was my intense debating skills. If I can make Microsoft look good I can make anyone look good. ;-)

Please try to stay on topic. This thread is completely polluted with BS.

On topic:

The inner loop of your program, the one using strtok, is very inefficient. It is really killing your program's performance. I wrote a short program that used read() to grab chunks of data and pointers to scan the buffer, but it still could not beat the performance I was getting with awk. I suspected that copying the data into the buffer was slowing it down, so I switched to mmap. With mmap I was able to top awk's performance.

Code:

#include <stdio.h>

#include <ctype.h>

#include <fcntl.h>

#include <sys/mman.h>

#include <unistd.h>



int readsize(long offset,  long filesize, long pagesize)

{

  long left = filesize - offset;  /* long, not int: the difference exceeds INT_MAX on a ~10GB file */

  if ( left > pagesize )

      return (pagesize);

  else

      return (left);

}



void process( char * pszFileName )

{

  int fd;

  void * pmap;

  char * pBuf;

  char * pEnd;

  int fieldcount = 0;

  char ch = '\0';

  char lastChar = '\0';

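  long filesize = 0;  /* long, not int: a ~10GB size overflows a 32-bit int (assumes 64-bit long) */

  int readlen = 0;

  long offset = 0;    /* likewise, the offset passes 2GB on these files */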

  long pagesize = 0;



  pagesize = sysconf(_SC_PAGE_SIZE);



  fd = open( pszFileName, O_RDONLY );

  if ( fd < 0 )

  {

      perror( pszFileName );

      return;

  }



  filesize = lseek( fd, 0, SEEK_END );

  readlen = readsize( 0, filesize, pagesize );



  while ( readlen > 0 )

  {

      pmap = mmap( 0, readlen, PROT_READ, MAP_SHARED, fd, offset );

      if ( pmap == MAP_FAILED )

      {

        perror( "mmap" );

        break;

      }

      pBuf = (char *)pmap;

      pEnd = (char *)pmap + readlen;



      while ( pBuf < pEnd )

      {

        ch = *pBuf++;

        if ( ch == '\n' )

        {

            printf("\n");

            fieldcount = 0;

            lastChar = '\0';

            continue;

        }

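        /* isspace() needs an unsigned char value; a negative char is undefined behavior */

        else if ( (!isspace((unsigned char)ch)) && (isspace((unsigned char)lastChar)))

            fieldcount++;

        else if ( (isspace((unsigned char)ch)) && (isspace((unsigned char)lastChar)))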

            continue;



        switch ( fieldcount + 1 )

        {

              case 1:

              case 2:

              case 4:

              case 6:

              case 11:

              case 12:

              case 13:

                  putchar(ch);

                  break;

        }

        lastChar = ch;

      }



      munmap( pmap, readlen );

      offset = offset + readlen;

      readlen = readsize( offset, filesize, pagesize );

  }

}



int main(int argc, char *argv[])

{

  if ( argc == 2 )

  {

      process( argv[1] );

  }

}
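
For comparison, here is a variant sketch that maps the whole file with a single mmap call instead of one page at a time, which avoids an mmap/munmap pair per page. It assumes a 64-bit system, where a ~10GB mapping fits in the address space. The names wanted, filter_fields and filter_file are made up for this example, and the separator handling is deliberately a little different from the code above: runs of whitespace collapse to one space between kept columns and no trailing space is written. It has not been benchmarked against the version above.

```c
#include <assert.h>
#include <ctype.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* 1-based columns to keep: the same set as the switch statement above. */
static int wanted(int field)
{
    switch (field) {
    case 1: case 2: case 4: case 6: case 11: case 12: case 13:
        return 1;
    default:
        return 0;
    }
}

/* Write the wanted whitespace-separated columns of buf[0..len) to out. */
void filter_fields(const char *buf, size_t len, FILE *out)
{
    int field = 0;      /* number of the column we are in, 0 before the first */
    int in_field = 0;   /* nonzero while scanning inside a column */
    int printed = 0;    /* kept columns already written on this line */
    size_t i;

    assert(out != NULL && (buf != NULL || len == 0));

    for (i = 0; i < len; i++) {
        unsigned char ch = (unsigned char)buf[i];

        if (ch == '\n') {               /* end of record: reset per-line state */
            putc('\n', out);
            field = in_field = printed = 0;
            continue;
        }
        if (isspace(ch)) {              /* separator: leave the current column */
            in_field = 0;
            continue;
        }
        if (!in_field) {                /* first character of a new column */
            in_field = 1;
            field++;
            if (wanted(field) && printed++ > 0)
                putc(' ', out);
        }
        if (wanted(field))
            putc(ch, out);
    }
}

/* Map the whole file read-only and filter it. Returns 0 on success, -1 on error. */
int filter_file(const char *path, FILE *out)
{
    struct stat st;
    void *map;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {              /* mmap rejects a zero-length mapping */
        close(fd);
        return 0;
    }

    map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                          /* the mapping stays valid after close */
    if (map == MAP_FAILED)
        return -1;

    filter_fields(map, (size_t)st.st_size, out);
    munmap(map, (size_t)st.st_size);
    return 0;
}
```

Calling filter_file(argv[1], stdout) from main would give roughly the same command-line behavior as the program above.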

Quote:

Originally Posted by vache (Post 3501285)

I've done benchmarks using perl, ruby, php, and python scripts (all were slower than awk).

You can try mmap. Both Perl and Python have mmap bindings.

Quote:

Originally Posted by crabboy (Post 3501954)

Please try to stay on topic. This thread is completely polluted with BS.

Are you able to delete all that BS?