LinuxQuestions.org

Read large text files (~10GB), parse for columns, output
https://www.linuxquestions.org/questions/programming-9/read-large-text-files-%7E10gb-parse-for-columns-output-717217/

tuxdev 04-07-2009 10:31 AM

Can you modify the app that generates the logs? If so, the best option might be not to parse the huge logs after the fact, but to have the app generate the simplified logs you want.

If you can't modify the app, you could write a small daemon that watches the log file for changes and parses each line as soon as it is written. This approach doesn't save you any total processing time, but it does spread the work out so it doesn't bog the machine down.
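
A minimal sketch of the polling side of such a daemon, assuming the log is only ever appended to and that lines are shorter than the caller's buffer. The name `next_complete_line` is hypothetical, and log rotation or truncation is not handled:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: fetch the next complete line appended to the log.
 * Returns 1 and fills buf when a full line (ending in '\n') is available.
 * Returns 0 when there is nothing new yet, or only a partial line; in that
 * case the read position is restored so the line is re-read once complete.
 * Assumption: lines are shorter than size, otherwise this never advances. */
int next_complete_line(FILE *log, char *buf, size_t size)
{
    fpos_t start;

    clearerr(log);                       /* forget an EOF seen on an earlier call */
    if (fgetpos(log, &start) != 0)
        return 0;
    if (fgets(buf, (int)size, log) == NULL)
        return 0;                        /* nothing new has been written */
    if (strchr(buf, '\n') == NULL) {
        fsetpos(log, &start);            /* partial line: rewind and wait */
        return 0;
    }
    return 1;
}
```

The daemon's main loop would call this repeatedly, parse each returned line, and sleep briefly (for example `sleep(1)` from `<unistd.h>`) whenever it returns 0.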

Sergei Steshenko 04-07-2009 10:32 AM

Quote:

Originally Posted by tuxdev (Post 3501307)
Can you modify the app that generates the logs? If so, the best option might be not to parse the huge logs after the fact, but to have the app generate the simplified logs you want.

If you can't modify the app, you could write a small daemon that watches the log file for changes and parses each line as soon as it is written. This approach doesn't save you any total processing time, but it does spread the work out so it doesn't bog the machine down.

Exactly my point.

Sergei Steshenko 04-07-2009 10:35 AM

By the way, if the number of whitespace characters between columns is constant, 'cut' (yes, 'cut', not 'cat') can be used. Internally it should be simpler than 'awk', so it may be faster.

Anyway, that won't solve the IO speed problem.

jglands 04-07-2009 11:01 AM

Why didn't you write the script in Fortran?

jglands 04-07-2009 09:32 PM

I knew you would surrender.

It was my intense debating skills. If I can make Microsoft look good I can make anyone look good. ;-)

crabboy 04-07-2009 10:15 PM

Please try to stay on topic. This thread is completely polluted with BS.

On topic:

The inner loop of your program, the one using strtok, is very inefficient, and it is really killing your program's performance. I wrote a short program that uses read to grab chunks of data and pointers to scan the buffer, but it still could not top the performance I was getting with awk. I suspected that copying the data into the buffer was slowing it down, so I switched to mmap. With mmap I was able to top awk's performance.

Code:

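#define _FILE_OFFSET_BITS 64  /* 64-bit off_t, so ~10GB files also work on 32-bit systems */

#include <stdio.h>
#include <ctype.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Length of the next chunk to map: one page, or whatever is left of the file. */
size_t readsize(off_t offset, off_t filesize, long pagesize)
{
  off_t left = filesize - offset;
  if ( left > pagesize )
      return ((size_t)pagesize);
  else
      return ((size_t)left);
}

int process( char * pszFileName )
{
  int fd;
  void * pmap;
  char * pBuf;
  char * pEnd;
  int fieldcount = 0;
  char ch = '\0';
  char lastChar = '\0';
  off_t filesize = 0;   /* off_t, not int: an int overflows past 2GB */
  size_t readlen = 0;
  off_t offset = 0;
  long pagesize = 0;

  pagesize = sysconf(_SC_PAGE_SIZE);

  fd = open( pszFileName, O_RDONLY );
  if ( fd < 0 )
  {
      perror( "open" );
      return 1;
  }

  filesize = lseek( fd, 0, SEEK_END );
  if ( filesize < 0 )
  {
      perror( "lseek" );
      close( fd );
      return 1;
  }
  readlen = readsize( 0, filesize, pagesize );

  while ( readlen > 0 )
  {
      /* offset stays a multiple of the page size, as mmap requires */
      pmap = mmap( 0, readlen, PROT_READ, MAP_SHARED, fd, offset );
      if ( pmap == MAP_FAILED )
      {
        perror( "mmap" );
        close( fd );
        return 1;
      }
      pBuf = (char *)pmap;
      pEnd = (char *)pmap + readlen;

      while ( pBuf < pEnd )
      {
        ch = *pBuf++;
        if ( ch == '\n' )
        {
            putchar('\n');
            fieldcount = 0;
            lastChar = '\0';
            continue;
        }
        /* first character of a new field */
        else if ( (!isspace((unsigned char)ch)) && (isspace((unsigned char)lastChar)))
            fieldcount++;
        /* collapse runs of whitespace into one separator */
        else if ( (isspace((unsigned char)ch)) && (isspace((unsigned char)lastChar)))
            continue;

        /* print only the wanted columns (numbered from 1) */
        switch ( fieldcount + 1 )
        {
              case 1:
              case 2:
              case 4:
              case 6:
              case 11:
              case 12:
              case 13:
                  putchar(ch);
                  break;
        }
        lastChar = ch;
      }

      munmap( pmap, readlen );
      offset = offset + readlen;
      readlen = readsize( offset, filesize, pagesize );
  }

  close( fd );
  return 0;
}

int main(int argc, char *argv[])
{
  if ( argc == 2 )
  {
      return process( argv[1] );
  }
  fprintf( stderr, "usage: %s file\n", argv[0] );
  return 1;
}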
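
The read-based variant mentioned in this post was not shown. A rough sketch of what such a loop could look like, using the same column set as the program above. The names `filter_columns` and `wanted` and the 64 KiB chunk size are illustrative assumptions, not the poster's code:

```c
#include <ctype.h>
#include <stdio.h>
#include <unistd.h>

/* Same column set as the mmap program: 1, 2, 4, 6, 11, 12, 13. */
static int wanted(int field)
{
    switch (field) {
    case 1: case 2: case 4: case 6: case 11: case 12: case 13:
        return 1;
    default:
        return 0;
    }
}

/* Read fd in fixed-size chunks and write the selected columns to out.
 * The field number and whitespace state live outside the chunk loop,
 * so a line that straddles two chunks is still handled correctly.
 * Returns 0 on success, -1 on a read error. */
int filter_columns(int fd, FILE *out)
{
    static char buf[1 << 16];            /* 64 KiB, an arbitrary chunk size */
    int field = 1;
    int last_space = 0;
    ssize_t n;

    while ((n = read(fd, buf, sizeof buf)) > 0) {
        const char *p = buf;
        const char *end = buf + n;

        while (p < end) {
            unsigned char ch = (unsigned char)*p++;

            if (ch == '\n') {            /* end of line: reset the state */
                putc('\n', out);
                field = 1;
                last_space = 0;
                continue;
            }
            if (isspace(ch)) {
                if (last_space)
                    continue;            /* collapse runs of whitespace */
                last_space = 1;
            } else {
                if (last_space)
                    field++;             /* first character of a new field */
                last_space = 0;
            }
            if (wanted(field))
                putc(ch, out);
        }
    }
    return n < 0 ? -1 : 0;
}
```

Calling `filter_columns(fd, stdout)` on a file containing the line `a b c d e f` would print `a b d f`. Unlike mmap, every chunk is copied from the kernel into `buf`, which is the copy the post suspects of costing time.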


ghostdog74 04-07-2009 10:47 PM

Quote:

Originally Posted by vache (Post 3501285)
I've done benchmarks using perl, ruby, php, and python scripts (all were slower than awk).

You can try mmap: Perl mmap, Python mmap.

ghostdog74 04-07-2009 10:51 PM

Quote:

Originally Posted by crabboy (Post 3501954)
Please try to stay on topic. This thread is completely polluted with BS.

Are you able to delete all that BS?


All times are GMT -5.