Old 04-07-2009, 11:13 AM   #46
vache
LQ Newbie
 
Registered: Apr 2009
Posts: 5

Original Poster
Rep: Reputation: 0

Oh, a little more info about the logs:

The log files are cycled out daily. There are ~100 of them, ranging from a few megabytes to over 10GB in size, all raw text. They all have the same number of columns. What I wanted out of each file was a specific set of columns, regardless of the actual content of each column (hence the simple awk statement).

I've done benchmarks using perl, ruby, php, and python scripts (all were slower than awk). My C app is about 1 second faster than awk on the same hardware, which really isn't any difference at all.

I just wanted to know if there was an alternative to the well-known (f|safe)read() functions specifically for large files, like the kind I was dealing with, which there isn't.
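
For reference, here is a minimal sketch of the plain stdio baseline under discussion: fread() in large blocks on a stream whose buffer has been enlarged with setvbuf(). The names open_big_buffered and count_newlines and the 1 MiB block size are assumptions picked for illustration; this is not the C app mentioned above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE (1 << 20)   /* 1 MiB; an arbitrary illustrative size */

/* Open a file for reading and give stdio a large, fully buffered buffer.
 * setvbuf() has to be called before any other operation on the stream. */
FILE *open_big_buffered(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (fp != NULL && setvbuf(fp, NULL, _IOFBF, BLOCK_SIZE) != 0) {
        fclose(fp);
        return NULL;
    }
    return fp;
}

/* Read the stream in BLOCK_SIZE pieces and count newline characters.
 * Returns the count, or -1 on a read error. */
long long count_newlines(FILE *fp)
{
    static char block[BLOCK_SIZE];
    long long lines = 0;
    size_t n;

    assert(fp != NULL);
    while ((n = fread(block, 1, sizeof block, fp)) > 0) {
        const char *p = block;
        const char *end = block + n;
        /* memchr() hops from newline to newline within the block */
        while ((p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
            lines++;
            p++;
        }
    }
    return ferror(fp) ? -1 : lines;
}
```

Whether a bigger buffer helps at all depends on the disk and the kernel's read-ahead, so it is worth timing against the default buffer before keeping it.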

Thanks for the interesting feedback.

Last edited by vache; 04-07-2009 at 11:16 AM.
 
Old 04-07-2009, 11:20 AM   #47
jglands
LQ Newbie
 
Registered: Apr 2009
Posts: 17

Rep: Reputation: 1
Quote:
Originally Posted by int0x80
You wish C# could be as good as Java. Good luck with your Xtra Proprietary OS.
Windows has Java as well. That's why people create in Java, because they know there is no money in Linux so they want it to work in a real operating system.
 
Old 04-07-2009, 11:22 AM   #48
int0x80
Member
 
Registered: Sep 2002
Location: Cincinnati
Distribution: Debian GNU/Linux
Posts: 310

Rep: Reputation: 31
Quote:
Originally Posted by jglands
Windows has Java as well. That's why people create in Java, because they know there is no money in Linux so they want it to work in a real operating system.
So you just explained why C# is unnecessary and another way for MS to rip people off.

FAIL
 
Old 04-07-2009, 11:25 AM   #49
jglands
LQ Newbie
 
Registered: Apr 2009
Posts: 17

Rep: Reputation: 1
Quote:
Originally Posted by int0x80
So you just explained why C# is unnecessary and another way for MS to rip people off.

FAIL
You're just jealous because your coding in Linux isn't making you a dime.

WIN
 
Old 04-07-2009, 11:30 AM   #50
int0x80
Member
 
Registered: Sep 2002
Location: Cincinnati
Distribution: Debian GNU/Linux
Posts: 310

Rep: Reputation: 31
McDonalds took the money they were wasting on MS licenses and gave it to me to write code on Linux.

WIN
 
Old 04-07-2009, 11:31 AM   #51
tuxdev
Senior Member
 
Registered: Jul 2005
Distribution: Slackware
Posts: 2,014

Rep: Reputation: 115
Can you modify the app that generates the logs? If so, the best option might be not to parse the huge logs after the fact, but to have the app generate the simplified logs you want.

If you can't modify the app, you can possibly create a small daemon that watches the log file for changes and parses a line as soon as it is written. This approach doesn't actually save you any total processing time, but does spread it out so it doesn't bog the machine down.
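
A minimal sketch of the polling half of such a daemon, assuming the log is a regular file that is only ever appended to. The names drain_new_lines and print_first_field are hypothetical, and the handler only stands in for the real column extraction.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Example handler (hypothetical): print the first whitespace-separated field. */
void print_first_field(const char *line)
{
    size_t start = strspn(line, " \t");
    size_t len = strcspn(line + start, " \t\n");
    printf("%.*s\n", (int)len, line + start);
}

/* Pass every complete line that is currently available to handle().
 * A trailing piece without '\n' at end-of-file is assumed to be a line the
 * writer has not finished yet: the stream is moved back to its start so it
 * is read again on the next call.
 * Returns the number of times handle() was called. */
int drain_new_lines(FILE *fp, void (*handle)(const char *))
{
    char line[8192];   /* longer lines are delivered in pieces */
    int handled = 0;

    assert(fp != NULL && handle != NULL);
    for (;;) {
        long pos = ftell(fp);   /* long: fine on 64-bit Linux, too small on 32-bit */
        if (fgets(line, sizeof line, fp) == NULL)
            break;
        if (strchr(line, '\n') == NULL && feof(fp)) {
            fseek(fp, pos, SEEK_SET);   /* incomplete line: retry later */
            break;
        }
        handle(line);
        handled++;
    }
    clearerr(fp);   /* drop the EOF flag so later calls see appended data */
    return handled;
}
```

The daemon itself would open the log once and then call drain_new_lines(fp, print_first_field) in a loop with a short sleep() between calls. Log rotation (the file being replaced under the same name) is not handled in this sketch.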
 
Old 04-07-2009, 11:32 AM   #52
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 453
Quote:
Originally Posted by tuxdev
Can you modify the app that generates the logs? If so, the best option might be not to parse the huge logs after the fact, but to have the app generate the simplified logs you want.

If you can't modify the app, you can possibly create a small daemon that watches the log file for changes and parses a line as soon as it is written. This approach doesn't actually save you any total processing time, but does spread it out so it doesn't bog the machine down.
Exactly my point.
 
Old 04-07-2009, 11:35 AM   #53
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 453
By the way, if the number of whitespace characters between columns is constant, 'cut' (yes, 'cut', not 'cat') can be used. Internally it should be simpler than 'awk', so it may be faster.

Anyway, that won't solve the IO speed problem.
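
Why the constant-whitespace condition matters can be shown with a small sketch: cut treats every single delimiter as a field boundary, while awk with its default field separator treats a run of blanks as one separator. The two function names below are hypothetical and only mimic those two splitting rules; they are not taken from either tool's source.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* cut-style: each delimiter starts a new field, so "a  b" (two spaces)
 * has an empty field in the middle. */
int count_fields_cut(const char *line, char delim)
{
    int fields = 1;

    assert(line != NULL);
    for (; *line != '\0' && *line != '\n'; line++)
        if (*line == delim)
            fields++;
    return fields;
}

/* awk-style (default field separator): a run of blanks is one separator,
 * and leading or trailing blanks do not create fields. */
int count_fields_awk(const char *line)
{
    int fields = 0;
    int in_field = 0;

    assert(line != NULL);
    for (; *line != '\0' && *line != '\n'; line++) {
        if (isspace((unsigned char)*line)) {
            in_field = 0;
        } else if (!in_field) {
            in_field = 1;
            fields++;
        }
    }
    return fields;
}
```

With a single space between columns both rules agree. As soon as one line is padded with an extra space, the cut-style numbering shifts by one for every column after the padding.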
 
Old 04-07-2009, 12:01 PM   #54
jglands
LQ Newbie
 
Registered: Apr 2009
Posts: 17

Rep: Reputation: 1
Why didn't you write the script in Fortran?
 
Old 04-07-2009, 10:32 PM   #55
jglands
LQ Newbie
 
Registered: Apr 2009
Posts: 17

Rep: Reputation: 1
I knew you would surrender.

It was my intense debating skills. If I can make Microsoft look good I can make anyone look good. ;-)
 
Old 04-07-2009, 11:15 PM   #56
crabboy
Moderator
 
Registered: Feb 2001
Location: Atlanta, GA
Distribution: Slackware
Posts: 1,823

Rep: Reputation: 120
Please try to stay on topic. This thread is completely polluted with BS.

On topic:

The inner loop of your program that uses strtok() is very inefficient. It is really killing your program's performance. I wrote a short program that uses read() to grab chunks of data and pointers to scan the buffer, and it still would not top the performance I was getting with awk. I think the data copy into the buffer was slowing it down, so I switched to mmap(). With mmap() I was able to top awk's performance.

Code:
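#define _FILE_OFFSET_BITS 64   /* 64-bit off_t on 32-bit systems: the logs can exceed 2GB */

#include <stdio.h>
#include <ctype.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

/* Number of bytes to map next: one page, or whatever is left of the file. */
size_t readsize(off_t offset, off_t filesize, long pagesize)
{
   off_t left = filesize - offset;
   if ( left > pagesize )
      return (size_t)pagesize;
   else
      return (size_t)left;
}

void process( char * pszFileName )
{
   int fd;
   void * pmap;
   char * pBuf;
   char * pEnd;
   int fieldcount = 0;
   unsigned char ch = '\0';        /* unsigned char keeps isspace() well defined */
   unsigned char lastChar = '\0';
   off_t filesize = 0;             /* off_t, not int: an int overflows past 2GB */
   off_t offset = 0;
   size_t readlen = 0;
   long pagesize = 0;

   pagesize = sysconf(_SC_PAGE_SIZE);

   fd = open( pszFileName, O_RDONLY );
   if ( fd < 0 )
   {
      perror( pszFileName );
      return;
   }

   filesize = lseek( fd, 0, SEEK_END );
   if ( filesize < 0 )
   {
      perror( "lseek" );
      close( fd );
      return;
   }
   readlen = readsize( 0, filesize, pagesize );

   while ( readlen > 0 )
   {
      /* offset is always a multiple of the page size, as mmap requires */
      pmap = mmap( 0, readlen, PROT_READ, MAP_SHARED, fd, offset );
      if ( pmap == MAP_FAILED )
      {
         perror( "mmap" );
         break;
      }
      pBuf = (char *)pmap;
      pEnd = (char *)pmap + readlen;

      while ( pBuf < pEnd )
      {
         ch = (unsigned char)*pBuf++;
         if ( ch == '\n' )
         {
            /* end of line: reset the field state */
            putchar('\n');
            fieldcount = 0;
            lastChar = '\0';
            continue;
         }
         else if ( (!isspace(ch)) && (isspace(lastChar)))
            fieldcount++;           /* first character of the next field */
         else if ( (isspace(ch)) && (isspace(lastChar)))
            continue;               /* collapse runs of whitespace */

         /* print only the wanted columns (1-based) */
         switch ( fieldcount + 1 )
         {
               case 1:
               case 2:
               case 4:
               case 6:
               case 11:
               case 12:
               case 13:
                  putchar(ch);
                  break;
         }
         lastChar = ch;
      }

      munmap( pmap, readlen );
      offset = offset + readlen;
      readlen = readsize( offset, filesize, pagesize );
   }

   close( fd );
}

int main(int argc, char *argv[])
{
   if ( argc != 2 )
   {
      fprintf( stderr, "usage: %s logfile\n", argv[0] );
      return 1;
   }
   process( argv[1] );
   return 0;
}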
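
For comparison, the read()-based variant described in the post (grab a chunk, scan it with pointers) could look roughly like the sketch below. It is not the program that was actually benchmarked: scan_state, scan_chunk and filter_fd are hypothetical names, the 64 KiB chunk size is an assumption, and the output differs slightly from the mmap code in that selected fields are always joined by a single space.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Parser state that has to survive from one chunk to the next. */
struct scan_state {
    int field;     /* 1-based number of the current field, 0 at line start */
    int in_field;  /* nonzero while inside a field */
    int emit;      /* nonzero if the current field is a selected one */
    int printed;   /* selected fields already written on this line */
};

static int is_wanted(int field, const int *wanted, int nwanted)
{
    int i;
    for (i = 0; i < nwanted; i++)
        if (wanted[i] == field)
            return 1;
    return 0;
}

/* Scan one chunk and write the selected fields of each line to out. */
void scan_chunk(struct scan_state *st, const char *buf, size_t len,
                const int *wanted, int nwanted, FILE *out)
{
    const char *p = buf;
    const char *end = buf + len;

    assert(st != NULL && buf != NULL && out != NULL);
    while (p < end) {
        unsigned char ch = (unsigned char)*p++;
        if (ch == '\n') {
            putc('\n', out);
            memset(st, 0, sizeof *st);      /* new line: reset everything */
        } else if (isspace(ch)) {
            st->in_field = 0;               /* runs of blanks collapse here */
        } else {
            if (!st->in_field) {            /* first character of a field */
                st->in_field = 1;
                st->emit = is_wanted(++st->field, wanted, nwanted);
                if (st->emit && st->printed++ > 0)
                    putc(' ', out);
            }
            if (st->emit)
                putc(ch, out);
        }
    }
}

/* Read fd in fixed-size chunks and filter it. Returns 0, or -1 on read error. */
int filter_fd(int fd, const int *wanted, int nwanted, FILE *out)
{
    static char buf[1 << 16];
    struct scan_state st;
    ssize_t n;

    memset(&st, 0, sizeof st);
    while ((n = read(fd, buf, sizeof buf)) > 0)
        scan_chunk(&st, buf, (size_t)n, wanted, nwanted, out);
    return n < 0 ? -1 : 0;
}
```

Calling filter_fd() on an open()ed log with the array {1, 2, 4, 6, 11, 12, 13} and stdout selects the same columns as the switch statement above. Because the state lives in scan_state, a field or line that straddles two chunks is still handled correctly.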
 
Old 04-07-2009, 11:47 PM   #57
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 241
Quote:
Originally Posted by vache
I've done benchmarks using perl, ruby, php, and python scripts (all were slower than awk).
You can try mmap: Perl mmap, Python mmap.
 
Old 04-07-2009, 11:51 PM   #58
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 241
Quote:
Originally Posted by crabboy
Please try to stay on topic. This thread is completely polluted with BS.
Are you able to delete all that BS?
 
  

