LinuxQuestions.org
Old 04-07-2009, 11:13 AM   #46
vache
LQ Newbie
 
Registered: Apr 2009
Posts: 5
Thanked: 0

Original Poster

Oh, a little more info about the logs:

The log files are cycled out daily. There are ~100 of them, ranging from a few MB to over 10GB in size, all raw text. They all have the same number of columns. I wanted specific columns out of each file, regardless of the actual content inside each column (hence the simple awk statement).

I've done benchmarks using perl, ruby, php, and python scripts (all were slower than awk). My C app is about 1 second faster than awk on the same hardware, which really isn't any difference at all.

I just wanted to know if there was an alternative to the well-known (f|safe)read() functions specifically for large files, like the kind I was dealing with, which there isn't.

Thanks for the interesting feedback.

Last edited by vache; 04-07-2009 at 11:16 AM..
Old 04-07-2009, 11:20 AM   #47
jglands
LQ Newbie
 
Registered: Apr 2009
Posts: 17
Thanked: 1
Quote:
Originally Posted by int0x80 View Post
You wish C# could be as good as Java. Good luck with your Xtra Proprietary OS.
Windows has Java as well. That's why people create in Java, because they know there is no money in Linux so they want it to work in a real operating system.
Old 04-07-2009, 11:22 AM   #48
int0x80
Member
 
Registered: Sep 2002
Location: Cincinnati
Distribution: Debian GNU/Linux
Posts: 300
Thanked: 1
Quote:
Originally Posted by jglands View Post
Windows has Java as well. That's why people create in Java, because they know there is no money in Linux so they want it to work in a real operating system.
So you just explained why C# is unnecessary and another way for MS to rip people off.

FAIL
Old 04-07-2009, 11:25 AM   #49
jglands
LQ Newbie
 
Registered: Apr 2009
Posts: 17
Thanked: 1
Quote:
Originally Posted by int0x80 View Post
So you just explained why C# is unnecessary and another way for MS to rip people off.

FAIL
You're just jealous because your coding in Linux isn't making you a dime.

WIN
Old 04-07-2009, 11:30 AM   #50
int0x80
Member
 
Registered: Sep 2002
Location: Cincinnati
Distribution: Debian GNU/Linux
Posts: 300
Thanked: 1
McDonalds took the money they were wasting on MS licenses and gave it to me to write code on Linux.

WIN
Old 04-07-2009, 11:31 AM   #51
tuxdev
Senior Member
 
Registered: Jul 2005
Distribution: Slackware
Posts: 1,636
Thanked: 18
Can you modify the app that generates the logs? If so, the best option might be not to parse the huge logs after the fact, but to have the app generate the simplified logs you want.

If you can't modify the app, you can possibly create a small daemon that watches the log file for changes and parses a line as soon as it is written. This approach doesn't actually save you any total processing time, but does spread it out so it doesn't bog the machine down.
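
A minimal sketch of the watching approach in C, assuming the daemon simply polls the file rather than using a notification API. `drain_new_lines` and `echo_line` are hypothetical names made up for illustration, and the sketch assumes no log line is longer than the 64KB buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Example callback: a real daemon would pick out columns here instead. */
void echo_line(const char *line)
{
    fputs(line, stdout);
}

/* Hand every complete line that has appeared since the last call to
   on_line() and return how many lines were delivered. A trailing partial
   line (the writer is mid-line) is left in place for the next call. */
size_t drain_new_lines(FILE *fp, void (*on_line)(const char *line))
{
    char buf[65536];
    size_t delivered = 0;

    assert(fp != NULL && on_line != NULL);

    for (;;) {
        /* long is 64 bits on 64-bit Linux; ftello/fseeko are the
           portable choice for files over 2GB elsewhere. */
        long start = ftell(fp);
        if (start < 0 || fgets(buf, sizeof buf, fp) == NULL)
            break;                       /* nothing new yet */
        if (strchr(buf, '\n') == NULL) {
            fseek(fp, start, SEEK_SET);  /* incomplete line: retry later */
            break;
        }
        on_line(buf);
        delivered++;
    }
    clearerr(fp);  /* drop the EOF flag so later reads see appended data */
    return delivered;
}
```

A daemon would open the log once and call `drain_new_lines(fp, echo_line)` in a loop with a short `sleep()` between calls. Daily rotation is not handled here; the file would need to be reopened when it is cycled out.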
Old 04-07-2009, 11:32 AM   #52
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 1,036
Thanked: 52
Quote:
Originally Posted by tuxdev View Post
Can you modify the app that generates the logs? If so, the best option might be not to parse the huge logs after the fact, but to have the app generate the simplified logs you want.

If you can't modify the app, you can possibly create a small daemon that watches the log file for changes and parses a line as soon as it is written. This approach doesn't actually save you any total processing time, but does spread it out so it doesn't bog the machine down.
Exactly my point.
Old 04-07-2009, 11:35 AM   #53
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 1,036
Thanked: 52
By the way, if the number of whitespace characters between columns is constant, 'cut' (yes, 'cut', not 'cat') can be used. Internally it should be simpler than 'awk', so it may be faster.

Anyway, that won't solve the IO speed problem.
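
To illustrate why a fixed single-character delimiter is simpler to handle, here is a rough C sketch of the kind of lookup a cut-style tool can get away with. `nth_field` is a hypothetical helper, not code taken from cut itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Locate the n-th (1-based) field of a NUL-terminated line whose fields are
   separated by exactly one delim character. Returns a pointer to the field
   and stores its length in *len, or returns NULL if the line has fewer than
   n fields. */
const char *nth_field(const char *line, char delim, int n, size_t *len)
{
    const char *start = line;
    char stops[3];

    assert(line != NULL && len != NULL && n >= 1);

    while (--n > 0) {
        start = strchr(start, delim);  /* jump straight to the next delimiter */
        if (start == NULL)
            return NULL;
        start++;                       /* the field begins after it */
    }

    /* The field ends at the next delimiter, newline, or end of string. */
    stops[0] = delim;
    stops[1] = '\n';
    stops[2] = '\0';
    *len = strcspn(start, stops);
    return start;
}
```

Every delimiter ends a field, so two spaces in a row yield an empty field. That matches how cut treats its delimiter and is why this only works when the spacing between columns really is constant.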
Old 04-07-2009, 12:01 PM   #54
jglands
LQ Newbie
 
Registered: Apr 2009
Posts: 17
Thanked: 1
Why didn't you write the script in Fortran?
Old 04-07-2009, 10:32 PM   #55
jglands
LQ Newbie
 
Registered: Apr 2009
Posts: 17
Thanked: 1
I knew you would surrender.

It was my intense debating skills. If I can make Microsoft look good I can make anyone look good. ;-)
Old 04-07-2009, 11:15 PM   #56
crabboy
Moderator
 
Registered: Feb 2001
Location: Atlanta, GA
Distribution: Slackware
Posts: 1,596
Thanked: 12
Please try to stay on topic. This thread is completely polluted with BS.

On topic:

The inner loop of your program that uses strtok is very inefficient. It is really killing your program's performance. I wrote a short program that uses read to grab chunks of data and pointers to scan the buffer, but it still did not top the performance I was getting with awk. I think the data copy into the buffer was slowing it down, so I switched to mmap. With mmap I was able to top awk's performance.

Code:
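#include <stdio.h>
#include <ctype.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Size of the next chunk to map: one page, or whatever is left of the file. */
size_t readsize(off_t offset, off_t filesize, long pagesize)
{
   off_t left = filesize - offset;
   if ( left > pagesize )
      return (size_t)pagesize;
   else
      return (size_t)left;
}

void process( char * pszFileName )
{
   int fd;
   void * pmap;
   char * pBuf;
   char * pEnd;
   int fieldcount = 0;
   char ch = '\0';
   char lastChar = '\0';
   /* off_t rather than int: the logs can exceed 2GB. On 32-bit systems,
      compile with -D_FILE_OFFSET_BITS=64 so that off_t is 64 bits. */
   off_t filesize = 0;
   off_t offset = 0;
   size_t readlen = 0;
   long pagesize = 0;

   pagesize = sysconf(_SC_PAGE_SIZE);

   fd = open( pszFileName, O_RDONLY );
   if ( fd < 0 )
   {
      perror( pszFileName );
      return;
   }

   filesize = lseek( fd, 0, SEEK_END );
   if ( filesize < 0 )
   {
      perror( "lseek" );
      close( fd );
      return;
   }
   readlen = readsize( 0, filesize, pagesize );

   while ( readlen > 0 )
   {
      /* offset only ever advances by whole pages, so it stays page-aligned */
      pmap = mmap( 0, readlen, PROT_READ, MAP_SHARED, fd, offset );
      if ( pmap == MAP_FAILED )
      {
         perror( "mmap" );
         break;
      }
      pBuf = (char *)pmap;
      pEnd = (char *)pmap + readlen;

      while ( pBuf < pEnd )
      {
         ch = *pBuf++;
         if ( ch == '\n' )
         {
            /* end of line: start counting fields again */
            putchar('\n');
            fieldcount = 0;
            lastChar = '\0';
            continue;
         }
         else if ( (!isspace((unsigned char)ch)) && (isspace((unsigned char)lastChar)))
            fieldcount++;   /* first character of a new field */
         else if ( (isspace((unsigned char)ch)) && (isspace((unsigned char)lastChar)))
            continue;       /* skip repeated whitespace */

         /* print only the wanted columns (1-based) */
         switch ( fieldcount + 1 )
         {
               case 1:
               case 2:
               case 4:
               case 6:
               case 11:
               case 12:
               case 13:
                  putchar(ch);
                  break;
         }
         lastChar = ch;
      }

      munmap( pmap, readlen );
      offset = offset + readlen;
      readlen = readsize( offset, filesize, pagesize );
   }

   close( fd );
}

int main(int argc, char *argv[])
{
   if ( argc != 2 )
   {
      fprintf( stderr, "usage: %s logfile\n", argv[0] );
      return 1;
   }
   process( argv[1] );
   return 0;
}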
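
The read-based variant mentioned in this post was not included. The following is only a sketch of what its scanning half might look like, written so that it does not care whether the bytes come from read() or mmap. `struct field_scan`, `wanted`, and `scan_fields` are hypothetical names, the column list is copied from the switch in the program above, and unlike that program this sketch collapses each run of whitespace into a single space.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Scanner state, kept between calls so a line may span two buffers. */
struct field_scan {
    int field;     /* 1-based index of the current field, 0 before the first */
    int in_field;  /* non-zero while inside a field */
    int emitted;   /* wanted fields already started on this line */
};

/* Column selection copied from the posted program. */
static int wanted(int field)
{
    switch (field) {
    case 1: case 2: case 4: case 6: case 11: case 12: case 13:
        return 1;
    default:
        return 0;
    }
}

/* Copy the wanted fields of buf[0..len) to out, separated by single spaces.
   out must have room for len + 1 bytes. Returns the number of bytes written. */
size_t scan_fields(struct field_scan *st, const char *buf, size_t len, char *out)
{
    const char *p = buf;
    const char *end = buf + len;
    char *o = out;

    assert(st != NULL && buf != NULL && out != NULL);

    while (p < end) {
        unsigned char ch = (unsigned char)*p++;
        if (ch == '\n') {
            *o++ = '\n';                 /* end of line: reset the state */
            st->field = st->in_field = st->emitted = 0;
        } else if (isspace(ch)) {
            st->in_field = 0;            /* whitespace ends the current field */
        } else {
            if (!st->in_field) {         /* first character of a new field */
                st->in_field = 1;
                st->field++;
                if (wanted(st->field) && st->emitted++ > 0)
                    *o++ = ' ';
            }
            if (wanted(st->field))
                *o++ = (char)ch;
        }
    }
    return (size_t)(o - out);
}
```

A caller would zero-initialise one `struct field_scan`, then pass each chunk it reads to `scan_fields` and write out the returned number of bytes. Unlike strtok, nothing in the input is modified and no line is ever copied before it is examined.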
Old 04-07-2009, 11:47 PM   #57
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 1,814
Blog Entries: 5
Thanked: 115
Quote:
Originally Posted by vache View Post
I've done benchmarks using perl, ruby, php, and python scripts (all were slower than awk).
You can try mmap: Perl mmap, Python mmap.
Old 04-07-2009, 11:51 PM   #58
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 1,814
Blog Entries: 5
Thanked: 115
Quote:
Originally Posted by crabboy View Post
Please try to stay on topic. This thread is completely polluted with BS.
Are you able to delete all that BS?
All times are GMT -5. The time now is 01:46 PM.