LinuxQuestions.org
Old 04-07-2009, 10:13 AM   #46
vache
LQ Newbie
 
Registered: Apr 2009
Posts: 5

Original Poster
Rep: Reputation: 0

Oh, a little more info about the logs:

The log files are cycled out daily. There are ~100 of them, ranging from a few megabytes to over 10GB in size, all raw text. They all have the same number of columns. What I wanted out of each file was a set of specific columns, regardless of the actual content inside each column (hence the simple awk statement).

I've done benchmarks using perl, ruby, php, and python scripts (all were slower than awk). My C app is about 1 second faster than awk on the same hardware, which really isn't any difference at all.

I just wanted to know if there was an alternative to the well-known (f|safe)read() functions specifically for large files, like the kind I was dealing with, which there isn't.

Thanks for the interesting feedback.

Last edited by vache; 04-07-2009 at 10:16 AM.
 
Old 04-07-2009, 10:20 AM   #47
jglands
LQ Newbie
 
Registered: Apr 2009
Posts: 17

Rep: Reputation: 1
Quote:
Originally Posted by int0x80
You wish C# could be as good as Java. Good luck with your Xtra Proprietary OS.
Windows has Java as well. That's why people create in Java, because they know there is no money in Linux so they want it to work in a real operating system.
 
Old 04-07-2009, 10:22 AM   #48
int0x80
Member
 
Registered: Sep 2002
Posts: 310

Rep: Reputation: Disabled
Quote:
Originally Posted by jglands
Windows has Java as well. That's why people create in Java, because they know there is no money in Linux so they want it to work in a real operating system.
So you just explained why C# is unnecessary and another way for MS to rip people off.

FAIL
 
Old 04-07-2009, 10:25 AM   #49
jglands
LQ Newbie
 
Registered: Apr 2009
Posts: 17

Rep: Reputation: 1
Quote:
Originally Posted by int0x80
So you just explained why C# is unnecessary and another way for MS to rip people off.

FAIL
You're just jealous because your coding in Linux isn't making you a dime.

WIN
 
Old 04-07-2009, 10:30 AM   #50
int0x80
Member
 
Registered: Sep 2002
Posts: 310

Rep: Reputation: Disabled
McDonalds took the money they were wasting on MS licenses and gave it to me to write code on Linux.

WIN
 
Old 04-07-2009, 10:31 AM   #51
tuxdev
Senior Member
 
Registered: Jul 2005
Distribution: Slackware
Posts: 2,012

Rep: Reputation: 115Reputation: 115
Can you modify the app that generates the logs? If so, then the best option might be not to parse the huge logs after the fact, but to have the app generate the simplified logs you want.

If you can't modify the app, you can possibly create a small daemon that watches the log file for changes and parses a line as soon as it is written. This approach doesn't actually save you any total processing time, but does spread it out so it doesn't bog the machine down.
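
A minimal sketch of the "watch the log and parse lines as they are written" idea, assuming the log is only ever appended to (no rotation handling) and that polling is acceptable. The names `drain_new_lines`, `line_handler` and `remember_last` are hypothetical and only serve the illustration; a real daemon would call `drain_new_lines` in a loop with a sleep (or use a notification API) and do the column extraction inside the handler.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback type: receives one line, normally ending in '\n'. */
typedef void (*line_handler)(const char *line, void *ctx);

/* Example handler (an assumption): remembers the most recent line in ctx,
 * which must point to a buffer of at least 4096 bytes. */
void remember_last(const char *line, void *ctx)
{
    strcpy((char *)ctx, line);
}

/*
 * Deliver every complete line appended to fp since the previous call.
 * A trailing line without '\n' is treated as "still being written" and is
 * left in the file for the next call. Lines longer than the buffer are
 * delivered in fragments. Returns the number of handler calls made.
 * Note: ftell/fseek use long offsets; on platforms with a 32-bit long,
 * ftello/fseeko would be needed for multi-gigabyte logs.
 */
int drain_new_lines(FILE *fp, line_handler handle, void *ctx)
{
    char buf[4096];
    int delivered = 0;

    assert(fp != NULL && handle != NULL);
    clearerr(fp);   /* forget an earlier EOF so appended data becomes visible */

    for (;;) {
        long start = ftell(fp);

        if (fgets(buf, sizeof buf, fp) == NULL)
            break;                          /* nothing new */

        if (strchr(buf, '\n') == NULL && strlen(buf) < sizeof buf - 1) {
            fseek(fp, start, SEEK_SET);     /* partial line: retry later */
            break;
        }

        handle(buf, ctx);
        delivered++;
    }
    return delivered;
}
```

As noted in the post above, this does not reduce the total work; it only spreads the parsing over the day instead of doing it in one pass over a multi-gigabyte file.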
 
Old 04-07-2009, 10:32 AM   #52
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 454Reputation: 454Reputation: 454Reputation: 454Reputation: 454
Quote:
Originally Posted by tuxdev
Can you modify the app that generates the logs? If so, then the best option might be not to parse the huge logs after the fact, but to have the app generate the simplified logs you want.

If you can't modify the app, you can possibly create a small daemon that watches the log file for changes and parses a line as soon as it is written. This approach doesn't actually save you any total processing time, but does spread it out so it doesn't bog the machine down.
Exactly my point.
 
Old 04-07-2009, 10:35 AM   #53
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 454Reputation: 454Reputation: 454Reputation: 454Reputation: 454
By the way, if the number of whitespace characters between columns is constant, 'cut' (yes, 'cut', not 'cat') can be used. Internally it should be simpler than 'awk', so it may be faster.

Anyway, that won't solve the IO speed problem.
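
To illustrate why the spacing has to be constant for this to work, here is a rough sketch of cut-style field selection, where every single delimiter character starts a new field (awk, by default, treats a whole run of whitespace as one separator). The function name `cut_field` is made up for this example; it is not how 'cut' is actually implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy field number `want` (1-based) of a single line into `out`, treating
 * each occurrence of `delim` as a separator, in the style of `cut -d`.
 * Returns the field length, or -1 if the line has fewer than `want` fields
 * or `out` is too small.
 */
int cut_field(const char *line, char delim, int want, char *out, size_t outsz)
{
    const char *start = line;
    const char *end;
    size_t len;
    int field = 1;

    assert(line != NULL && out != NULL && want >= 1);

    /* Skip forward one delimiter at a time until the wanted field starts. */
    while (field < want) {
        start = strchr(start, delim);
        if (start == NULL)
            return -1;
        start++;
        field++;
    }

    /* The field ends at the next delimiter, newline, or end of string. */
    for (end = start; *end != '\0' && *end != '\n' && *end != delim; end++)
        ;

    len = (size_t)(end - start);
    if (len + 1 > outsz)
        return -1;
    memcpy(out, start, len);
    out[len] = '\0';
    return (int)len;
}
```

With the line "a  b c" (two spaces after a), asking for field 2 yields an empty string and "b" only shows up as field 3, which is why variable spacing breaks this approach while awk's default splitting still works.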
 
Old 04-07-2009, 11:01 AM   #54
jglands
LQ Newbie
 
Registered: Apr 2009
Posts: 17

Rep: Reputation: 1
Why didn't you write the script in Fortran?
 
Old 04-07-2009, 09:32 PM   #55
jglands
LQ Newbie
 
Registered: Apr 2009
Posts: 17

Rep: Reputation: 1
I knew you would surrender.

It was my intense debating skills. If I can make Microsoft look good I can make anyone look good. ;-)
 
Old 04-07-2009, 10:15 PM   #56
crabboy
Senior Member
 
Registered: Feb 2001
Location: Atlanta, GA
Distribution: Slackware
Posts: 1,821

Rep: Reputation: 121Reputation: 121
Please try to stay on topic. This thread is completely polluted with BS.

On topic:

The inner loop of your program, the one using strtok, is very inefficient. It is really killing your program's performance. I wrote a short program that uses read to grab chunks of data and pointers to scan the buffer, and it still could not beat the performance I was getting with awk. I suspected that the data copy into the buffer was slowing it down, so I switched to using mmap. With mmap I was able to beat awk's performance.

Code:
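#include <stdio.h>
#include <ctype.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

/* Bytes to map next: one page, or whatever remains of the file. */
size_t readsize(off_t offset, off_t filesize, long pagesize)
{
   off_t left = filesize - offset;
   if ( left > (off_t)pagesize )
      return ((size_t)pagesize);
   else
      return ((size_t)left);
}

void process( char * pszFileName )
{
   int fd;
   void * pmap;
   char * pBuf;
   char * pEnd;
   int fieldcount = 0;
   char ch = '\0';
   char lastChar = '\0';
   off_t filesize = 0;   /* off_t rather than int: the logs can exceed 2GB */
   size_t readlen = 0;
   off_t offset = 0;
   long pagesize = 0;

   pagesize = sysconf(_SC_PAGE_SIZE);

   fd = open( pszFileName, O_RDONLY );
   if ( fd < 0 )
   {
      perror( pszFileName );
      return;
   }

   filesize = lseek( fd, 0, SEEK_END );
   if ( filesize < 0 )
   {
      perror( "lseek" );
      close( fd );
      return;
   }
   readlen = readsize( 0, filesize, pagesize );

   while ( readlen > 0 )
   {
      /* offset is always a multiple of the page size, as mmap requires */
      pmap = mmap( 0, readlen, PROT_READ, MAP_SHARED, fd, offset );
      if ( pmap == MAP_FAILED )
      {
         perror( "mmap" );
         break;
      }
      pBuf = (char *)pmap;
      pEnd = (char *)pmap + readlen;

      while ( pBuf < pEnd )
      {
         ch = *pBuf++;
         /* newline ends the record; a non-space after whitespace starts a
            new field; runs of whitespace are collapsed */
         if ( ch == '\n' )
         {
            putchar('\n');
            fieldcount = 0;
            lastChar = '\0';
            continue;
         }
         else if ( (!isspace((unsigned char)ch)) && (isspace((unsigned char)lastChar)))
            fieldcount++;
         else if ( (isspace((unsigned char)ch)) && (isspace((unsigned char)lastChar)))
            continue;

         /* print only the wanted columns (1-based) */
         switch ( fieldcount + 1 )
         {
               case 1:
               case 2:
               case 4:
               case 6:
               case 11:
               case 12:
               case 13:
                  putchar(ch);
                  break;
         }
         lastChar = ch;
      }

      munmap( pmap, readlen );
      offset = offset + (off_t)readlen;
      readlen = readsize( offset, filesize, pagesize );
   }

   close( fd );
}

int main(int argc, char *argv[])
{
   if ( argc != 2 )
   {
      fprintf( stderr, "usage: %s logfile\n", argv[0] );
      return 1;
   }
   process( argv[1] );
   return 0;
}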
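
The program above maps a single page per mmap call, which means one mmap/munmap pair for every few kilobytes of log. A possible variation, sketched here under the assumption that fewer system calls would help (this has not been measured), is to map a larger window of several pages at a time. The function `count_lines_mapped` and its `window_pages` parameter are hypothetical names for illustration; it only counts newlines, so the field-selection logic would still have to be dropped into the inner loop.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Count the newlines in `path`, mapping `window_pages` pages per mmap call.
 * Returns the line count, or -1 on error.
 */
long long count_lines_mapped(const char *path, long window_pages)
{
    long pagesize = sysconf(_SC_PAGE_SIZE);
    off_t window;
    off_t offset = 0;
    struct stat st;
    long long lines = 0;
    int fd;

    assert(path != NULL && window_pages > 0);
    window = (off_t)pagesize * window_pages;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    while (offset < st.st_size) {
        off_t left = st.st_size - offset;
        size_t len = (size_t)(left < window ? left : window);
        const char *p;
        const char *end;
        /* offset stays a multiple of the page size because every window
           except the last one is a whole number of pages */
        char *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, offset);

        if (map == MAP_FAILED) {
            close(fd);
            return -1;
        }
        for (p = map, end = map + len; p < end; p++)
            if (*p == '\n')
                lines++;

        munmap(map, len);
        offset += (off_t)len;
    }

    close(fd);
    return lines;
}
```

Calling it with `window_pages` set to 1 reproduces the one-page-at-a-time behaviour, so the two settings can be timed against each other on a real log before deciding whether the larger window is worth it.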
 
Old 04-07-2009, 10:47 PM   #57
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244Reputation: 244Reputation: 244
Quote:
Originally Posted by vache
I've done benchmarks using perl, ruby, php, and python scripts (all were slower than awk).
You can try mmap. Both Perl and Python have mmap support.
 
Old 04-07-2009, 10:51 PM   #58
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244Reputation: 244Reputation: 244
Quote:
Originally Posted by crabboy
Please try to stay on topic. This thread is completely polluted with BS.
Are you able to delete all that BS?
 
  



Tags
ascii, awk, fgets, fopen, parse


