Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game. |
04-07-2009, 10:13 AM
|
#46
|
|
LQ Newbie
Registered: Apr 2009
Posts: 5
Original Poster
Rep:
|
Oh, a little more info about the logs:
The log files are cycled out daily. There are ~100 of them, ranging in size from a few megabytes to over 10GB, all raw text. They all have the same number of columns. The data I wanted out of each file were specific columns, regardless of the actual content inside each column (hence the simple awk statement).
I've done benchmarks using perl, ruby, php, and python scripts (all were slower than awk). My C app is about 1 second faster than awk on the same hardware, which really isn't any difference at all.
I just wanted to know if there was an alternative to the well-known (f|safe)read() functions specifically for large files, like the kind I was dealing with, which there isn't.
Thanks for the interesting feedback.
Last edited by vache; 04-07-2009 at 10:16 AM.
|
|
|
|
04-07-2009, 10:20 AM
|
#47
|
|
LQ Newbie
Registered: Apr 2009
Posts: 17
Rep:
|
Quote:
Originally Posted by int0x80
You wish C# could be as good as Java. Good luck with your Xtra Proprietary OS.
|
Windows has Java as well. That's why people create in Java, because they know there is no money in Linux so they want it to work in a real operating system.
|
|
|
|
04-07-2009, 10:22 AM
|
#48
|
|
Member
Registered: Sep 2002
Location: Cincinnati
Distribution: Debian GNU/Linux
Posts: 310
Rep:
|
Quote:
Originally Posted by jglands
Windows has Java as well. That's why people create in Java, because they know there is no money in Linux so they want it to work in a real operating system.
|
So you just explained why C# is unnecessary and another way for MS to rip people off.
FAIL
|
|
|
|
04-07-2009, 10:25 AM
|
#49
|
|
LQ Newbie
Registered: Apr 2009
Posts: 17
Rep:
|
Quote:
Originally Posted by int0x80
So you just explained why C# is unnecessary and another way for MS to rip people off.
FAIL
|
You're just jealous because your coding in Linux isn't making you a dime.
WIN
|
|
|
|
04-07-2009, 10:30 AM
|
#50
|
|
Member
Registered: Sep 2002
Location: Cincinnati
Distribution: Debian GNU/Linux
Posts: 310
Rep:
|
McDonalds took the money they were wasting on MS licenses and gave it to me to write code on Linux.
WIN
|
|
|
|
04-07-2009, 10:31 AM
|
#51
|
|
Senior Member
Registered: Jul 2005
Distribution: Slackware
Posts: 2,006
Rep: 
|
Can you modify the app that generates the logs? If so, then the best option might not be to parse the huge logs after the fact, but to have the app generate the simplified logs you want.
If you can't modify the app, you can possibly create a small daemon that watches the log file for changes and parses a line as soon as it is written. This approach doesn't actually save you any total processing time, but does spread it out so it doesn't bog the machine down.
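As a rough illustration of the watch-and-parse idea, here is a minimal sketch in plain C. The helper name `next_new_line` and the polling approach are assumptions made for this example, not something posted in the thread. It remembers how far into the log it has read and only hands back lines that are complete, so a line the writer is still in the middle of is retried later.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from the thread): fetch the next complete line
 * appended to the log since byte offset *pos.
 * Returns 1 and advances *pos when a full line is in buf,
 * 0 when no complete line is available yet, -1 on error.
 * Limitations of this sketch: a line longer than bufsz - 1 bytes is never
 * reported, and a long offset limits file size where long is 32 bits.
 */
int next_new_line(FILE *fp, long *pos, char *buf, size_t bufsz)
{
    clearerr(fp);                        /* forget an earlier EOF so new data is seen */
    if (fseek(fp, *pos, SEEK_SET) != 0)
        return -1;
    if (fgets(buf, (int)bufsz, fp) == NULL)
        return ferror(fp) ? -1 : 0;      /* nothing new yet, or a read error */

    size_t len = strlen(buf);
    if (len == 0 || buf[len - 1] != '\n')
        return 0;                        /* partial line: writer is mid-write, retry later */

    *pos += (long)len;                   /* consume the line */
    return 1;
}
```

A daemon built on this would call the helper in a loop, parse each returned line, and sleep briefly (or block on a file-change notification such as inotify on Linux) whenever it returns 0.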
|
|
|
|
04-07-2009, 10:32 AM
|
#52
|
|
Senior Member
Registered: May 2005
Posts: 4,413
|
Quote:
Originally Posted by tuxdev
Can you modify the app that generates the logs? If so, then the best option might not be to parse the huge logs after the fact, but to have the app generate the simplified logs you want.
If you can't modify the app, you can possibly create a small daemon that watches the log file for changes and parses a line as soon as it is written. This approach doesn't actually save you any total processing time, but does spread it out so it doesn't bog the machine down.
|
Exactly my point.
|
|
|
|
04-07-2009, 10:35 AM
|
#53
|
|
Senior Member
Registered: May 2005
Posts: 4,413
|
By the way, if the number of whitespace characters between columns is constant, 'cut' (yes, 'cut', not 'cat') can be used. Internally it should be simpler than 'awk', so it may be faster.
Anyway, that won't solve the IO speed problem.
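To show why the constant-spacing condition matters, here is a small C sketch of the two splitting rules. The function names `field_cut` and `field_awk` are made up for this example, and the second one only approximates awk's default field splitting. With cut-style splitting every single delimiter ends a field, so two spaces in a row produce an empty field. With awk-style splitting a run of whitespace counts as one separator.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy at most outsz - 1 bytes of a field into out (outsz must be > 0). */
static void copy_field(const char *start, size_t len, char *out, size_t outsz)
{
    if (len >= outsz)
        len = outsz - 1;
    memcpy(out, start, len);
    out[len] = '\0';
}

/* cut-style: each occurrence of delim ends a field. n is 1-based.
   Returns 1 if field n exists (it may be empty), 0 otherwise. */
int field_cut(const char *line, char delim, int n, char *out, size_t outsz)
{
    const char *start = line;
    for (int i = 1; i < n; i++) {
        start = strchr(start, delim);
        if (start == NULL)
            return 0;
        start++;                              /* step past the delimiter */
    }
    size_t len = 0;
    while (start[len] != '\0' && start[len] != delim && start[len] != '\n')
        len++;
    copy_field(start, len, out, outsz);
    return 1;
}

/* awk-style (approximation): runs of whitespace separate fields and
   leading whitespace is ignored. n is 1-based. */
int field_awk(const char *line, int n, char *out, size_t outsz)
{
    const char *p = line;
    for (int i = 1; ; i++) {
        while (*p != '\0' && isspace((unsigned char)*p))
            p++;                              /* skip the separator run */
        if (*p == '\0')
            return 0;                         /* fewer than n fields */
        const char *start = p;
        while (*p != '\0' && !isspace((unsigned char)*p))
            p++;
        if (i == n) {
            copy_field(start, (size_t)(p - start), out, outsz);
            return 1;
        }
    }
}
```

For the line "a  b c" (two spaces after the a), the cut-style rule reports an empty field 2 and finds b in field 3, while the awk-style rule finds b in field 2. With single spacing both rules agree, which is the case where 'cut' is a drop-in replacement.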
|
|
|
|
04-07-2009, 11:01 AM
|
#54
|
|
LQ Newbie
Registered: Apr 2009
Posts: 17
Rep:
|
Why didn't you write the script in Fortran?
|
|
|
|
04-07-2009, 09:32 PM
|
#55
|
|
LQ Newbie
Registered: Apr 2009
Posts: 17
Rep:
|
I knew you would surrender.
It was my intense debating skills. If I can make Microsoft look good I can make anyone look good. ;-)
|
|
|
|
04-07-2009, 10:15 PM
|
#56
|
|
Moderator
Registered: Feb 2001
Location: Atlanta, GA
Distribution: Slackware
Posts: 1,817
Rep: 
|
Please try to stay on topic. This thread is completely polluted with BS.
On topic:
The inner loop of your program that uses strtok is very inefficient, and it is really killing your program's performance. I wrote a short program that uses read to grab chunks of data and pointers to scan the buffer, and it still could not top the performance I was getting with awk. I think the data copy into the buffer was slowing it down, so I switched to mmap. With mmap I was able to top awk's performance.
Code:
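#include <stdio.h>
#include <ctype.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

/* Length of the next chunk to map: one page, or whatever is left of the file. */
size_t readsize(off_t offset, off_t filesize, long pagesize)
{
    off_t left = filesize - offset;
    if ( left > pagesize )
        return (size_t)pagesize;
    else
        return (size_t)left;
}

int process( const char * pszFileName )
{
    int fd;
    void * pmap;
    char * pBuf;
    char * pEnd;
    int fieldcount = 0;
    char ch = '\0';
    char lastChar = '\0';
    /* off_t rather than int: the logs can be larger than 2GB
       (on 32-bit systems, build with -D_FILE_OFFSET_BITS=64) */
    off_t filesize = 0;
    off_t offset = 0;
    size_t readlen = 0;
    long pagesize = sysconf(_SC_PAGE_SIZE);

    fd = open( pszFileName, O_RDONLY );
    if ( fd < 0 )
    {
        perror( "open" );
        return 1;
    }
    filesize = lseek( fd, 0, SEEK_END );
    if ( filesize < 0 )
    {
        perror( "lseek" );
        close( fd );
        return 1;
    }
    readlen = readsize( 0, filesize, pagesize );
    while ( readlen > 0 )
    {
        /* offset stays page-aligned because it advances one page at a time */
        pmap = mmap( 0, readlen, PROT_READ, MAP_SHARED, fd, offset );
        if ( pmap == MAP_FAILED )
        {
            perror( "mmap" );
            close( fd );
            return 1;
        }
        pBuf = (char *)pmap;
        pEnd = (char *)pmap + readlen;
        while ( pBuf < pEnd )
        {
            ch = *pBuf++;
            if ( ch == '\n' )
            {
                putchar( '\n' );
                fieldcount = 0;
                lastChar = '\0';
                continue;
            }
            /* a non-space after a space starts the next field */
            else if ( !isspace((unsigned char)ch) && isspace((unsigned char)lastChar) )
                fieldcount++;
            /* collapse runs of whitespace */
            else if ( isspace((unsigned char)ch) && isspace((unsigned char)lastChar) )
                continue;
            /* 1-based numbers of the fields to print */
            switch ( fieldcount + 1 )
            {
                case 1:
                case 2:
                case 4:
                case 6:
                case 11:
                case 12:
                case 13:
                    putchar( ch );
                    break;
            }
            lastChar = ch;
        }
        munmap( pmap, readlen );
        offset = offset + readlen;
        readlen = readsize( offset, filesize, pagesize );
    }
    close( fd );
    return 0;
}

int main(int argc, char *argv[])
{
    if ( argc != 2 )
    {
        fprintf( stderr, "usage: %s logfile\n", argv[0] );
        return 1;
    }
    return process( argv[1] );
}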
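One detail of the loop above is easy to miss: the field counter and the previous character live outside the per-page loop, so a field that straddles two mapped pages is still counted once. The sketch below makes that explicit by putting the scanner state in a struct and feeding it one chunk at a time. The names `field_state` and `feed_chunk`, and the bitmask used to pick fields, are assumptions for this example and not part of the posted program. The same function would work on buffers filled by read() as well as on mmap'd pages.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Scanner state that must survive from one chunk to the next. */
struct field_state {
    int fieldcount;   /* zero-based index of the field being scanned */
    char last;        /* previous character, '\0' at the start of a line */
};

/*
 * Hypothetical helper: scan one chunk and echo the characters of every
 * wanted field to out. Bit n-1 of wanted selects field n; this sketch
 * only supports fields 1 through 32.
 */
void feed_chunk(struct field_state *st, const char *buf, size_t len,
                unsigned long wanted, FILE *out)
{
    const char *p = buf;
    const char *end = buf + len;

    while (p < end) {
        char ch = *p++;
        if (ch == '\n') {                 /* end of record: reset the state */
            fputc('\n', out);
            st->fieldcount = 0;
            st->last = '\0';
            continue;
        }
        int sp = isspace((unsigned char)ch);
        int lastsp = isspace((unsigned char)st->last);
        if (!sp && lastsp)
            st->fieldcount++;             /* first character of the next field */
        else if (sp && lastsp)
            continue;                     /* collapse runs of whitespace */
        if (st->fieldcount < 32 && ((wanted >> st->fieldcount) & 1UL))
            fputc(ch, out);
        st->last = ch;
    }
}
```

Because all state is carried in the struct, the chunk size becomes a tuning knob that does not affect the output. As in the posted program, the whitespace that follows a selected field is echoed along with it.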
|
|
|
|
04-07-2009, 10:47 PM
|
#57
|
|
Senior Member
Registered: Aug 2006
Posts: 2,695
|
Quote:
Originally Posted by vache
I've done benchmarks using perl, ruby, php, and python scripts (all were slower than awk).
|
You can try mmap. It is available from both Perl and Python.
|
|
|
|
04-07-2009, 10:51 PM
|
#58
|
|
Senior Member
Registered: Aug 2006
Posts: 2,695
|
Quote:
Originally Posted by crabboy
Please try to stay on topic. This thread is completely polluted with BS.
|
Are you able to delete all of that BS?
|
|
|
|