Old 09-06-2006, 01:57 AM   #1
hedpe
Member
 
Registered: Jan 2005
Location: Pittsburgh
Distribution: Ubuntu
Posts: 378

Rep: Reputation: 30
need help performance optimizing this perl script


Hey all,

I have about 400GB of compressed data, which consists of 25,000 files with about 30,000 lines each. I need to process every line of every file, and what I do per line is simple, like incrementing a counter. The disk is connected via USB... so I would expect this type of application to be I/O bound on a recent machine (3.4 GHz Pentium).

However, timing my application shows it's processor-bound:
Code:
real    0m49.125s   
user    0m48.900s   
sys     0m0.220s
So I've done a little profiling:
Code:
bash-2.05b$ dprofpp
Total Elapsed Time = 50.53591 Seconds
  User+System Time = 50.53591 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 14.8   7.480  7.018 923304   0.0000 0.0000  Compress::Zlib::gzFile::gzreadline
 0.21   0.106  0.132      1   0.1058 0.1316  File::Find::_find_dir
 0.06   0.030  0.026   8426   0.0000 0.0000  main::__ANON__
 0.06   0.030  0.070      4   0.0075 0.0174  main::BEGIN
 0.02   0.010  0.010      1   0.0100 0.0100  Exporter::Heavy::heavy_export
 0.02   0.010  0.020      3   0.0033 0.0067  utf8::SWASHNEW
 0.02   0.010  0.010      3   0.0033 0.0033  warnings::register::import
 0.02   0.010  0.010      4   0.0025 0.0025  IO::BEGIN
 0.00   0.000 -0.000     10   0.0000      -  strict::unimport
 0.00   0.000 -0.000      1   0.0000      -  AutoLoader::import
 0.00   0.000  0.010     12   0.0000 0.0008  Exporter::import
 0.00   0.000  0.010      7   0.0000 0.0014  IO::Handle::BEGIN
 0.00   0.000 -0.000      1   0.0000      -  Symbol::BEGIN
 0.00   0.000 -0.000      2   0.0000      -  SelectSaver::BEGIN
 0.00   0.000 -0.000      1   0.0000      -  warnings::BEGIN
The majority of the time goes to reading lines from the compressed file, which I am doing via the Compress::Zlib module.

I was hoping that someone could help me optimize the runtime of this program and push it back toward the I/O-bound side, where I would expect it to be. At roughly 50 seconds a file and 25,000 files, we're looking at about two weeks. Is this a one-time pass? Sure. Do I have a deadline in two weeks? No. But I will have to write lots of similar applications to extract information from this data, so in the long run it would save me some time.

So I'd greatly appreciate any suggestions. Yes, the code is ugly because I just learned Perl this week, so of course readability suggestions are welcome too... I'm just trying to get across that there's no need for "omg look at this code!!!!!!" Yes, I know.

Thanks
George

Code:
#!/usr/bin/perl
use lib "$ENV{HOME}/myperl/lib";
use Compress::Zlib;
use File::Find; 
use strict;
use warnings;
 
my %ips=();
my %ports=();
my @files; 
 
File::Find::find({wanted=>sub{ -f $_ and push @files, $_ }}, '/mnt/campus-2005-1TB/CAMPUS-2005-BKP/Data/archive/2005/02/');
 
for my $file (sort @files) {
  my @core= split(/\./, $file);
  my $fullname="/mnt/campus-2005-1TB/CAMPUS-2005-BKP/Data/archive/$core[1]/$core[2]/$core[3]/$core[4]/core-full.$core[1].$core[2].$core[3].$core[4].$core[5].gz";
 
  print "$file\n";
 
  # Now we open up the data file
  my $gz = gzopen($fullname, "rb") 
    or die "Cannot open $file: $gzerrno\n" ;
 
  # Read in the lines, split them by whitespace, and add up counts
  while($gz->gzreadline(my $line) > 0) {  # Read in the lines from the blinc file
    my @tokens = split(' ', $line);
    my $size = scalar @tokens;
 
    if($size != 11 || $tokens[0] eq "StartTime") {  # Make sure we have all the fields we need
      next;
    } 
 
    # Extract the IP address and port from the x.x.x.x.x format, which is horrible
    my @left = split(/\./, $tokens[3]);
    my @right = split(/\./, $tokens[5]);
    my $left_ip = "$left[0].$left[1].$left[2].$left[3]";
    my $left_port="";
    my $right_ip = "$right[0].$right[1].$right[2].$right[3]";
    my $right_port="";
 
    if(exists $left[4]) {
      $left_port = "$left[4]";
      if(!exists $ports{"$left_port"}) { $ports{"$left_port"}[0]=0; $ports{"$left_port"}[1]=0; };
      $ports{"$left_port"}[0]++;
    }
    if(exists $right[4]) {
      $right_port = "$right[4]";
      if(!exists $ports{"$right_port"}) { $ports{"$right_port"}[0]=0; $ports{"$right_port"}[1]=0; };
      $ports{"$right_port"}[1]++;
    }
 
    if(!exists $ips{"$left_ip"}) { $ips{"$left_ip"}[0]=0; $ips{"$left_ip"}[1]=0; };
    $ips{"$left_ip"}[0]++;
    if(!exists $ips{"$right_ip"}) { $ips{"$right_ip"}[0]=0; $ips{"$right_ip"}[1]=0; };
    $ips{"$right_ip"}[1]++;
  } 
 
  die "Error reading from $file: $gzerrno\n"
    if $gzerrno != Z_STREAM_END ;
 
  $gz->gzclose() ;
 
  open(FILEOUT, ">>$file-raw_addr") or die "can't open\n";
  foreach my $key (keys %ips) {
    print FILEOUT "$key $ips{$key}[0] $ips{$key}[1]\n";
  }
  close(FILEOUT);
 
  open(FILEOUT, ">>$file-raw_ports") or die "can't open\n";
  foreach my $key (keys %ports) {
    print FILEOUT "$key $ports{$key}[0] $ports{$key}[1]\n";
  }
  close(FILEOUT);

  exit;  # Let's just exit after 1 file for benchmarking
}
 
Old 09-06-2006, 02:49 AM   #2
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,985
Blog Entries: 11

Rep: Reputation: 879
I don't know how gzreadline goes about retrieving lines from
the compressed file, or how big the files actually are. If you
have plenty of RAM, you could try reading the lot (the whole
file), decompressing it in one go, and then operating on the
lines in memory, which should replace the many calls to
gzreadline with an operation that doesn't require much CPU.
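
Something along these lines might do it (an untested sketch; it
takes the path of one .gz file as its argument). The idea is to
read with gzread in big chunks instead of calling gzreadline per
line, and then split on newlines:
Code:
#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib;

my $fullname = shift or die "usage: $0 file.gz\n";

my $gz = gzopen($fullname, "rb") or die "Cannot open $fullname: $gzerrno\n";
my ($data, $buf) = ('', '');
$data .= $buf while $gz->gzread($buf, 1024 * 1024) > 0;   # 1 MB chunks
die "Error reading $fullname: $gzerrno\n" if $gzerrno != Z_STREAM_END;
$gz->gzclose();

my @lines = split /\n/, $data;   # one array element per line
print scalar(@lines), " lines now in memory\n";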

Decompression is, of course, CPU-intensive by its very nature.


Cheers,
Tink
 
Old 09-06-2006, 06:20 AM   #3
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 718

Rep: Reputation: 72
Hi.
Quote:
What I need to do for every single line is simple, like incrementing a counter.
Although there are always improvements we can make in code, the biggest improvements come from algorithm changes. Your requirement here is very simple, though: incrementing a counter is about as basic as it gets, and I don't see any way of doing better in that area. Perl is quite good at doing most things and compares favorably to other languages, especially considering the really expensive part of computing -- the human time.
Quote:
But I will have to make lots of similar applications to extract information from this data, and in the long run, it would save me some time.
We can usually trade time for space. So if the decompression is taking a lot of CPU time, decompress the files, leave them uncompressed for the few weeks until you are finished with your analyses, and then re-compress them.
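
For example, something like this rough, untested sketch would expand an archive tree in one pass (the path is a placeholder, and check first that the disk has room for the uncompressed data):
Code:
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# One-off helper: expand every .gz file under the archive in place.
my @gzfiles;
find(sub { -f $_ && /\.gz$/ and push @gzfiles, $File::Find::name },
     '/path/to/archive');                    # placeholder path

foreach my $f (@gzfiles) {
    system('gunzip', $f) == 0 or warn "gunzip failed on $f\n";
}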

I have been recommending Damian Conway's book, Perl Best Practices, to people, but if this is really a one-shot project, then I wouldn't bother.

Best wishes ... cheers, makyo

Last edited by makyo; 09-06-2006 at 06:22 AM.
 
Old 09-06-2006, 09:53 AM   #4
hedpe
Member
 
Registered: Jan 2005
Location: Pittsburgh
Distribution: Ubuntu
Posts: 378

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by Tinkster
I don't know how gzreadline goes about retrieving lines from
the compressed file, or how big the files actually are. If you
have plenty of RAM you could try to read the lot (the whole file,
decompress it, and then operate on the lines in memory) which
should replace the calls to gzreadline with an operation that
doesn't require much CPU.
I like this suggestion a lot. However, I am not sure how to do this with Compress::Zlib. I just want to read all of the lines into an array, with each element being one line.

Their manual has something like this:
http://search.cpan.org/~pmqs/Compress-Zlib-1.42/Zlib.pm
Code:
print $buffer while $gz->gzread($buffer) > 0;
This does not read everything in at once, right? Maybe I need to push this into an array?
 
Old 09-06-2006, 12:00 PM   #5
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 718

Rep: Reputation: 72
Hi.

The embedded Compress read probably has a lot of overhead.

I compared the gzopen/read, etc., with a straight read of an uncompressed file, and then a gunzip of a compressed file piped into the straight read. The results:
Code:
% ./compare
 Compressed read:

real    0m0.036s
user    0m0.023s
sys     0m0.008s
 206 lines.

 UN-compressed read:

real    0m0.007s
user    0m0.006s
sys     0m0.001s
 206 lines.

 compressed read by gunzip, then piped:

real    0m0.008s
user    0m0.006s
sys     0m0.002s
 206 lines.
This suggests to me that the decompression should be done outside the perl code. Not only does it appear to be considerably faster, you also don't need to deal with the added complexity of Compress in the perl script.
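
For example, reading through a pipe from an external gunzip looks something like this (an untested sketch; it takes the .gz path as its argument):
Code:
#!/usr/bin/perl
use strict;
use warnings;

my $fullname = shift or die "usage: $0 file.gz\n";

# Let an external gunzip do the decompression; perl just reads plain lines.
# (2-arg pipe open; fine as long as the path has no shell metacharacters.)
open(my $fh, "gunzip -c $fullname |")
  or die "Cannot start gunzip on $fullname: $!\n";

my $count = 0;
while (my $line = <$fh>) {
    $count++;                     # per-line processing would go here
}
close($fh) or die "gunzip reported an error for $fullname\n";
print " $count lines.\n";
Not a big script, and it keeps Compress out of the picture entirely ... cheers, makyo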
 
Old 09-06-2006, 05:05 PM   #6
hedpe
Member
 
Registered: Jan 2005
Location: Pittsburgh
Distribution: Ubuntu
Posts: 378

Original Poster
Rep: Reputation: 30
Could you post your "compare" script, so I can run it on my machine and see how the different methods perform with my disk?
 
Old 09-06-2006, 05:14 PM   #7
hedpe
Member
 
Registered: Jan 2005
Location: Pittsburgh
Distribution: Ubuntu
Posts: 378

Original Poster
Rep: Reputation: 30
You know what, though: just reading through the file with gzopen only takes about 5 seconds:

Code:
bash-2.05b$ time ./test 

real    0m5.190s
user    0m5.160s
sys     0m0.030s
test:
Code:
#!/usr/bin/perl
use lib "$ENV{HOME}/myperl/lib";
use Compress::Zlib;
use File::Find; 
use strict;
use warnings;

        my $file="core-full.2005.02.01.00.00.gz";
        my $fullname="/mnt/campus-2005-1TB/CAMPUS-2005-BKP/Data/archive/2005/02/01/00/core-full.2005.02.01.00.00.gz";

        my %ips=();
        my %ports=();

        # Now we open up the data file
        my $gz = gzopen($fullname, "rb") 
                or die "Cannot open $fullname: $gzerrno\n" ;

        # Just read the lines; no parsing in this stripped-down test
        while($gz->gzreadline(my $line) > 0) {
        }
        
        die "Error reading from $fullname: $gzerrno\n" 
                if $gzerrno != Z_STREAM_END ;

        $gz->gzclose() ;
If I modify "test" to include my parsing code, the runtime skyrockets:
Code:
bash-2.05b$ time ./test 

real    0m56.542s
user    0m56.400s
sys     0m0.140s
The new version of test:
Code:
#!/usr/bin/perl
use lib "$ENV{HOME}/myperl/lib";
use Compress::Zlib;
use File::Find;
use strict;
use warnings;

  my $file="core-full.2005.02.01.00.00.gz";
  my $fullname="/mnt/campus-2005-1TB/CAMPUS-2005-BKP/Data/archive/2005/02/01/00/core-full.2005.02.01.00.00.gz";

  my %ips=();
  my %ports=();

  # Now we open up the data file
  my $gz = gzopen($fullname, "rb")
    or die "Cannot open $fullname: $gzerrno\n" ;

  # Read in the lines, split them by whitespace, and add up counts
  while($gz->gzreadline(my $line) > 0) {  # Read in the lines from the blinc file
    my @tokens = split(' ', $line);
    my $size = scalar @tokens;

    if($size != 11 || $tokens[0] eq "StartTime") {  # Make sure we have all the fields we need
      next;
    }
    
    # Extract the IP address and port from the x.x.x.x.x format, which is horrible
    my @left = split(/\./, $tokens[3]);
    my @right = split(/\./, $tokens[5]);
    my $left_ip = "$left[0].$left[1].$left[2].$left[3]";
    my $left_port="";
    my $right_ip = "$right[0].$right[1].$right[2].$right[3]";
    my $right_port="";

    if(exists $left[4]) {
      $left_port = "$left[4]";
      if(!exists $ports{"$left_port"}) { $ports{"$left_port"}[0]=0; $ports{"$left_port"}[1]=0; };
      $ports{"$left_port"}[0]++;
    }
    if(exists $right[4]) {
      $right_port = "$right[4]";
      if(!exists $ports{"$right_port"}) { $ports{"$right_port"}[0]=0; $ports{"$right_port"}[1]=0; };
      $ports{"$right_port"}[1]++;
    }                                                                                         
    
    if(!exists $ips{"$left_ip"}) { $ips{"$left_ip"}[0]=0; $ips{"$left_ip"}[1]=0; };
    $ips{"$left_ip"}[0]++;
    if(!exists $ips{"$right_ip"}) { $ips{"$right_ip"}[0]=0; $ips{"$right_ip"}[1]=0; };
    $ips{"$right_ip"}[1]++;
  }
  
  die "Error reading from $fullname: $gzerrno\n"
    if $gzerrno != Z_STREAM_END ;
  
  $gz->gzclose() ;
 
Old 09-06-2006, 09:36 PM   #8
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 718

Rep: Reputation: 72
Hi, hedpe.

Well, doing a rough comparison between your timed read and mine (assuming our computers are somewhat similar), you have perhaps 200,000 lines in the file. So processing that many lines simply takes what it takes.

How many lines are in your test file?

You can shave off some of the time by not using Compress, as you saw from my benchmark.

I didn't notice any use of more than 6 tokens, so you could limit the split to see if that makes a difference.
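
For example (untested, and note that it loosens your "exactly 11 fields" sanity check to "at least 7"):
Code:
# Only fields 0, 3 and 5 are used, so stop splitting after 7 pieces;
# the 7th piece just soaks up the rest of the line.
my @tokens = split(' ', $line, 7);
next if @tokens < 7 or $tokens[0] eq "StartTime";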

I don't think there is much global optimization in perl, so you might want to look over the arithmetic that you do to avoid repeated calculations, make sure you are keeping values in variables, etc.

I suggest you calculate some figure of merit, for example, CPU time / line, so that you can see how changes will affect the figure.
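
For example, at the ~56 CPU-seconds from your last run and my guess of roughly 200,000 lines, that would be around 0.3 milliseconds per line; if the file turns out to be larger, the per-line figure drops proportionally.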

Other than that, the time is what it is, and you might just have to live with that ... cheers, makyo
 
Old 09-07-2006, 01:40 AM   #9
hedpe
Member
 
Registered: Jan 2005
Location: Pittsburgh
Distribution: Ubuntu
Posts: 378

Original Poster
Rep: Reputation: 30
Hey makyo, thanks for all the help.

My test file is much larger than that, and it is typical of the files I am reading from: 923,303 lines.

I was able to shave off ~3 seconds with your suggestion, which makes me strongly believe it's the actual line processing that is the hog, and maybe not the decompression/reading of the lines.

I broke down and coded it in C; the exact same functionality with GLib hashes takes my runtime down from ~55 seconds to 9 seconds.

That's a huge difference... I am using external decompression in both environments.

Maybe Perl just wasn't made for this type of parsing?

Of course, the size of the C code is huge compared to the Perl: 425 lines versus 75.

Last edited by hedpe; 09-07-2006 at 01:43 AM.
 
Old 09-07-2006, 03:34 AM   #10
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,985
Blog Entries: 11

Rep: Reputation: 879
I'd be interested to see some of the data, and which bits you really need.
Maybe your processing could be optimised as well.
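
In the meantime, here's an untested sketch of the kind of thing I mean: let perl autovivify the counters instead of doing the explicit exists checks (it assumes the same field layout as your script and takes the .gz path as its argument):
Code:
#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib;

my (%ips, %ports);
my $file = shift or die "usage: $0 file.gz\n";
my $gz = gzopen($file, "rb") or die "Cannot open $file: $gzerrno\n";

while ($gz->gzreadline(my $line) > 0) {
    my @tokens = split(' ', $line);
    next if @tokens != 11 or $tokens[0] eq "StartTime";

    my ($l1, $l2, $l3, $l4, $lport) = split(/\./, $tokens[3]);
    my ($r1, $r2, $r3, $r4, $rport) = split(/\./, $tokens[5]);

    # Autovivification creates the entries on first use; untouched counters
    # stay undef, so print them later as ($ips{$key}[1] || 0).
    $ips{"$l1.$l2.$l3.$l4"}[0]++;
    $ips{"$r1.$r2.$r3.$r4"}[1]++;
    $ports{$lport}[0]++ if defined $lport;
    $ports{$rport}[1]++ if defined $rport;
}
die "Error reading $file: $gzerrno\n" if $gzerrno != Z_STREAM_END;
$gz->gzclose();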


Cheers,
Tink
 
Old 09-07-2006, 06:06 AM   #11
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 718

Rep: Reputation: 72
Hi, hedpe.

Re-writing it in C should certainly help. Given the amount of data you have, that seems to be a good solution, especially because your needs for processing are relatively simple.

I think that Perl is right for this type of work, but perhaps not for this volume of work. I hardly ever get to use this piece of information, but in one study done on languages, Perl was found to cause about 500K machine instructions to be executed for each line of Perl. Other languages get much closer to the machine, so, as you found with C, the number gets much smaller than 500K. The correspondence for C is not 1-for-1 as it might be in assembly, but it is small. So, while Perl is suitable for many tasks -- even including this one -- you can almost always do better if you have the time and the skills, and, for your good fortune, you have those skills. You have found the balance of resources that seems right for this problem.

There are some ways of combining C and Perl, but that is quite complex at the moment (said to be easier in the next version of Perl).
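
For instance, the Inline::C module on CPAN lets you embed C in a perl script. This is only a toy sketch (the function below is made up for illustration and has nothing to do with your data format):
Code:
#!/usr/bin/perl
use strict;
use warnings;

# Inline compiles the C below on the first run and caches the result.
use Inline C => <<'END_C';
int count_fields(char *line) {
    int n = 0, in_tok = 0;
    for ( ; *line; line++) {
        if (*line == ' ' || *line == '\t' || *line == '\n')
            in_tok = 0;
        else if (!in_tok) {
            in_tok = 1;
            n++;
        }
    }
    return n;
}
END_C

print count_fields("StartTime 1 2 3.4.5.6.80 -> 7.8.9.10.443 x y z a b"), "\n";   # prints 11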

So, from my point of view, you have solved your problem, and, as with many problems, we have all learned something. It is a nice illustration of scaling: a solution does not always scale up as gracefully as we might like when the size of the problem grows.

Best wishes ... cheers, makyo

Last edited by makyo; 09-07-2006 at 06:12 AM.
 
  

