LinuxQuestions.org
View the Most Wanted LQ Wiki articles.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices

Reply
 
LinkBack Search this Thread
Old 08-08-2008, 05:27 AM   #1
gaynut
LQ Newbie
 
Registered: Jan 2008
Posts: 27

Rep: Reputation: 15
Question PERL pattern matching while reading a file content is dead slow


Hi,
I have a piece of code, that is slowing down my entire program.
The patterns are stored in a file called "alloc_ids".They resemble "ALLOC_xxxx". I have to match every allocid against a list of files and if found , write a "fprintf" statement in the file.

The code is:
sub write_file {

my ($filename,$alloc_mem)=@_[0,1];
$alloc_mem =~ s/\s+//g;
print "$alloc_mem \t $filename";
open (FH,"$dir_path/$filename" || die "file not found");
open (FW,">$dir_path/$filename.swp");
while ( $line = <FH> ) {
print FW "$line";
if ( $line =~ /$alloc_mem/ ) {
print FW "fprintf \(stderr\,\"Accessing Allocid \%x\"\,$alloc_mem\)\;";
}
}

}


Problem:

When a file doesn't contain a pattern ALLOC_xxx at all ,the process time is very quick. However, when it contains a pattern ALLOC_xxx which may not exactly match the "$alloc_mem", the code is very slow.
I feel, reading every line in the file and matching the contents against a pattern makes it dead slow. Runs for hours together.
Although, my file contents are huge, is there no option to run my script fast?

Is there any way to overcome this?
Any other way of writing the script?


Thanks,
~ gaynut
 
Old 08-09-2008, 06:16 AM   #2
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu (x86), Debian (PPC)
Posts: 3,528

Rep: Reputation: 60
Quote:
Any other way of writing the script?
Remember the Perl mantra: TMTOWTDI.

I don't see anything in your code which should take a particularly long time to execute. Maybe reading from this file is slow because the IO is overloaded? Is the file slow to read by other programs (e.g. grep it for the same pattern and see if that takes the same time).
 
Old 08-09-2008, 08:38 AM   #3
estabroo
Senior Member
 
Registered: Jun 2008
Distribution: debian, ubuntu, sidux
Posts: 1,051
Blog Entries: 2

Rep: Reputation: 93
I'd bet its the prints that are slowing you down. Instead of doing the print to stderr, accumulate the string in an array, then after you've finished the file print the array.

Or would that not work, looking at your code a second time I see your actually printing that line to the FW.

Last edited by estabroo; 08-09-2008 at 08:40 AM. Reason: second thoughts
 
Old 08-09-2008, 09:24 AM   #4
ne pas
Member
 
Registered: Jul 2008
Posts: 52

Rep: Reputation: 21
Quote:
Originally Posted by estabroo View Post
I'd bet its the prints that are slowing you down. Instead of doing the print to stderr, accumulate the string in an array, then after you've finished the file print the array.
No, it's the RegExp, it gets re-compiled on every loop-iteration. But please don't keep the whole output in memory unless it's not otherwise possible.

Code:
open (FH,"$dir_path/$filename" || die "file not found");
here, the script will never die. It should be

Code:
open (FH, "$dir_path/$filename") || die "File not found\n";
or better use "or" instead of "||", it has lower presedence,
but it would be much better to use the following, it's more readable:
Code:
if(open (FH, "$dir_path/$filename")) {
    # do something with FH
}
else {
    die "Error: Can't open file '$dir_path/$filename': $!\n";
}
or use unless, as you can see in code at the bottom.

Code:
if ( $line =~ /$alloc_mem/ ) {
    print FW "fprintf \(stderr\,\"Accessing Allocid \%x\"\,$alloc_mem\)\;";
}
The Pattern ($alloc_mem) gets re-compiled for every loop-iteration, use the
o-modifier or define the RegExp outside of the while-loop once, that should help.

Code:
if ( $line =~ /$alloc_mem/o ) { # o-modifier: compile pattern only once
    # ...
}
A full cleanup of your code, but beware, it's written on-the-fly (untested).
Code:
use FileHandle;
use File::Spec;

sub write_file {
    my($filename, $alloc_mem) = @_;

    $alloc_mem =~ s/\s+//g;

    print "$alloc_mem \t $filename";

    # concatenate dir_path and filename
    my $inputfile_path = File::Spec->catfile( $dir_path => $filename );

    my $inputfile_handle = FileHandle->new;
    
    unless($inputfile_handle->open($inputfile_path, "r")) {
        die "Error: Can't open file '$inputfile_path': $!\n";
    }

    my $outputfile_path = $inputfile_path . ".swp";

    my $outputfile_handle = FileHandle->new;

    unless($outputfile_handle->open($outputfile_path), "w") {
        die "Error: Can't open file '$outputfile_path': $!\n";
    }

    # pre-define regexp
    my $RE =~ qr/$alloc_mem/;

    while(defined(my $line = $inputfile_handle->getline)) {
        if($line =~ m/$RE/) {
            $outputfile_handle->print(qq{fprintf (stderr, "Accessing Allocid %x ", $alloc_mem);});
        }
    }
}

Last edited by ne pas; 08-09-2008 at 09:28 AM. Reason: code errors removed
 
Old 08-09-2008, 09:27 AM   #5
estabroo
Senior Member
 
Registered: Jun 2008
Distribution: debian, ubuntu, sidux
Posts: 1,051
Blog Entries: 2

Rep: Reputation: 93
Are you sure its the regex recompile, because that would happen regardless of whether the pattern occured in the file or not and the OP stated that if the pattern doesn't occur it goes very quickly.
 
Old 08-09-2008, 09:57 AM   #6
ne pas
Member
 
Registered: Jul 2008
Posts: 52

Rep: Reputation: 21
Quote:
Originally Posted by estabroo View Post
Are you sure its the regex recompile, because that would happen regardless of whether the pattern occured in the file or not and the OP stated that if the pattern doesn't occur it goes very quickly.
No, you're right I haven't read this, sorry for that, it's probably the print. But the OP can gain a speed-up if he follow my suggestions.

Code:
use strict;
use warnings;

use Benchmark;

timethese(10 => {
    'with-o' => sub
        {
            my @str = ('AAAA'..'ZZZZ');
            my $pattern = join '' => ('AAA'..'ZZZ');

            my $i = 0;
            for(@str) {
                if(/$pattern/o) {
                     $i++;
                }
            }
            print "$i\n";
        },

    'without-o' => sub
        {
            my @str = ('AAAA'..'ZZZZ');
            my $pattern = join '' => ('AAA'..'ZZZ');

            my $i = 0;
            for(@str) {
                if(/$pattern/) {
                    $i++;
                }
            }
            print "$i\n";
        }
});
 
Old 08-09-2008, 10:50 AM   #7
Su-Shee
Member
 
Registered: Sep 2007
Location: Berlin
Distribution: Slackware
Posts: 509

Rep: Reputation: 41
My idea would be discarding the greedy "+" which will involve some backtracking, because perl's regex engine would go forward to check wether or not something possibly matching will follow and than go back afterwards if not.

Maybe \s\s\s\s will help or \s{4} or +? (disable greediness of quantifiers). Or be more precise but \s - \s is an entire class of whitespaces to choose from.

Also consider the "compile just once" flag (m///o) for regexes, because you don't have any variable stuff in your match pattern, so it doesn't has to be re-evaluated every time it runs in the loop.

This line:

while(defined(my $line = $inputfile_handle->getline))

looks extremely un-perlish to me - maybe you can try the more typical

while(<FILEHANDLE>) here as you already did in your first example.

It contained $line = <FH> in your while loop, which is also unneccessary, because you don't need to assign the line from <FH> by hand.

Perl sets automatically $_ to one line at a time. Use $_ in the loop body to match the regex:

$_ =~ m/foobar/;

Alternatively you can use foreach $line (<FH>) - don't know wether or not this is faster, probably not.

So, try something like this:

Code:
while(<FH>) {
    print "FOOBAR" if /\s\s\s\s/o; #assumes $_ automatically
}

Last edited by Su-Shee; 08-09-2008 at 10:52 AM.
 
Old 08-09-2008, 11:42 AM   #8
ne pas
Member
 
Registered: Jul 2008
Posts: 52

Rep: Reputation: 21
Quote:
Originally Posted by Su-Shee View Post
while(defined(my $line = $inputfile_handle->getline))

looks extremely un-perlish to me - maybe you can try the more typical

while(<FILEHANDLE>) here as you already did in your first example.
Maybe it's not that perlish but it's much safer and more readable.
It is recommended not to use bareword FILEHANDLEs but indirect filehandles created by FileHandle->new like in my code above or by open().
Code:
open(my $fh, '<', $filename) or die $!;
while(defined(my $line = <$fh>)) { # same as $fh->getline, but less readable
    # ...

}
close($fh) or die $!;
Quote:
Originally Posted by Su-Shee
Perl sets automatically $_ to one line at a time. Use $_ in the loop body to match the regex:
Bad recommentation. See what happens if you use $_ without localizing it first.
Code:
#!/usr/bin/perl

  use strict;
  use warnings;

  while(<DATA>) {
      bad_sub();
      print;
  }

  sub bad_sub {
      $_ = "blue";
  }

__DATA__
line1
line2
line3
line4
Output:
Quote:
blueblueblueblueblueblueblueblueblue
 
Old 08-09-2008, 06:29 PM   #9
chrism01
Guru
 
Registered: Aug 2004
Location: Brisbane
Distribution: Centos 6.2, Centos 5.8
Posts: 11,740

Rep: Reputation: 905Reputation: 905Reputation: 905Reputation: 905Reputation: 905Reputation: 905Reputation: 905Reputation: 905
I agree the compile once option ('o') for the regex is a good idea. The input file is apparently 'huge', so if it doesn't match, it saves on a 'huge' amt of I/O, which is consistent with the OP.
Incidentally, the 3 arg open is recommended these days:

open(IN_FILE, "<", "filename") or die "Unable to open $filename: $!\n";

using the lower precedence 'or' and printing the error msg ($!).

If we knew the actual file size and machine config it would be useful.

However, it sounds like a good candidate for divide-and-conquer:
a loose rule of thumb is that RAM is 10x slower than cpu, and disk is 10x slower than RAM, so I'd look at breaking the file into sections and multi-tasking eg fork or threads or just multiple copies of the prog operating on different file parts.
The cpu will be doing a lot of waiting on I/O, so we might as well use those cpu cycles, and then merge (concat) the results.

Last edited by chrism01; 08-09-2008 at 06:39 PM.
 
Old 08-12-2008, 06:41 AM   #10
gaynut
LQ Newbie
 
Registered: Jan 2008
Posts: 27

Original Poster
Rep: Reputation: 15
Hi All,
Thanks for the replies. Btw..sorry for the delay in my response.
I tried few of ur suggestions,
added compile once modifier.

tried the code and the tips given by ne_pas.
Alas, nothing sped up my code.

To answer few of ur queries,
yes, grep takes much time on those files.
The files sizes range from 5K to 14M. Running on Novell Desktop Linux version 9.
My PERL version is 5.8.3.

The file 14M is the one that maximum time. Yet, i observed one more behaviour,
When there is a pattern match, the code should write a line in the output swp file.
Strangely, it happens only for the last pattern it matches. The previous pattern matches are eaten away.
I couldn't understand, if there is a flaw in my logic.

thanks,
~ gaynut
 
Old 08-12-2008, 07:43 AM   #11
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 10,458

Rep: Reputation: 623Reputation: 623Reputation: 623Reputation: 623Reputation: 623Reputation: 623
My perl experience is somewhat limited, but when I have been doing significant tests, I have found it to be extremely efficient. So much so that even I can write code to outperform sed - on (generated test) files of several million records in excess of 200 bytes each.
From memory, my run times were well within acceptable limits - say 20 minutes. But this was on a hardware raid5 setup I happened to have laying around unused.

I think you have other problems - I/O contention would be a good place to start looking.
 
Old 08-12-2008, 06:51 PM   #12
chrism01
Guru
 
Registered: Aug 2004
Location: Brisbane
Distribution: Centos 6.2, Centos 5.8
Posts: 11,740

Rep: Reputation: 905Reputation: 905Reputation: 905Reputation: 905Reputation: 905Reputation: 905Reputation: 905Reputation: 905
Yeah, Perl is very efficient, I'd say 80-90% as quick as C for runtime, all things being equal.
Also, these days, 14M is small (medium?), I was assuming something in the GB range...
You should be able to get the whole thing in mem, depending on rec length:


Code:
# match/split
$file="test.txt";
open( TXT_FILE, "<", "$file" ) or  die "Can't open txt file: $file: $!\n";
@arr1 = <TXT_FILE>;
chomp(@arr1);
close(TXT_FILE) or  die "Can't close txt file: $file: $!\n";
Then loop match the array.

Mind you we've only seen the 'core' sub, maybe its something else in the rest of the prog?
Can you cut it down to a minimum prog that has the same problem and show us the code?
 
Old 08-12-2008, 10:02 PM   #13
sundialsvcs
Senior Member
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 3,685

Rep: Reputation: 329Reputation: 329Reputation: 329Reputation: 329
Why don't we tackle this problem "the COBOL way?" Let's go back to the days of yesteryear ... the days of punched cards, even before computers were invented(!) ... and see how they did it.

First, sort the file that contains those ALLOC_xxx strings. (Be sure in all cases to use a disk-based sort...)

Now, write a program that scans all of the files to find every occurrence of such a string, printing out a list of: ALLOC_xxx filename.

Sort and de-dupe this file. (Sorting programs can de-dupe.)

Now... all you need to do is to compare these two sorted files. You can do this, literally, one record at a time, because the sorting operation has placed every ALLOC_xxx value adjacent (and in a predictable sequence) in both files. (Once again, sorting programs can probably do this.)

In (literally) a matter of a minute or two, and maybe much faster, the job can be completed ... and it really doesn't matter how many records or hits were found.
 
Old 08-13-2008, 06:59 AM   #14
gaynut
LQ Newbie
 
Registered: Jan 2008
Posts: 27

Original Poster
Rep: Reputation: 15
Thanks Sundialsycs!.
I tried something similar to ur suggestion, although not exactly.
Had my "search pattern" and the "file" containing the pattern listed in a file.
Then i went for writing printf statements in those files. This required a lil work on pattern matching in the beginning to collect "patterns--filename" pairs,yet has helped speed up my program a lil.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are Off
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
Perl pattern matching - greedy wondergirl Programming 2 06-17-2008 03:32 PM
pattern matching in file amitpardesi Linux - Software 5 02-08-2008 07:06 AM
Perl pattern matching in VB rigel_kent Programming 1 05-30-2006 11:00 AM
Perl Pattern Matching Question pete1234 Programming 2 08-27-2005 10:26 AM
pattern matching in perl ludeKing Programming 9 04-02-2004 09:53 AM


All times are GMT -5. The time now is 06:38 PM.

Main Menu
 
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: @linuxquestions
Open Source Consulting | Domain Registration