PERL pattern matching while reading a file content is dead slow
ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
PERL pattern matching while reading a file content is dead slow
Hi,
I have a piece of code, that is slowing down my entire program.
The patterns are stored in a file called "alloc_ids".They resemble "ALLOC_xxxx". I have to match every allocid against a list of files and if found , write a "fprintf" statement in the file.
The code is: sub write_file {
my ($filename,$alloc_mem)=@_[0,1];
$alloc_mem =~ s/\s+//g;
print "$alloc_mem \t $filename";
open (FH,"$dir_path/$filename" || die "file not found");
open (FW,">$dir_path/$filename.swp");
while ( $line = <FH> ) {
print FW "$line";
if ( $line =~ /$alloc_mem/ ) {
print FW "fprintf \(stderr\,\"Accessing Allocid \%x\"\,$alloc_mem\)\;";
}
}
}
Problem:
When a file doesn't contain a pattern ALLOC_xxx at all ,the process time is very quick. However, when it contains a pattern ALLOC_xxx which may not exactly match the "$alloc_mem", the code is very slow.
I feel, reading every line in the file and matching the contents against a pattern makes it dead slow. Runs for hours together.
Although, my file contents are huge, is there no option to run my script fast?
Is there any way to overcome this?
Any other way of writing the script?
I don't see anything in your code which should take a particularly long time to execute. Maybe reading from this file is slow because the IO is overloaded? Is the file slow to read by other programs (e.g. grep it for the same pattern and see if that takes the same time).
I'd bet its the prints that are slowing you down. Instead of doing the print to stderr, accumulate the string in an array, then after you've finished the file print the array.
Or would that not work, looking at your code a second time I see your actually printing that line to the FW.
Last edited by estabroo; 08-09-2008 at 08:40 AM.
Reason: second thoughts
I'd bet its the prints that are slowing you down. Instead of doing the print to stderr, accumulate the string in an array, then after you've finished the file print the array.
No, it's the RegExp, it gets re-compiled on every loop-iteration. But please don't keep the whole output in memory unless it's not otherwise possible.
Code:
open (FH,"$dir_path/$filename" || die "file not found");
here, the script will never die. It should be
Code:
open (FH, "$dir_path/$filename") || die "File not found\n";
or better use "or" instead of "||", it has lower presedence,
but it would be much better to use the following, it's more readable:
Code:
if(open (FH, "$dir_path/$filename")) {
# do something with FH
}
else {
die "Error: Can't open file '$dir_path/$filename': $!\n";
}
or use unless, as you can see in code at the bottom.
The Pattern ($alloc_mem) gets re-compiled for every loop-iteration, use the
o-modifier or define the RegExp outside of the while-loop once, that should help.
Code:
if ( $line =~ /$alloc_mem/o ) { # o-modifier: compile pattern only once
# ...
}
A full cleanup of your code, but beware, it's written on-the-fly (untested).
Code:
use FileHandle;
use File::Spec;
sub write_file {
my($filename, $alloc_mem) = @_;
$alloc_mem =~ s/\s+//g;
print "$alloc_mem \t $filename";
# concatenate dir_path and filename
my $inputfile_path = File::Spec->catfile( $dir_path => $filename );
my $inputfile_handle = FileHandle->new;
unless($inputfile_handle->open($inputfile_path, "r")) {
die "Error: Can't open file '$inputfile_path': $!\n";
}
my $outputfile_path = $inputfile_path . ".swp";
my $outputfile_handle = FileHandle->new;
unless($outputfile_handle->open($outputfile_path), "w") {
die "Error: Can't open file '$outputfile_path': $!\n";
}
# pre-define regexp
my $RE =~ qr/$alloc_mem/;
while(defined(my $line = $inputfile_handle->getline)) {
if($line =~ m/$RE/) {
$outputfile_handle->print(qq{fprintf (stderr, "Accessing Allocid %x ", $alloc_mem);});
}
}
}
Last edited by ne pas; 08-09-2008 at 09:28 AM.
Reason: code errors removed
Are you sure its the regex recompile, because that would happen regardless of whether the pattern occured in the file or not and the OP stated that if the pattern doesn't occur it goes very quickly.
Are you sure its the regex recompile, because that would happen regardless of whether the pattern occured in the file or not and the OP stated that if the pattern doesn't occur it goes very quickly.
No, you're right I haven't read this, sorry for that, it's probably the print. But the OP can gain a speed-up if he follow my suggestions.
Code:
use strict;
use warnings;
use Benchmark;
timethese(10 => {
'with-o' => sub
{
my @str = ('AAAA'..'ZZZZ');
my $pattern = join '' => ('AAA'..'ZZZ');
my $i = 0;
for(@str) {
if(/$pattern/o) {
$i++;
}
}
print "$i\n";
},
'without-o' => sub
{
my @str = ('AAAA'..'ZZZZ');
my $pattern = join '' => ('AAA'..'ZZZ');
my $i = 0;
for(@str) {
if(/$pattern/) {
$i++;
}
}
print "$i\n";
}
});
My idea would be discarding the greedy "+" which will involve some backtracking, because perl's regex engine would go forward to check wether or not something possibly matching will follow and than go back afterwards if not.
Maybe \s\s\s\s will help or \s{4} or +? (disable greediness of quantifiers). Or be more precise but \s - \s is an entire class of whitespaces to choose from.
Also consider the "compile just once" flag (m///o) for regexes, because you don't have any variable stuff in your match pattern, so it doesn't has to be re-evaluated every time it runs in the loop.
looks extremely un-perlish to me - maybe you can try the more typical
while(<FILEHANDLE>) here as you already did in your first example.
Maybe it's not that perlish but it's much safer and more readable.
It is recommended not to use bareword FILEHANDLEs but indirect filehandles created by FileHandle->new like in my code above or by open().
Code:
open(my $fh, '<', $filename) or die $!;
while(defined(my $line = <$fh>)) { # same as $fh->getline, but less readable
# ...
}
close($fh) or die $!;
Quote:
Originally Posted by Su-Shee
Perl sets automatically $_ to one line at a time. Use $_ in the loop body to match the regex:
Bad recommentation. See what happens if you use $_ without localizing it first.
Code:
#!/usr/bin/perl
use strict;
use warnings;
while(<DATA>) {
bad_sub();
print;
}
sub bad_sub {
$_ = "blue";
}
__DATA__
line1
line2
line3
line4
I agree the compile once option ('o') for the regex is a good idea. The input file is apparently 'huge', so if it doesn't match, it saves on a 'huge' amt of I/O, which is consistent with the OP.
Incidentally, the 3 arg open is recommended these days:
open(IN_FILE, "<", "filename") or die "Unable to open $filename: $!\n";
using the lower precedence 'or' and printing the error msg ($!).
If we knew the actual file size and machine config it would be useful.
However, it sounds like a good candidate for divide-and-conquer:
a loose rule of thumb is that RAM is 10x slower than cpu, and disk is 10x slower than RAM, so I'd look at breaking the file into sections and multi-tasking eg fork or threads or just multiple copies of the prog operating on different file parts.
The cpu will be doing a lot of waiting on I/O, so we might as well use those cpu cycles, and then merge (concat) the results.
Hi All,
Thanks for the replies. Btw..sorry for the delay in my response.
I tried few of ur suggestions,
added compile once modifier.
tried the code and the tips given by ne_pas.
Alas, nothing sped up my code.
To answer few of ur queries,
yes, grep takes much time on those files.
The files sizes range from 5K to 14M. Running on Novell Desktop Linux version 9.
My PERL version is 5.8.3.
The file 14M is the one that maximum time. Yet, i observed one more behaviour,
When there is a pattern match, the code should write a line in the output swp file.
Strangely, it happens only for the last pattern it matches. The previous pattern matches are eaten away.
I couldn't understand, if there is a flaw in my logic.
My perl experience is somewhat limited, but when I have been doing significant tests, I have found it to be extremely efficient. So much so that even I can write code to outperform sed - on (generated test) files of several million records in excess of 200 bytes each.
From memory, my run times were well within acceptable limits - say 20 minutes. But this was on a hardware raid5 setup I happened to have laying around unused.
I think you have other problems - I/O contention would be a good place to start looking.
Yeah, Perl is very efficient, I'd say 80-90% as quick as C for runtime, all things being equal.
Also, these days, 14M is small (medium?), I was assuming something in the GB range...
You should be able to get the whole thing in mem, depending on rec length:
Code:
# match/split
$file="test.txt";
open( TXT_FILE, "<", "$file" ) or die "Can't open txt file: $file: $!\n";
@arr1 = <TXT_FILE>;
chomp(@arr1);
close(TXT_FILE) or die "Can't close txt file: $file: $!\n";
Then loop match the array.
Mind you we've only seen the 'core' sub, maybe its something else in the rest of the prog?
Can you cut it down to a minimum prog that has the same problem and show us the code?
Why don't we tackle this problem "the COBOL way?" Let's go back to the days of yesteryear ... the days of punched cards, even before computers were invented(!) ... and see how they did it.
First, sort the file that contains those ALLOC_xxx strings. (Be sure in all cases to use a disk-based sort...)
Now, write a program that scans all of the files to find every occurrence of such a string, printing out a list of: ALLOC_xxx filename.
Sort and de-dupe this file. (Sorting programs can de-dupe.)
Now... all you need to do is to compare these two sorted files. You can do this, literally, one record at a time, because the sorting operation has placed every ALLOC_xxx value adjacent (and in a predictable sequence) in both files. (Once again, sorting programs can probably do this.)
In (literally) a matter of a minute or two, and maybe much faster, the job can be completed ... and it really doesn't matter how many records or hits were found.
Thanks Sundialsycs!.
I tried something similar to ur suggestion, although not exactly.
Had my "search pattern" and the "file" containing the pattern listed in a file.
Then i went for writing printf statements in those files. This required a lil work on pattern matching in the beginning to collect "patterns--filename" pairs,yet has helped speed up my program a lil.
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.