LinuxQuestions.org
Programming: This forum is for all programming questions. The question does not have to be directly related to Linux, and any language is fair game.

Old 10-13-2008, 01:56 PM   #1
johngreg
LQ Newbie
 
Registered: Oct 2008
Posts: 14

Rep: Reputation: 0
perl script - need help to improve performance


I wrote the following small script to find the distinct exception traces in a log and count how often each one occurs. Could you please suggest ways to improve its performance? It has to run on an AIX machine, on files of about 100MB each.
Code:
while(<FH>){
    while ( /$exStart(.*?)$exEnd/sgmo ) {
        ++$c;
        if(exists $hashmap{$1}){
            $tempCount=$hashmap{$1};
            $hashmap{$1}= ++$tempCount;
        }
        else{
            $hashmap{$1}=1;
        }
    }
}

foreach $key ( keys %hashmap ) {
        $value = $hashmap{$key};
        print "\n value:: ", $value;
}
$num_keys = keys %hashmap;
thanks

 
Old 10-13-2008, 08:05 PM   #2
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,289

Rep: Reputation: 2034
Remove
Code:
        ++$c;
You don't use that value anywhere.

Replace
Code:
        if(exists $hashmap{$1}){
            $tempCount=$hashmap{$1};
            $hashmap{$1}= ++$tempCount;
        }
        else{
            $hashmap{$1}=1;
        }
with
Code:
$hashmap{$1}++;
Perl autovivifies the entry on first use, so the exists check and the temporary are unnecessary.
Replace
Code:
        $value = $hashmap{$key};
        print "\n value:: ", $value;
with
Code:
print "\n value:: $hashmap{$key}";
 
Old 10-13-2008, 11:16 PM   #3
sundialsvcs
Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 5,401

Rep: Reputation: 1119
Depending on how many millions of unique values you might be dealing with in any particular run, a completely different approach might also be called for...

Your present algorithm is based on the assumption that "hashes are 'free.'" Unfortunately, when a hash grows into hundreds-of-thousands or millions of entries, it is no longer free. Instead, every single access runs the risk of a page fault. The application, and the system itself, slows to a crawl...

An "unexpectedly different" algorithm would write all those keys to a disk-file, then sort that file (on disk...), and count them. When a file is sorted, all of the occurences of any particular key-value are guaranteed to be adjacent. "Counting" them, therefore, requires no main-memory at all.

Yes... this is "how they did it with punched cards, even before digital computers existed." And... ... it still works.

(In fact, it can out-perform algorithms such as yours by a factor of thousands . . . )
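One way to sketch that disk-based approach in Perl: flatten each captured trace to a single line, append it to a temp file, sort the file with the system sort(1), and count adjacent duplicates while streaming the sorted output. Everything here (the sample traces, the \x01 newline encoding) is illustrative rather than taken from the original script, and the %seen hash exists only so the sketch can print a summary at the end:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Illustration: each multi-line trace is flattened to one line
# (embedded newlines encoded as \x01) and appended to a temp file.
my ($tmp, $tmpname) = tempfile();
for my $trace ("foo\nbar", "baz", "foo\nbar") {
    (my $flat = $trace) =~ s/\n/\x01/g;
    print $tmp "$flat\n";
}
close $tmp;

# Sort on disk, then count adjacent identical lines while streaming;
# no large in-memory hash of traces is needed.
open my $sorted, '-|', 'sort', $tmpname or die $!;
my ($prev, $count, %seen);    # %seen is only for the demo summary
while (my $line = <$sorted>) {
    chomp $line;
    if (defined $prev && $line eq $prev) {
        $count++;
    }
    else {
        $seen{$prev} = $count if defined $prev;
        ($prev, $count) = ($line, 1);
    }
}
$seen{$prev} = $count if defined $prev;
close $sorted;
unlink $tmpname;

print "$seen{$_}\t$_\n" for sort keys %seen;
```

In a real run you would print each count as soon as the streak of identical lines ends, so memory stays flat no matter how many distinct traces there are.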
 
Old 10-14-2008, 01:56 AM   #4
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,289

Rep: Reputation: 2034
To be honest, I'm wondering whether you've actually run that, or whether it's part of a much larger program. Normally I'd expect Perl to rip through a 100MB file pdq...
I can't imagine AIX on a small machine...
 
Old 10-14-2008, 08:34 AM   #5
keefaz
Senior Member
 
Registered: Mar 2004
Distribution: Slackware
Posts: 4,614

Rep: Reputation: 136
Code:
while ( /$exStart(.*?)$exEnd/sgmo ) {
Are you sure you don't want:
Code:
if ( /$exStart(.*?)$exEnd/sgmo ) {
Also, don't forget to close FH
 
Old 10-14-2008, 01:39 PM   #6
johngreg
LQ Newbie
 
Registered: Oct 2008
Posts: 14

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by chrism01 View Post
To be honest, I'm wondering whether you've actually run that, or whether it's part of a much larger program. Normally I'd expect Perl to rip through a 100MB file pdq...
I can't imagine AIX on a small machine...
Yes Chris, you're right - the script runs pretty fast on a single 100MB file, but it needs to run on about 400 such files every hour, so I didn't want any surprises on the live servers.
 
Old 10-14-2008, 01:46 PM   #7
johngreg
LQ Newbie
 
Registered: Oct 2008
Posts: 14

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by sundialsvcs View Post
Depending on how many millions of unique values you might be dealing with in any particular run, a completely different approach might also be called for...

Your present algorithm is based on the assumption that "hashes are 'free.'" Unfortunately, when a hash grows into hundreds-of-thousands or millions of entries, it is no longer free. Instead, every single access runs the risk of a page fault. The application, and the system itself, slows to a crawl...

An "unexpectedly different" algorithm would write all those keys to a disk-file, then sort that file (on disk...), and count them. When a file is sorted, all of the occurences of any particular key-value are guaranteed to be adjacent. "Counting" them, therefore, requires no main-memory at all.

Yes... this is "how they did it with punched cards, even before digital computers existed." And... ... it still works.

(In fact, it can out-perform algorithms such as yours by a factor of thousands . . . )
Let me try that. So should I write the traces to a new file with some delimiter, then reopen the temp file and split on it? But won't I still need some data structure to hold the two values per entry - the exception trace (around 100 lines) and the number of times it repeats?
 
Old 10-14-2008, 07:00 PM   #8
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,289

Rep: Reputation: 2034
Well, you could fork a number of copies, as each file seems to be treated separately.
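A minimal sketch of that forking idea, with placeholder file names and an assumed concurrency cap (the real per-file counting loop would go where the comment is):

```perl
use strict;
use warnings;

my @files    = ('a.log', 'b.log', 'c.log');  # placeholders for the 400 log files
my $max_kids = 2;                            # assumed cap, e.g. number of CPUs

my @pids;
for my $file (@files) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: process one file here (the counting loop from the
        # original script would run against $file).
        exit 0;
    }
    push @pids, $pid;
    # Crude throttle: once $max_kids children are running, reap one
    # before starting the next.
    if (@pids >= $max_kids) {
        waitpid(shift(@pids), 0);
    }
}
waitpid($_, 0) for @pids;    # reap the stragglers
print "all children finished\n";
```

Each child gets its own hash, so memory pressure per process stays the same as the single-file case; only the number of concurrent children needs tuning.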
 
  

