LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   perl script - need help to improve performance (https://www.linuxquestions.org/questions/programming-9/perl-script-need-help-to-improve-performance-676091/)

johngreg 10-13-2008 01:56 PM

perl script - need help to improve performance
 
I wrote the following small script to find the different exception traces in a log and count each one. Could you please suggest ways to improve its performance? It has to run on an AIX machine, on files of about 100MB.
Code:

while(<FH>){
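    # $exStart and $exEnd are assumed to hold the regex patterns that delimit one exception trace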
    while ( /$exStart(.*?)$exEnd/sgmo ) {
        ++$c;
        if(exists $hashmap{$1}){
            $tempCount=$hashmap{$1};
            $hashmap{$1}= ++$tempCount;
        }
        else{
            $hashmap{$1}=1;
        }
    }
}

foreach $key ( keys %hashmap ) {
        $value = $hashmap{$key};
        print "\n value:: ", $value;
}
$num_keys = keys %hashmap;

thanks

chrism01 10-13-2008 08:05 PM

Remove
Code:

        ++$c;
you don't use that value anywhere.

Replace
Code:

            $tempCount=$hashmap{$1};
            $hashmap{$1}= ++$tempCount;

with
Code:

$hashmap{$1} = ++$hashmap{$1};
Replace
Code:

        $value = $hashmap{$key};
        print "\n value:: ", $value;

with
Code:

print "\n value:: $hashmap{$key}";
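
Putting those changes together (a sketch only -- it assumes, as in the original post, that $exStart, $exEnd and the FH filehandle are already set up), the loop could shrink to:
Code:

while(<FH>){
    while ( /$exStart(.*?)$exEnd/sgmo ) {
        ++$hashmap{$1};   # the self-assignment isn't needed; a missing key autovivifies to 0, so no exists() check either
    }
}

foreach $key ( keys %hashmap ) {
    print "\n value:: $hashmap{$key}";
}
$num_keys = keys %hashmap;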

sundialsvcs 10-13-2008 11:16 PM

Depending on how many millions of unique values you might be dealing with in any particular run, a completely different approach might also be called for...

Your present algorithm is based on the assumption that "hashes are 'free.'" Unfortunately, when a hash grows into hundreds-of-thousands or millions of entries, it is no longer free. Instead, every single access runs the risk of a page fault. The application, and the system itself, slows to a crawl...

An "unexpectedly different" algorithm would write all those keys to a disk-file, then sort that file (on disk...), and count them. When a file is sorted, all of the occurences of any particular key-value are guaranteed to be adjacent. "Counting" them, therefore, requires no main-memory at all.

Yes... this is "how they did it with punched cards, even before digital computers existed." And... :eek: ... it still works.

(In fact, it can out-perform algorithms such as yours by a factor of thousands . . . )
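
To make that concrete, here is one rough way to do it in Perl (a sketch only -- the file names are made up, and $exStart, $exEnd and FH are assumed to be set up as in the original script): flatten each extracted trace onto one line, append it to a work file, and let the external sort/uniq pipeline do the counting on disk.
Code:

open(TMP, '>', 'traces.txt') or die "cannot write traces.txt: $!";
while(<FH>){
    while ( /$exStart(.*?)$exEnd/sgmo ) {
        $trace = $1;
        $trace =~ s/\n/ /g;                  # flatten the multi-line trace onto a single line
        print TMP $trace, "\n";
    }
}
close TMP;

# sorting puts identical traces next to each other, so uniq -c can count them
# while holding only one key in memory at a time
system("sort traces.txt | uniq -c | sort -rn > trace_counts.txt") == 0
    or die "sort/uniq pipeline failed: $?";

sort(1) does an external merge sort, spilling to temporary files as needed, so memory use stays bounded no matter how many distinct traces there are.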

chrism01 10-14-2008 01:56 AM

To be honest, I'm wondering if you've actually run that or whether it's part of a much larger program. Normally I'd expect Perl to rip through a 100MB file pdq...
I can't imagine AIX on a small machine...

keefaz 10-14-2008 08:34 AM

Code:

while ( /$exStart(.*?)$exEnd/sgmo ) {
Are you sure you don't want:
Code:

if ( /$exStart(.*?)$exEnd/sgmo ) {
Also, don't forget to close FH :)

johngreg 10-14-2008 01:39 PM

Quote:

Originally Posted by chrism01 (Post 3309414)
To be honest, I'm wondering if you've actually run that or whether it's part of a much larger program. Normally I'd expect Perl to rip through a 100MB file pdq...
I can't imagine AIX on a small machine...

Yes Chris, you're right... this script runs pretty fast on a single 100MB file, but it needs to run on 400 such files every hour, so I didn't want to face any surprises on the live servers.

johngreg 10-14-2008 01:46 PM

Quote:

Originally Posted by sundialsvcs (Post 3309321)
Depending on how many millions of unique values you might be dealing with in any particular run, a completely different approach might also be called for...

Your present algorithm is based on the assumption that "hashes are 'free.'" Unfortunately, when a hash grows into hundreds-of-thousands or millions of entries, it is no longer free. Instead, every single access runs the risk of a page fault. The application, and the system itself, slows to a crawl...

An "unexpectedly different" algorithm would write all those keys to a disk-file, then sort that file (on disk...), and count them. When a file is sorted, all of the occurences of any particular key-value are guaranteed to be adjacent. "Counting" them, therefore, requires no main-memory at all.

Yes... this is "how they did it with punched cards, even before digital computers existed." And... :eek: ... it still works.

(In fact, it can out-perform algorithms such as yours by a factor of thousands . . . )

Let me try: so should I write the traces to a new file with some delimiter, then open that temp file and split on the delimiter? But won't I then again need some data structure to hold the two-dimensional values, i.e. the exception trace (around 100 lines) and the number of repetitions of that trace?

chrism01 10-14-2008 07:00 PM

Well, you could fork a number of copies, as each file seems to be treated separately.
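
For instance, a rough sketch of that idea (the @logfiles list and the count_traces() routine are placeholders, not something from this thread), running a batch of files in parallel child processes:
Code:

@logfiles = glob("/path/to/logs/*.log");   # placeholder path
$max_kids = 4;                             # tune to the number of CPUs on the box

while (@logfiles) {
    @batch = splice(@logfiles, 0, $max_kids);
    @pids  = ();
    foreach $file (@batch) {
        $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                   # child: handle one file, then exit
            count_traces($file);
            exit 0;
        }
        push @pids, $pid;
    }
    waitpid($_, 0) foreach @pids;          # finish the batch before starting the next
}

sub count_traces {                         # placeholder for the counting loop from the original script
    ($file) = @_;
    print "processing $file in pid $$\n";
}

Each child would keep its own %hashmap, so the per-file results would need to be written out (or merged) afterwards.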

