LinuxQuestions.org - Improving the performance !!!

Improving the performance !!!

Hi,

I am working with a small piece of code to find common elements between two files.

Code:

#! /opt/third-party/bin//perl

open(sf1,small) || die "couldn't open the file small file!";

$cnt = 1;
@file = ();

while ( $record1 = <sf1> ) {
  if( $cnt <= 1000 ) {
    push(@file, $record1);
    $cnt += 1;
  }

  else {
    $cnt = 1;
    open(tf1,big) || die "couldn't open the file <big file!>";
    while ( $record2 = <tf1> ) {
        foreach(@file) {
          if( $record2 eq $_ ) {
            print "Got it : $record2" ;
          }
        }
    }
    close(tf1);
    @file = ($record1);
  }
}
close(sf1);

print "\n";
exit 0;

and another piece using grep:

Code:

#! /opt/third-party/bin//zsh

while read line
do
  grep $line big 2>/dev/null 1>&2

  if [ $? -eq 0 ]
  then
    echo $line
  fi

done < small

exit 0

The grep version seems to outperform the Perl one. I am surprised, because in the Perl version the file contents are read into memory and the comparisons are made against in-memory elements, so I assumed Perl would be faster.
But in every sample I ran, with different numbers of records, grep was faster than Perl.

Is there any way to improve the performance of the Perl version?
Thanks in advance for your input!

If you want fast:

Code:

grep -f small big

Quote:

Originally Posted by jim mcnamara

If you want fast:

Code:

grep -f small big

Another experiment, related to the original post:

I noticed that if you do

Code:

egrep "(item1|item2)" file.lst

against a million-record file, it is a big bottleneck compared to putting the quoted items into a small file.

Why is that?

For a start, use a hash instead of @file and say:

Code:

    while ( $record2 = <tf1> )
    {
        if( exists($file_hash{$record2}) )
        {
            print "Got it : $record2" ;
        }
    }

http://perldoc.perl.org/search.html?q=hash+example
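The snippet above assumes %file_hash has already been filled with the lines of the small file. A minimal sketch of that load-then-probe idea follows. To stay runnable from a plain shell it uses awk, whose associative arrays play the same role as a Perl hash, so it is a stand-in for the Perl code and not a rewrite of it. The sample file contents and the array name wanted are invented for illustration.

```shell
# Scratch directory with hypothetical sample data.
workdir=$(mktemp -d)
printf '%s\n' 'code' 'adding' > "$workdir/small"
printf '%s\n' 'adding' 'more' 'code' 'adding' > "$workdir/big"

# While reading the first file (NR == FNR), store each line as an array key.
# While reading the second file, print a line only if it is a stored key.
# Each lookup is a single key probe, not a scan over every stored line.
matches=$(awk 'NR == FNR { wanted[$0] = 1; next }
               ($0 in wanted) { print "Got it : " $0 }' \
              "$workdir/small" "$workdir/big")

printf '%s\n' "$matches"
rm -rf "$workdir"
```

The small file is read once and the big file once, instead of the big file being re-read for every chunk of the small one.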

Quote:

Originally Posted by jim mcnamara

If you want fast:

Code:

grep -f small big

This is actually slower than the two code snippets I posted. It takes almost 10 times as long as either of them. This seems to be really slow.

Define seems.

Try running the script with time ./scriptname to get real numbers for the run time.

And you could be right - if small has thousands of lines in it, the regexp created could be snail slow. You can do things to "tune" a regexp. Are the contents of your "small" whole lines? Try prepending ^ to each line in small:

Code:

This is a line

^This is a line
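Editing a long pattern file by hand is tedious, so here is a minimal sketch of doing the prepending with sed. It assumes the lines of small are meant as patterns that start at the beginning of a line. The file name small.anchored and the sample contents are made up for illustration. Note that anchoring can also change which lines match, not only how fast grep runs.

```shell
# Scratch directory with hypothetical sample data.
workdir=$(mktemp -d)
printf '%s\n' 'adding' 'do' > "$workdir/small"
printf '%s\n' 'adding' 'do' 'nothing to do' 'more' > "$workdir/big"

# Prepend ^ to every pattern line: "do" becomes "^do".
sed 's/^/^/' "$workdir/small" > "$workdir/small.anchored"

# Unanchored patterns may match anywhere in a line.
unanchored=$(grep -f "$workdir/small" "$workdir/big")

# Anchored patterns match only at the start of a line.
anchored=$(grep -f "$workdir/small.anchored" "$workdir/big")

printf 'unanchored:\n%s\n\nanchored:\n%s\n' "$unanchored" "$anchored"
rm -rf "$workdir"
```

Here "nothing to do" is matched by the unanchored pattern do but not by ^do, so compare the outputs as well as the timings before switching.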

Quote:

Originally Posted by jim mcnamara

Code:

This is a line

^This is a line

That's great, Jim.

Prepending ^ does seem to improve the performance:

>> grep -f small big

small - 1000
big - 24000

without prepending
>> time grep -f small big (16.647 sec)

after prepending ^
>> time grep -f small big (7.575 sec)

But what's the magic in prepending ^?
Is the regexp able to go straight to the pattern ^word?
Could you please explain that?

Many thanks once again!

^ matches the beginning of a line. So if the pattern doesn't match right there, grep immediately moves on to the next line without trying to match at every other position in the line. For example, a line with "foobar" matches both "foo" and "bar", and matches "^foo", but not "^bar". For "^bar", grep sees that the first char is 'f', not 'b', so the line cannot match the pattern.
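The "foobar" example can be checked directly. This is a small sketch using grep -c, which prints the number of matching lines; the variable names are arbitrary.

```shell
# One sample line, tested against unanchored and anchored patterns.
line='foobar'

# grep -c prints the count of matching lines (1 or 0 here).
# "|| true" keeps the script going when grep finds nothing and exits 1.
plain_foo=$(printf '%s\n' "$line" | grep -c 'foo' || true)
plain_bar=$(printf '%s\n' "$line" | grep -c 'bar' || true)
anchored_foo=$(printf '%s\n' "$line" | grep -c '^foo' || true)
anchored_bar=$(printf '%s\n' "$line" | grep -c '^bar' || true)

echo "foo=$plain_foo bar=$plain_bar ^foo=$anchored_foo ^bar=$anchored_bar"
```

Only ^bar reports 0, since "bar" occurs in the line but not at its start.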

Quote:

Originally Posted by chrism01

For a start, use a hash instead of @file and say:

Code:

    while ( $record2 = <tf1> )
    {
        if( exists($file_hash{$record2}) )
        {
            print "Got it : $record2" ;
        }
    }

http://perldoc.perl.org/search.html?q=hash+example

Thanks for the reply!
I tried the following code,
but its output seems to differ from the other implementations.

perl code:

Code:

#! /opt/third-party/bin/perl

open(fh, "s") || die "unable to open the file <small>";

%fileHash = (-100, 'somejunk');
$i = 1;

while( $content = <fh> )
{
  if( $i <= 2 ) {
    $fileHash{($content)} = $i;
    $i++;
  }
  else {
    print "the count is $i\n";
    foreach $k1 ( sort keys (%fileHash) ) {
      print "key is $k1 and value is $fileHash{$k1}";
    }
    $i = 1;
    open(file, "b") || die "Unable to open the file <big>";
    while ( $rec = <file> ) {
        print "record is $rec\n";
        print "Got it:$rec" if exists $fileHash{$rec} ;
    }
    close(file);
    %fileHash = ();
    $fileHash{($content)} = $i;
  }
}
close(fh);

print "i val is $i\n";

open(file, "b") || die "Unable to open the file <big>";
while ( $rec = <file> ) {
  print "$rec" if exists $fileHash{$rec} ;
}
close(file);

%fileHash = ();

exit 0

These are the sample files I used.

Code:

>cat s
and then
code
adding
do
something extra here
i would
peculiardino

Code:

>cat b
adding
do
adding
code
do
i would
like to addd
do
i would
like to addd
more
with this i test
wow is that working
adding
and then
code
like to addd
more
with this i test
wow is that working

Code:

>perl file.pl | grep "Got it" | sort -u
Got it:adding
Got it:and then
Got it:code

Code:

>grep -f s b | sort -u
adding
and then
code
do
i would

What I am actually trying to do is emulate grep -f <file1> <file2> in Perl code.

I don't see anything weird in the code.

a) The two outputs are not the same.
b) I would like to be able to do the equivalent of sort -u within the Perl code.

Many thanks in advance :)
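Two things can be read off the script and commands above, offered as a sketch and not as a full diagnosis. First, the final pass over b prints matches without the "Got it:" prefix, so the grep "Got it" filter drops them. Second, grep -f treats each line of s as a regexp that may match anywhere in a line, while a hash lookup compares whole lines, so the two approaches are not equivalent in general. The sketch below shows the second point with invented data, using grep's -F (fixed strings) and -x (whole line) options for the hash-like behaviour. It also shows one way to drop repeats without an external sort -u, written in awk as a stand-in for a second Perl hash; the array names wanted and seen are hypothetical.

```shell
# Hypothetical data: "do" is a substring of "window", "code" of "decode".
workdir=$(mktemp -d)
printf '%s\n' 'do' 'code' > "$workdir/s"
printf '%s\n' 'window' 'do' 'code' 'do' 'decode' > "$workdir/b"

# grep -f: every line of s is a regexp that may match anywhere in a line.
substring=$(grep -f "$workdir/s" "$workdir/b" | sort -u)

# -F -x: fixed strings that must equal the whole line, like a hash lookup.
whole_line=$(grep -F -x -f "$workdir/s" "$workdir/b" | sort -u)

# Whole-line lookup plus de-duplication in one program:
# seen[$0]++ is 0 (false) only the first time a given line turns up.
# Output keeps the order of first appearance in b; it is not sorted.
deduped=$(awk 'NR == FNR { wanted[$0] = 1; next }
               ($0 in wanted) && !seen[$0]++' "$workdir/s" "$workdir/b")

printf 'substring:\n%s\n\nwhole line:\n%s\n\ndeduped:\n%s\n' \
       "$substring" "$whole_line" "$deduped"
rm -rf "$workdir"
```

The substring run reports window and decode as well, which a whole-line lookup never would.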

Or rather, with a slight modification:

Code:

#! /opt/third-party/bin/perl

open(fh, "small") || die "unable to open the file <small>";

%fileHash = (-100, 'somejunk');

$i = 1;
# Test the read itself, then chomp: chomp() returns 0 for a final line
# with no trailing newline, which would end the loop one line early.
while( defined($content = <fh>) )
{
  chomp($content);
  $fileHash{$content} = $i;
  $i += 1;
}
close(fh);

open(file, "big") || die "Unable to open the file <big>";
while ( defined($rec = <file>) ) {
  chomp($rec);
  print "\nMatch:$rec" if exists $fileHash{$rec} ;
}
close(file);

%fileHash = ();

exit 0

But I am worried about whether I can create hashes that large.

I still need to figure out why the previous Perl implementation is not working. :confused: