Improving the performance !!!

kshkid · 12-19-2006, 12:19 PM

Hi,

I am working with a small piece of code to find common elements between two files.

Code:

#! /opt/third-party/bin//perl

open(sf1,small) || die "couldn't open the file small file!";

$cnt = 1;
@file = ();

while ( $record1 = <sf1> ) {
  if( $cnt <= 1000 ) {
    push(@file, $record1);
    $cnt += 1;
  }

  else {
    $cnt = 1;
    open(tf1,big) || die "couldn't open the file <big file!>";
    while ( $record2 = <tf1> ) {
        foreach(@file) {
          if( $record2 eq $_ ) {
            print "Got it : $record2" ;
          }
        }
    }
    close(tf1);
    @file = ($record1);
  }
}
close(sf1);

print "\n";
exit 0;

and another version using grep:

Code:

#! /opt/third-party/bin//zsh

while read line
do
grep $line big 2>/dev/null 1>&2

if [ $? -eq 0 ]
then
echo $line
fi

done < small

exit 0

The grep version seems to outperform the Perl version. I am surprised, because the Perl script reads file contents into memory and compares against the in-memory elements, so I expected it to be faster.
But in every sample I ran, with different numbers of records, grep was faster than Perl.

Is there any way to improve the performance of the Perl version?
Thanks in advance for your input!

jim mcnamara · 12-19-2006, 12:48 PM

If you want fast:

Code:

grep -f small big

schneidz · 12-19-2006, 02:22 PM

Quote:

Originally Posted by jim mcnamara

If you want fast:

Code:

grep -f small big

Another experiment related to the original post:

I noticed that if you run

Code:

egrep "(item1|item2)" file.lst

against a million-record file, it is much slower than putting the quoted items into a small file.

Why is that?

chrism01 · 12-19-2006, 05:09 PM

For a start, use a hash instead of @file and say:

Code:

    while ( $record2 = <tf1> ) 
    {
        # Constant-time hash lookup replaces the inner loop over @file
        if( exists($file_hash{$record2}) )
        {
            print "Got it : $record2" ;
        }
    }

http://perldoc.perl.org/search.html?q=hash+example

kshkid · 12-20-2006, 03:01 AM

Quote:

Originally Posted by jim mcnamara

If you want fast:

Code:

grep -f small big

This is actually much slower than the two code snippets I posted. It takes almost 10 times as long as either of them.

jim mcnamara · 12-20-2006, 10:21 AM

Define seems.

Try running the script as time ./scriptname to get real answers about performance.

And you could be right - if small has thousands of lines in it, the regexp created could be snail slow. You can do things to "tune" a regexp. Are the contents of your "small" whole lines? Try prepending ^ to each line in small:

Code:

This is a line
^This is a line
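Editing a thousand pattern lines by hand is impractical, so here is a minimal sketch of doing the prepending with sed. The file names and contents are hypothetical demo data in a temporary directory, not the poster's real files. Note that anchoring changes what matches: an anchored pattern only matches lines that begin with it.

```shell
#!/bin/sh
# Hypothetical demo pattern file in a temp directory (assumption, not real data).
dir=$(mktemp -d)
printf '%s\n' 'This is a line' 'another line' > "$dir/small"

# Insert a literal ^ at the start of every pattern line.
# The first ^ is the regex anchor; the second is the replacement text.
sed 's/^/^/' "$dir/small" > "$dir/small.anchored"

anchored=$(cat "$dir/small.anchored")
printf '%s\n' "$anchored"
```

The anchored file would then be used in place of the original, as in grep -f small.anchored big.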

kshkid · 12-20-2006, 10:46 AM

Quote:

Originally Posted by jim mcnamara

Define seems.

Try running the script as time ./scriptname to get real answers about performance.

And you could be right - if small has thousands of lines in it, the regexp created could be snail slow. You can do things to "tune" a regexp. Are the contents of your "small" whole lines? Try prepending ^ to each line in small:

Code:

This is a line
^This is a line

That's great, Jim.

Prepending ^ does improve the performance:

>> grep -f small big

small - 1000
big - 24000

without prepending
>> time grep -f small big (16.647 sec)

after prepending ^
>> time grep -f small big (7.575 sec)

But what is the magic in prepending ^?
Is the regexp able to go straight to the pattern ^word?
Could you please explain that?

Many thanks once again!

tuxdev · 12-20-2006, 11:04 AM

^ matches the beginning of a line. So if the pattern does not match right at the start, grep immediately skips to the next line instead of trying to match at every possible position on the line. For example, a line containing "foobar" matches both "foo" and "bar", and matches "^foo", but not "^bar". For "^bar", grep sees that the first character is 'f', not 'b', so the line cannot match the pattern.
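The "foobar" example can be checked directly. This is a small sketch that tests one line against the four patterns mentioned; the variable names are only for illustration.

```shell
#!/bin/sh
# Test the single line "foobar" against anchored and unanchored patterns.
line='foobar'
results=''
for pattern in 'foo' 'bar' '^foo' '^bar'; do
    # grep -q prints nothing and reports the outcome via its exit status.
    if printf '%s\n' "$line" | grep -q -- "$pattern"; then
        results="$results$pattern:match "
    else
        results="$results$pattern:no-match "
    fi
done
printf '%s\n' "$results"
```

Only "^bar" fails, because "bar" is present in the line but not at its start.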

kshkid · 12-20-2006, 11:12 AM

Quote:

Originally Posted by chrism01

For a start, use a hash instead of @file and say:

Code:

    while ( $record2 = <tf1> ) 
    {
        # Constant-time hash lookup replaces the inner loop over @file
        if( exists($file_hash{$record2}) )
        {
            print "Got it : $record2" ;
        }
    }

http://perldoc.perl.org/search.html?q=hash+example

Thanks for the reply!
I tried the following code,
but its output seems to differ from that of the other implementations.

perl code:

Code:

#! /opt/third-party/bin/perl

open(fh, "s") || die "unable to open the file <small>";

%fileHash = (-100, 'somejunk');
$i = 1;

while( $content = <fh> )
{
  if( $i <= 2 ) {
    $fileHash{($content)} = $i;
    $i++;
  }
  else {
    print "the count is $i\n";
    foreach $k1 ( sort keys (%fileHash) ) {
      print "key is $k1 and value is $fileHash{$k1}";
    }
    $i = 1;
    open(file, "b") || die "Unable to open the file <big>";
    while ( $rec = <file> ) {
        print "record is $rec\n";
        print "Got it:$rec" if exists $fileHash{$rec} ;
    }
    close(file);
    %fileHash = ();
    $fileHash{($content)} = $i;
  }
}
close(fh);

print "i val is $i\n";

open(file, "b") || die "Unable to open the file <big>";
while ( $rec = <file> ) {
  print "$rec" if exists $fileHash{$rec} ;
}
close(file);

%fileHash = ();

exit 0

Following are the sample files I had used.

Code:

>cat s
and then
code
adding
do
something extra here
i would
peculiardino

Code:

>cat b
adding
do
adding
code
do
i would
like to addd
do
i would
like to addd
more
with this i test
wow is that working
adding
and then
code
like to addd
more
with this i test
wow is that working

Code:

>perl file.pl | grep "Got it" | sort -u
Got it:adding
Got it:and then
Got it:code

Code:

>grep -f s b | sort -u
adding
and then
code
do
i would

What I am actually trying to do is emulate grep -f <file1> <file2> in Perl code.

I don't see anything wrong in the code.

a) The two outputs are not the same
b) Doing the equivalent of sort -u within the Perl code

Many thanks in advance
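One general difference to keep in mind when emulating grep -f with a hash: grep -f treats each pattern line as a regular expression that may match anywhere in a line, while a hash lookup only matches whole lines exactly. This is not necessarily the cause of the discrepancy above. The sketch below uses hypothetical demo files to show the difference; grep -F -x -f is the variant that behaves like an exact whole-line lookup.

```shell
#!/bin/sh
# Hypothetical demo files in a temp directory (assumption, not the files above).
dir=$(mktemp -d)
printf '%s\n' 'do' 'add' > "$dir/patterns"
printf '%s\n' 'do' 'adding' 'window' 'code' > "$dir/data"

# Default: each pattern is a regex that can match anywhere in the line.
regex_hits=$(grep -f "$dir/patterns" "$dir/data")

# -F: fixed strings, -x: the whole line must match (like a hash key lookup).
exact_hits=$(grep -F -x -f "$dir/patterns" "$dir/data")

printf 'regex:\n%s\n\nexact:\n%s\n' "$regex_hits" "$exact_hits"
```

Here the default form also reports "adding" and "window", since they contain "add" and "do", whereas the exact form reports only "do".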

kshkid · 12-20-2006, 12:23 PM

Or rather, with a slight modification:

Code:

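#! /opt/third-party/bin/perl

# Load every line of the small file into a hash for fast lookups.
open(fh, "small") || die "unable to open the file <small>";

%fileHash = (-100, 'somejunk');

$i = 1;
# Test for end of file with defined(); chomp's return value is 0 for a
# final line without a trailing newline, which would skip that line.
while( defined($content = <fh>) )
{
  chomp($content);
  $fileHash{$content} = $i;
  $i += 1;
}
close(fh);

# Print each line of the big file that also appears in the small file.
open(file, "big") || die "Unable to open the file <big>";
while ( defined($rec = <file>) ) {
  chomp($rec);
  print "\nMatch:$rec" if exists $fileHash{$rec} ;
}
close(file);

%fileHash = ();

exit 0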

But I am worried about whether I can create hashes that large.

I still need to figure out why the previous Perl implementation is not working.