LinuxQuestions.org


perluser59 05-22-2008 02:25 AM

Perl Script to select common lines in two files.
 
Hi,

I have two files, file1 and file2, and I wish to extract the lines in file2 that have a match in file1.

file1:

aaa
ccc
ddd
fff

file2:

aaa val=123
bbb val=345
ccc val=789
eee val=980
ggg val=901

I want to match the entire line of file1 against the first column of file2 (that is, everything before val=), and if there is a match, print the entire file2 line. If a file1 line is not in file2, or a file2 first column is not in file1, that line is omitted from the output.

In the above example, the output would be:

aaa val=123
ccc val=789

thanks.

scoban 05-22-2008 03:26 AM

Code:

egrep -f file1 file2

will do what you want. You can run it from Perl using system().

syg00 05-22-2008 05:18 AM

Maybe it's just me, but whenever I see a demand for a solution with such specific requirements (including the tool), I think "homework".

scoban has provided a (better) answer - why must it be perl ???.

perluser59 05-22-2008 10:32 AM

egrep -f file1 file2 doesn't do it for me
 
I had simplified the problem significantly in my example. When I try this on my sample data, I get:

grep: Invalid back reference

I suspect some of the lines in file1 are interpreted as "patterns", and the strings do not follow the normal grep-patterns. What I want is something that treats each line in file1 as an uninterpreted string, much like how 'comm -3 file1 file2' would do.

osor 05-22-2008 12:47 PM

Quote:

Originally Posted by perluser59 (Post 3161466)
What I want is something that treats each line in file1 as an uninterpreted string

You’re in luck! The tool is called fgrep (or grep -F) for fixed strings.
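A small sketch of the difference, under stated assumptions: the helper name fixed_match and the sample files keys.txt and data.txt are made up for illustration. The sample key a\1b is the kind of line a regex engine may reject as a back reference with no group, which is probably what triggered the error quoted above, while -F simply looks for those four characters.

```shell
# fixed_match KEYS DATA: print lines of DATA that contain any line of KEYS,
# treating every key as a literal string (-F) instead of a regex.
fixed_match() {
    grep -F -f "$1" "$2"
}

# Hypothetical sample data: the key contains a backslash sequence that a
# regex engine would try to interpret.
workdir=$(mktemp -d)
printf '%s\n' 'a\1b' > "$workdir/keys.txt"
printf '%s\n' 'a\1b val=1' 'a1b val=2' > "$workdir/data.txt"

fixed_match "$workdir/keys.txt" "$workdir/data.txt"   # prints: a\1b val=1

rm -rf "$workdir"
```

Note that -F still matches the key anywhere in the line, not only in the first column.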

derzok 05-22-2008 01:26 PM

If the lines are in any sort of order you can do this in O(n). Increase the line counter in file1 when the current line of file2 is larger than the current line of file1 (and vice versa). This is the merge step from merge sort.

http://en.wikipedia.org/wiki/Merge_sort#Algorithm

If it's not sorted, a naive line-by-line comparison is O(n^2).
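As a rough sketch of that merge idea without hand-writing the loop: the standard join utility walks two sorted files in lockstep and pairs lines whose first fields are equal. The function name common_sorted is hypothetical, and the sketch assumes keys are single words with no blanks, data lines have no leading blanks, and fields are blank-separated. The sort step costs O(n log n) on its own if the files are not already ordered.

```shell
# common_sorted KEYS DATA: print DATA lines whose first field equals a line
# of KEYS, by sorting both inputs and merging them with join.
common_sorted() {
    keys_sorted=$(mktemp) || return 1
    # Both inputs must be sorted the same way; LC_ALL=C gives plain byte order.
    LC_ALL=C sort -u "$1" > "$keys_sorted"
    # "-" makes join read the sorted DATA from standard input.
    LC_ALL=C sort -k1,1 "$2" | LC_ALL=C join "$keys_sorted" -
    rm -f "$keys_sorted"
}
```

Usage would be common_sorted file1 file2. Two caveats: the output comes out in sorted order rather than the original file2 order, and join rewrites the field separators as single spaces.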

perluser59 05-22-2008 03:33 PM

fgrep -f file1 file2 works.

However, I'm still looking to do this with a perl script, since file1 will have 7000 lines and file2 will have 6 million lines. I can sort both file1 and file2, so perl script can skip lines in file2, and perform it in O(n).

thanks

perluser59 05-22-2008 03:36 PM

New problem:

fgrep -f file1 file2

is not what I want, because each pattern from file1 is searched for everywhere in the line, so the following is a failure scenario.

file1:
aaa
bbb

file2:
aaa 111
ccc bbb

You get output:

aaa 111
ccc bbb

but the second line should not be present, as the match is restricted to just the first field.

chrism01 05-22-2008 06:33 PM

This seems to work for me, assuming fields are 1 space apart

Code:

#!/usr/bin/perl -w
use strict;

my (
    $f1, $f2, @patterns, %patts, $f2_rec, $f2_field
    );

$f1 = $ARGV[0];
$f2 = $ARGV[1];
open(PATT,"<", "$f1") or die "Unable to open $f1: $!\n";
@patterns = <PATT>;
chomp(@patterns);
close(PATT) or die "Unable to close $f1: $!\n";
@patts{@patterns} = (1) x @patterns;

open(F2,"<", "$f2") or die "Unable to open $f2: $!\n";
while ( defined ( $f2_rec = <F2> ) )
{
    chomp $f2_rec;                # newline
    $f2_field = (split(/ /, $f2_rec))[0];
    if( exists($patts{$f2_field}) )
    {
        print "$f2_rec\n";
    }
}
close(F2) or die "Unable to close $f2: $!\n";


perluser59 05-22-2008 11:58 PM

Chrism01,

Thanks very much. This is exactly what I wanted. One minor error it reports after completing is:

Use of uninitialized value in exists at extract_common_lines.pl line 21, <F2> line 7.


It refers to line:
if( exists($patts{$f2_field}) )

Is it patts that is not initialized?

perluser59 05-23-2008 12:08 AM

I guess if file2 has a blank line (the last line of file2 was a blank),
$f2_field is undefined. So, correcting that line to:

if( defined($f2_field) && exists($patts{$f2_field}) )

fixes the error. Thanks, again.

chrism01 05-23-2008 01:13 AM

Yeah, you always have to check your data files carefully... Blank lines often appear at the end of hand-edited files.

angrybanana 05-26-2008 02:19 AM

Dunno if this is faster or not, but awk might be worth trying:
Code:

awk 'NR==FNR{a[$1];next} ($1 in a)' file1 file2
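For anyone puzzling over the one-liner, here is an unpacked sketch of the same hash-lookup approach as the Perl script. The wrapper name first_field_match is hypothetical, and it stores $0 rather than $1 so that the entire file1 line is the key, which is my reading of the original question rather than something the one-liner does.

```shell
# first_field_match KEYS DATA: print DATA lines whose first
# whitespace-separated field equals a whole line of KEYS.
first_field_match() {
    awk '
        # While reading the first file (NR equals FNR only there),
        # remember each whole line as a lookup key and skip to the next line.
        NR == FNR { keys[$0]; next }
        # While reading the second file, print the line only if its
        # first field is one of the stored keys.
        ($1 in keys)
    ' "$1" "$2"
}
```

Called as first_field_match file1 file2, this needs no sorting and keeps the file2 order. A blank line in file2 has an empty first field, so it is skipped quietly unless file1 also contains a blank line.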


All times are GMT -5.