
perluser59 05-22-2008 03:25 AM

Perl Script to select common lines in two files.

I have two files, file1 and file2, and I wish to extract the lines in file2 that match a line in file1.




aaa val=123
bbb val=345
ccc val=789
eee val=980
ggg val=901

I want to match the entire line of file1 against the first column of file2 (that is, the text before val=), and if there is a match, print the entire file2 line. If a file1 line is not in file2, or a file2 first column is not in file1, the line is omitted from the output.

In the above example, the output would be:

aaa val=123
ccc val=789


scoban 05-22-2008 04:26 AM


egrep -f file1 file2
will do what you want. You can run it from Perl using system().
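
A minimal sketch of running that command from Perl, assuming grep is on the PATH. The wrapper name run_grep is hypothetical (it is not from the thread), and grep -E is used here as the equivalent of egrep.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper (not from the thread): run grep -E -f on two files.
# The list form of system() bypasses the shell, so file names containing
# spaces or shell metacharacters are passed to grep unchanged.
sub run_grep {
    my ($pattern_file, $data_file) = @_;
    my $rc = system('grep', '-E', '-f', $pattern_file, $data_file);
    die "Could not run grep: $!\n" if $rc == -1;
    # grep exit status: 0 = at least one match, 1 = no match, 2 = error
    return $rc >> 8;
}

# Only act as a command-line tool when two file names are supplied.
if (@ARGV == 2) {
    exit run_grep(@ARGV);
}
```

The matching lines go straight to standard output, exactly as if grep had been run by hand; the return value only reports whether anything matched.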

syg00 05-22-2008 06:18 AM

Maybe it's just me, but whenever I see a demand for a solution with such specific requirements (including the tool), I think "homework".

scoban has provided a (better) answer. Why must it be Perl?

perluser59 05-22-2008 11:32 AM

egrep -f file1 file2 doesn't do it for me.
I had simplified the problem significantly in my example. When I try this on my sample data, I get:

grep: Invalid back reference

I suspect some of the lines in file1 are being interpreted as patterns, and those strings are not valid grep patterns. What I want is something that treats each line in file1 as an uninterpreted string, much like how 'comm -3 file1 file2' would do.

osor 05-22-2008 01:47 PM


Originally Posted by perluser59 (Post 3161466)
What I want is something that treats each line in file1 as an uninterpreted string

You’re in luck! The tool is called fgrep (or grep -F) for fixed strings.

derzok 05-22-2008 02:26 PM

If the lines are in any sort of order you can do this in O(n). Advance the line counter in file1 when the current line of file2 is larger than the current line of file1 (and vice versa). This is the merge step from merge sort.

If it's not sorted, comparing every line of file1 against every line of file2 is O(n^2).
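
The merge idea can be sketched in Perl roughly as follows. This is only an illustration under stated assumptions: the sub name merge_match and its callback interface are hypothetical, file1 holds one key per line, fields in file2 are separated by a single space, and both inputs are already sorted in plain byte order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): merge-style matching.
#   $keys_fh : handle on file1, one key per line, sorted bytewise
#   $data_fh : handle on file2, sorted bytewise on its first field
#   $emit    : callback invoked with each matching file2 line
sub merge_match {
    my ($keys_fh, $data_fh, $emit) = @_;
    my $key = <$keys_fh>;
    my $rec = <$data_fh>;
    chomp $key if defined $key;
    while (defined $key && defined $rec) {
        (my $line = $rec) =~ s/\n\z//;        # copy without the newline
        my $field = (split / /, $line)[0];    # first space-separated field
        $field = '' unless defined $field;    # blank line in file2
        my $cmp = $key cmp $field;
        if ($cmp < 0) {
            # file1 is behind: advance to the next key
            $key = <$keys_fh>;
            chomp $key if defined $key;
        }
        else {
            # equal: report the line; either way advance file2 and keep
            # the key, since file2 may repeat the same first field
            $emit->($rec) if $cmp == 0;
            $rec = <$data_fh>;
        }
    }
    return;
}

# Only act as a command-line tool when two file names are supplied.
if (@ARGV == 2) {
    open my $k, '<', $ARGV[0] or die "Unable to open $ARGV[0]: $!\n";
    open my $d, '<', $ARGV[1] or die "Unable to open $ARGV[1]: $!\n";
    merge_match($k, $d, sub { print $_[0] });
}
```

Each file is read once and only one line of each is held in memory at a time. The sort order has to agree with Perl's cmp, so one option is to prepare the inputs with LC_ALL=C sort for file1 and LC_ALL=C sort -t ' ' -k1,1 for file2.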

perluser59 05-22-2008 04:33 PM

fgrep -f file1 file2 works.

However, I'm still looking to do this with a Perl script, since file1 will have 7000 lines and file2 will have 6 million lines. I can sort both file1 and file2, so a Perl script could skip lines in file2 and run in O(n).


perluser59 05-22-2008 04:36 PM

New problem:

fgrep -f file1 file2

is not what I want, because each pattern in file1 is searched for everywhere in the line, so the following is a failure scenario.


aaa 111
ccc bbb

You get output:

aaa 111
ccc bbb

but the second line should not be present, as the match is restricted to just the first field.

chrism01 05-22-2008 07:33 PM

This seems to work for me, assuming fields are separated by a single space:


#!/usr/bin/perl -w
use strict;

my (
    $f1, $f2, @patterns, %patts, $f2_rec, $f2_field
);

$f1 = $ARGV[0];
$f2 = $ARGV[1];
open(PATT, "<", $f1) or die "Unable to open $f1: $!\n";
@patterns = <PATT>;
chomp(@patterns);                        # strip newlines so keys match file2 fields
close(PATT) or die "Unable to close $f1: $!\n";
@patts{@patterns} = (1) x @patterns;     # hash slice: one key per file1 line

open(F2, "<", $f2) or die "Unable to open $f2: $!\n";
while ( defined ( $f2_rec = <F2> ) )
{
    chomp $f2_rec;                       # strip trailing newline
    $f2_field = (split(/ /, $f2_rec))[0];    # first space-separated field
    if( exists($patts{$f2_field}) )
    {
        print "$f2_rec\n";
    }
}
close(F2) or die "Unable to close $f2: $!\n";

perluser59 05-23-2008 12:58 AM


Thanks very much. This is exactly what I wanted. One minor warning it reports after completing is:

Use of uninitialized value in exists at line 21, <F2> line 7.

It refers to line:
if( exists($patts{$f2_field}) )

Is it patts that is not initialized?

perluser59 05-23-2008 01:08 AM

I guess if file2 has a blank line (the last line of file2 was a blank),
$f2_field is undefined. So, correcting that line to:

if( defined($f2_field) && exists($patts{$f2_field}) )

fixes the error. Thanks, again.
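
A minimal sketch of the same hash lookup with that guard in place, reading file1 line by line instead of slurping it. The sub names load_keys, first_field and print_matches are hypothetical (they are not from the thread), and fields are again assumed to be separated by a single space.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers (not from the thread).

# Read file1 into a hash, one key per non-blank line.
sub load_keys {
    my ($fh) = @_;
    my %seen;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';      # ignore blank lines in file1
        $seen{$line} = 1;
    }
    return \%seen;
}

# Return the first space-separated field, or undef for a blank line.
sub first_field {
    my ($rec) = @_;
    chomp $rec;
    my ($field) = split / /, $rec, 2;
    return $field;
}

# Print every file2 line whose first field is a key from file1.
sub print_matches {
    my ($keys, $in, $out) = @_;
    while (my $rec = <$in>) {
        my $field = first_field($rec);
        next unless defined $field;   # blank line: nothing to look up
        print {$out} $rec if exists $keys->{$field};
    }
    return;
}

# Only act as a command-line tool when two file names are supplied.
if (@ARGV == 2) {
    open my $f1, '<', $ARGV[0] or die "Unable to open $ARGV[0]: $!\n";
    open my $f2, '<', $ARGV[1] or die "Unable to open $ARGV[1]: $!\n";
    print_matches(load_keys($f1), $f2, \*STDOUT);
}
```

Because each file2 line costs one hash lookup, the work grows with the combined size of the two files, and neither file needs to be sorted.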

chrism01 05-23-2008 02:13 AM

Yeah, you always have to check your data files carefully... Blank lines often appear at the end of hand-edited files.

angrybanana 05-26-2008 03:19 AM

Dunno if this is faster or not, but awk might be worth trying:

awk 'NR==FNR{a[$1];next} ($1 in a)' file1 file2

All times are GMT -5.