Old 05-22-2008, 03:25 AM   #1
perluser59
Perl Script to select common lines in two files.


Hi,

I have two files, file1 and file2, and I wish to extract the lines in file2 that have a match in file1.

file1:

aaa
ccc
ddd
fff

file2:

aaa val=123
bbb val=345
ccc val=789
eee val=980
ggg val=901

I want to match each entire line of file1 against the first column of file2 (everything before val=), and if there is a match, print the entire file2 line. If a file1 line is not in file2, or a file2 first column is not in file1, that line is omitted from the output.

In the above example, the output would be:

aaa val=123
ccc val=789

thanks.
 
Old 05-22-2008, 04:26 AM   #2
scoban
Code:
egrep -f file1 file2
will do what you want. You can run it from Perl using system().
 
Old 05-22-2008, 06:18 AM   #3
syg00
Maybe it's just me, but whenever I see a demand for a solution with such specific requirements (including the tool), I think "homework".

scoban has provided a (better) answer. Why must it be Perl?
 
Old 05-22-2008, 11:32 AM   #4
perluser59
egrep -f file1 file2 doesn't do it for me

I had simplified the problem significantly in my example. When I try this on my sample data, I get:

grep: Invalid back reference

I suspect some of the lines in file1 are being interpreted as patterns, and those strings are not valid grep patterns. What I want is something that treats each line in file1 as an uninterpreted string, much as 'comm -3 file1 file2' does.
 
Old 05-22-2008, 01:47 PM   #5
osor
Quote:
Originally Posted by perluser59
What I want is something that treats each line in file1 as an uninterpreted string
You’re in luck! The tool is called fgrep (or grep -F) for fixed strings.
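
A minimal sketch of the difference between regex and fixed-string matching, assuming a grep that supports -E, -F and -f. The helper name match_fixed and the sample file contents are made up for illustration; they are not from the original data.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of file $2 that contain any line of
# file $1 as a literal substring (no regex interpretation).
match_fixed() {
    grep -F -f "$1" "$2"
}

# Throwaway demo files (illustrative contents only).
# mktemp -d is assumed to be available; it is common but not required by POSIX.
tmp=$(mktemp -d)
printf '%s\n' 'a.c' > "$tmp/file1"
printf '%s\n' 'a.c val=1' 'abc val=2' > "$tmp/file2"

# Regex matching: "." matches any character, so both lines are printed.
grep -E -f "$tmp/file1" "$tmp/file2"

# Fixed-string matching: only the line containing a literal "a.c" is printed.
match_fixed "$tmp/file1" "$tmp/file2"

rm -rf "$tmp"
```

Note that fixed-string matching still looks for the string anywhere in the line, not only in the first column.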
 
Old 05-22-2008, 02:26 PM   #6
derzok
If the lines are sorted, you can do this in O(n). Advance the line counter in file1 when the current line of file2 is larger than the current line of file1 (and vice versa). This is the merge step from merge sort.

http://en.wikipedia.org/wiki/Merge_sort#Algorithm

If it's not sorted, a naive comparison of every line against every other line is O(n^2).
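
One way to try the sorted-merge idea without writing the merge loop by hand is the standard sort and join tools; join walks two sorted inputs in a single pass. This is only a sketch under assumptions: the function name common_sorted is hypothetical, each key in file1 is a single token, fields in file2 are separated by blanks, and neither file contains blank lines. Sorting itself costs O(n log n); only the merge pass is linear.

```shell
#!/bin/sh
# Hypothetical helper: common_sorted KEYS DATA
# Prints the DATA lines whose first field equals a line of KEYS.
common_sorted() {
    keys_sorted=$(mktemp)
    data_sorted=$(mktemp)

    # Sort both inputs on the first field in the C locale so that
    # sort and join agree on the ordering.
    LC_ALL=C sort -u -k 1b,1 "$1" > "$keys_sorted"
    LC_ALL=C sort -k 1b,1 "$2" > "$data_sorted"

    # join performs the merge: it advances through both sorted files once
    # and prints a line for every key present in both.
    LC_ALL=C join "$keys_sorted" "$data_sorted"

    rm -f "$keys_sorted" "$data_sorted"
}

# Usage: common_sorted file1 file2
```

The output comes out in sorted order rather than in the original order of file2, and join rewrites the field separators as single spaces.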
 
Old 05-22-2008, 04:33 PM   #7
perluser59
fgrep -f file1 file2 works.

However, I'm still looking to do this with a Perl script, since file1 will have 7000 lines and file2 will have 6 million lines. I can sort both file1 and file2, so a Perl script can skip lines in file2 and do the job in O(n).

thanks
 
Old 05-22-2008, 04:36 PM   #8
perluser59
New problem:

fgrep -f file1 file2

is not what I want, because each pattern in file1 is searched for everywhere in the line. The following is a failure scenario.

file1:
aaa
bbb

file2:
aaa 111
ccc bbb

You get output:

aaa 111
ccc bbb

but the second line should not be present, as the match should be restricted to just the first field.
 
Old 05-22-2008, 07:33 PM   #9
chrism01
This seems to work for me, assuming fields are separated by a single space.

Code:
#!/usr/bin/perl -w
use strict;

my (
    $f1, $f2, @patterns, %patts, $f2_rec, $f2_field
    );

$f1 = $ARGV[0];                    # file of keys, one per line
$f2 = $ARGV[1];                    # file to filter
open(PATT,"<", "$f1") or die "Unable to open $f1: $!\n";
@patterns = <PATT>;                # read all keys into memory
chomp(@patterns);
close(PATT) or die "Unable to close $f1: $!\n";
@patts{@patterns} = (1) x @patterns;   # hash slice: one entry per key

open(F2,"<", "$f2") or die "Unable to open $f2: $!\n";
while ( defined ( $f2_rec = <F2> ) )
{
    chomp $f2_rec;                 # strip trailing newline
    $f2_field = (split(/ /, $f2_rec))[0];   # first space-separated field
    if( exists($patts{$f2_field}) )
    {
        print "$f2_rec\n";
    }
}
close(F2) or die "Unable to close $f2: $!\n";
 
Old 05-23-2008, 12:58 AM   #10
perluser59
Chrism01,

Thanks very much. This is exactly what I wanted. One minor error it reports after completing is:

Use of uninitialized value in exists at extract_common_lines.pl line 21, <F2> line 7.


It refers to line:
if( exists($patts{$f2_field}) )

Is it patts that is not initialized?
 
Old 05-23-2008, 01:08 AM   #11
perluser59
I guess if file2 has a blank line (the last line of file2 was a blank),
$f2_field is undefined. So, correcting that line to:

if( defined($f2_field) && exists($patts{$f2_field}) )

fixes the error. Thanks, again.
 
Old 05-23-2008, 02:13 AM   #12
chrism01
Yeah, you always have to check your data files carefully... Blank lines often appear at the end of hand-edited files.
 
Old 05-26-2008, 03:19 AM   #13
angrybanana
Dunno if this is faster or not, but awk might be worth trying:
Code:
awk 'NR==FNR{a[$1];next} ($1 in a)' file1 file2
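
The one-liner above can be spelled out with comments to show what each clause does. This is a sketch rather than a tuned solution: the wrapper name first_field_match is hypothetical, and it keys on the whole file1 line ($0) instead of $1, on the assumption that each file1 line is meant to match as a whole.

```shell
#!/bin/sh
# Hypothetical wrapper: first_field_match KEYS DATA
# Prints the DATA lines whose first whitespace-separated field is a line of KEYS.
first_field_match() {
    awk '
        # While reading the first file, the per-file line number (FNR)
        # equals the overall line number (NR): store the line as a key.
        FNR == NR { keys[$0]; next }

        # For the second file: print the line if its first field is a stored key.
        ($1 in keys)
    ' "$1" "$2"
}

# Usage: first_field_match file1 file2
```

Like the Perl script, this holds the keys from file1 in a hash and streams file2 once, so neither file needs to be sorted and a blank line in file2 simply fails the lookup.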
 
  

