05-22-2008, 02:25 AM
|
#1
|
|
LQ Newbie
Registered: May 2008
Posts: 6
Rep:
|
Perl Script to select common lines in two files.
Hi,
I have two files, file1 and file2, and I wish to extract the lines in file2 that have a match in file1.
file1:
aaa
ccc
ddd
fff
file2:
aaa val=123
bbb val=345
ccc val=789
eee val=980
ggg val=901
I want to match the entire line of file1 against the first column of file2 (or everything up to val=), and if there is a match, print the entire file2 line. If a file1 line is not in file2, or a file2 first column is not in file1, the line is omitted from the output.
In the above example, the output would be:
aaa val=123
ccc val=789
thanks.
|
|
|
|
05-22-2008, 03:26 AM
|
#2
|
|
Member
Registered: Nov 2004
Location: Turkey
Distribution: Slackware
Posts: 145
Rep:
|
Code:
egrep -f file1 file2
will do what you want. You can run it from Perl using system().
|
|
|
|
05-22-2008, 05:18 AM
|
#3
|
|
LQ Veteran
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 11,223
|
Maybe it's just me, but whenever I see a demand for a solution with such specific requirements (including the tool), I think "homework".
scoban has provided a (better) answer. Why must it be Perl?
|
|
|
|
05-22-2008, 10:32 AM
|
#4
|
|
LQ Newbie
Registered: May 2008
Posts: 6
Original Poster
Rep:
|
egrep -f file1 file2 doesn't do it for me
I had simplified the problem significantly in my example. When I try this on my sample data, I get:
grep: Invalid back reference
I suspect some of the lines in file1 are being interpreted as patterns, although the strings were never meant to be grep patterns. What I want is something that treats each line in file1 as an uninterpreted string, much like how 'comm -3 file1 file2' would do.
|
|
|
|
05-22-2008, 12:47 PM
|
#5
|
|
HCL Maintainer
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450
Rep:
|
Quote:
Originally Posted by perluser59
What I want is something that treats each line in file1 as an uninterpreted string
|
You’re in luck! The tool is called fgrep (or grep -F) for fixed strings.
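For anyone following along, here is a small sketch of the difference, assuming a POSIX grep. The file names keys.txt and data.txt and their contents are invented for illustration; the x\1y line is the kind of string that can upset grep when it is read as a regular expression.

```shell
#!/bin/sh
# Hypothetical sample data; keys.txt and data.txt are made-up names.
tmp=$(mktemp -d)
printf '%s\n' 'a.c' 'x\1y' > "$tmp/keys.txt"
printf '%s\n' 'a.c val=1' 'abc val=2' 'x\1y val=3' > "$tmp/data.txt"

# -F: treat every line of keys.txt as a literal string, not a regex.
# -f: read the strings from a file, one per line.
fixed_matches=$(grep -F -f "$tmp/keys.txt" "$tmp/data.txt")

# Only the "a.c" and "x\1y" lines are printed; "abc val=2" is not,
# because the dot is no longer a wildcard.
printf '%s\n' "$fixed_matches"
rm -rf "$tmp"
```

Note that -F only changes how the strings are interpreted. A string is still matched anywhere in the line, not just in the first column.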
|
|
|
|
05-22-2008, 01:26 PM
|
#6
|
|
Member
Registered: Aug 2004
Location: Ohio
Distribution: Debian, Slackware
Posts: 58
Rep:
|
If the lines are in any sort of order you can do this in O(n). Advance the line counter in file1 when the current line of file2 is larger than the current line of file1 (and vice versa). This is the merge function from merge sort.
http://en.wikipedia.org/wiki/Merge_sort#Algorithm
If it's not sorted, comparing every line of file1 against every line of file2 is O(n^2).
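A rough sketch of that merge step, written in awk rather than Perl and using the sample data from the first post. It assumes both files are already sorted byte-wise on the key and that the key is the first whitespace-separated field of file2; the variable names keys, key and have_key are my own.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Sample data from the first post; both files are already in sorted order.
printf '%s\n' aaa ccc ddd fff > "$tmp/file1"
printf '%s\n' 'aaa val=123' 'bbb val=345' 'ccc val=789' 'eee val=980' 'ggg val=901' > "$tmp/file2"

merged=$(LC_ALL=C awk -v keys="$tmp/file1" '
    # Load the first key; have_key becomes 0 once file1 is exhausted.
    BEGIN { have_key = ((getline key < keys) > 0) }
    {
        # Advance in file1 while its current key sorts before this field.
        # Appending "" forces a string comparison rather than a numeric one.
        while (have_key && (key "") < ($1 ""))
            have_key = ((getline key < keys) > 0)
        # Equal keys: this file2 line is a match. Otherwise move on in file2.
        if (have_key && (key "") == ($1 "")) print
    }
' "$tmp/file2")

printf '%s\n' "$merged"
rm -rf "$tmp"
```

Each file is read once from top to bottom, which is where the O(n) comes from.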
|
|
|
|
05-22-2008, 03:33 PM
|
#7
|
|
LQ Newbie
Registered: May 2008
Posts: 6
Original Poster
Rep:
|
fgrep -f file1 file2 works.
However, I'm still looking to do this with a Perl script, since file1 will have 7000 lines and file2 will have 6 million lines. I can sort both file1 and file2, so that a Perl script can skip lines in file2 and do it in O(n).
thanks
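As a point of comparison for the sort-then-scan idea, here is a sketch using the standard sort and join utilities instead of Perl. It assumes the key is the first whitespace-separated field and that file1 lines contain no spaces; the sample data is deliberately unsorted, and the .sorted file names are placeholders.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Unsorted sample data based on the first post.
printf '%s\n' fff aaa ddd ccc > "$tmp/file1"
printf '%s\n' 'ggg val=901' 'ccc val=789' 'aaa val=123' 'eee val=980' 'bbb val=345' > "$tmp/file2"

# Sort both inputs on the first field with the same byte-wise collation
# that join will use.
LC_ALL=C sort -k1,1 "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort -k1,1 "$tmp/file2" > "$tmp/file2.sorted"

# join pairs lines whose first fields are equal, in one pass over each file.
joined=$(LC_ALL=C join "$tmp/file1.sorted" "$tmp/file2.sorted")

printf '%s\n' "$joined"
rm -rf "$tmp"
```

The scan itself is linear, but the sorts are O(n log n), and join writes its output fields separated by single spaces, so runs of whitespace in file2 are not preserved.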
|
|
|
|
05-22-2008, 03:36 PM
|
#8
|
|
LQ Newbie
Registered: May 2008
Posts: 6
Original Poster
Rep:
|
New problem:
fgrep -f file1 file2
is not what I want, because each pattern from file1 is searched for everywhere in the line, so the following is a failure scenario.
file1:
aaa
bbb
file2:
aaa 111
ccc bbb
You get output:
aaa 111
ccc bbb
but the second line should not be present, as the match is restricted to just the first field.
|
|
|
|
05-22-2008, 06:33 PM
|
#9
|
|
Guru
Registered: Aug 2004
Location: Brisbane
Distribution: Centos 6.4, Centos 5.9
Posts: 14,973
|
This seems to work for me, assuming fields are separated by a single space.
Code:
#!/usr/bin/perl -w
use strict;
my (
    $f1, $f2, @patterns, %patts, $f2_rec, $f2_field
);
# Command-line arguments: the key file, then the data file.
$f1 = $ARGV[0];
$f2 = $ARGV[1];
# Read every line of file1 and use the lines as hash keys for fast lookup.
open(PATT,"<", "$f1") or die "Unable to open $f1: $!\n";
@patterns = <PATT>;
chomp(@patterns);
close(PATT) or die "Unable to close $f1: $!\n";
@patts{@patterns} = (1) x @patterns;
# Print each file2 line whose first space-separated field is a key.
open(F2,"<", "$f2") or die "Unable to open $f2: $!\n";
while ( defined ( $f2_rec = <F2> ) )
{
    chomp $f2_rec; # strip trailing newline
    $f2_field = (split(/ /, $f2_rec))[0];
    if( exists($patts{$f2_field}) )
    {
        print "$f2_rec\n";
    }
}
close(F2) or die "Unable to close $f2: $!\n";
|
|
|
|
05-22-2008, 11:58 PM
|
#10
|
|
LQ Newbie
Registered: May 2008
Posts: 6
Original Poster
Rep:
|
Chrism01,
Thanks very much. This is exactly what I wanted. One minor error it reports after completing is:
Use of uninitialized value in exists at extract_common_lines.pl line 21, <F2> line 7.
It refers to line:
if( exists($patts{$f2_field}) )
Is it patts that is not initialized?
|
|
|
|
05-23-2008, 12:08 AM
|
#11
|
|
LQ Newbie
Registered: May 2008
Posts: 6
Original Poster
Rep:
|
I guess if file2 has a blank line (the last line of my file2 was blank), $f2_field is undefined. So, correcting that line to:
if( defined($f2_field) && exists($patts{$f2_field}) )
fixes the error. Thanks, again.
|
|
|
|
05-23-2008, 01:13 AM
|
#12
|
|
Guru
Registered: Aug 2004
Location: Brisbane
Distribution: Centos 6.4, Centos 5.9
Posts: 14,973
|
Yeah, you always have to check your data files carefully... Blank lines often appear at the end of hand-edited files.
|
|
|
|
05-26-2008, 02:19 AM
|
#13
|
|
Member
Registered: Oct 2003
Distribution: Archlinux
Posts: 147
Rep:
|
Dunno if this is faster or not, but awk might be worth trying:
Code:
awk 'NR==FNR{a[$1];next} ($1 in a)' file1 file2
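In case the one-liner is hard to read, here is the same idea spelled out with comments and run against the failure case from earlier in the thread, plus a trailing blank line. The NF checks and the array name seen are my additions, and like the one-liner it keys on the first field of file1, so it assumes file1 lines contain no whitespace.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# The fgrep failure case from earlier in the thread, plus a blank last line.
printf '%s\n' aaa bbb > "$tmp/file1"
printf '%s\n' 'aaa 111' 'ccc bbb' '' > "$tmp/file2"

first_field_matches=$(awk '
    # NR == FNR holds only while the first file is being read:
    # remember each key, then skip to the next line.
    NR == FNR { if (NF) seen[$1]; next }
    # Second file: print non-blank lines whose first field is a stored key.
    NF && ($1 in seen)
' "$tmp/file1" "$tmp/file2")

# Prints only "aaa 111"; "ccc bbb" is skipped because bbb is not in column 1.
printf '%s\n' "$first_field_matches"
rm -rf "$tmp"
```

The lookup is a hash test per line of file2, so neither file needs to be sorted.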
|
|
|
|
All times are GMT -5.