[SOLVED] efficient shell script to compare contents of files
The problem with this code is that it takes quite a long time when the number of entries in file 1 and file 2 reaches the order of thousands. Any suggestions for more efficient execution would be greatly appreciated.
Thanks for the reply. Each line of file1 is compared with each line of file2. If line 1 of file2 (5 3 4) is completely contained in line 1 of file1, then position (1,1) of file3 is 1 (meaning yes, 5 3 4 is included in 12 11 10 9 8 7 5 4 3). The following 0 in 1 0 1 is the result of comparing 12 11 10 9 8 7 5 4 3 with 4 7 5 6 (no match, hence 0 at position (1,2)). The 1 at position (1,3) of file3 says that 5 10 is in 12 11 10 9 8 7 5 4 3. That is how the line 1 0 1 in file3 comes about. The 2nd line of file3 comes from comparing the 2nd line of file1 with the 1st, 2nd and 3rd lines of file2, the 3rd line of file3 comes from comparing the 3rd line of file1 with lines 1, 2 and 3 of file2, and so on.
I'm a bit confused. You said, "...If the line 1 of file 2 (5 3 4) is completely embed in line1 of file1 then in the first place (1,1) of file3 is 1 ...."
Well, "5 3 4" is not completely embedded in line 1 of file 1 (although "5 4 3" is), and yet the position (1,1) in file 3 is still a 1. Or do you mean if 5 AND 3 AND 4 are all, separately, identified in line 1 of file 1?
I'm sorry if my description was confusing; yes, 5, 3 and 4 are separate words, so to speak (the words are separated by spaces). So the set of 5, 3 and 4 is included in the set of 12, 11, 10, 9, 8, 7, 5, 4 and 3.
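To make the rule concrete, here is a minimal bash sketch of the set-inclusion test described above. The function name `is_subset` is hypothetical and not from the thread, and the sketch assumes words are separated by single spaces, as in the sample lines quoted so far.

```shell
#!/bin/bash
# is_subset HAYSTACK NEEDLES  (hypothetical helper, not from the thread)
# Prints 1 if every word in NEEDLES occurs as a whole word in HAYSTACK, else 0.
# Assumes single-space separators and words without glob characters.
is_subset() {
    local haystack=" $1 " word
    for word in $2; do              # unquoted on purpose: split into words
        case "$haystack" in
            *" $word "*) ;;         # whole-word hit, keep checking
            *) echo 0; return ;;    # one missing word is enough to fail
        esac
    done
    echo 1
}

line1="12 11 10 9 8 7 5 4 3"
is_subset "$line1" "5 3 4"      # prints 1
is_subset "$line1" "4 7 5 6"    # prints 0 (6 is missing)
is_subset "$line1" "5 10"       # prints 1
```

Padding the line with a space on each side is what keeps a word such as 1 from matching inside 12.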
Well, not sure if you must do it as a shell script, but if you are able to do it with Perl, then you might do something like the following. I get the same results as you do in file 3 when I run this script. A couple of things though.
a) the ~~ operator (smart-match operator) is only available in more recent versions of Perl, so on an older version you would need to amend that a bit.
b) Without testing it a bit more, I am not entirely sure whether the ~~ operator might count a "1" in file 2 as a match against a "12" (for example) in file 1 (which is not what you want, I know) ... in which case the smart-match line would need to be amended somehow anyway.
c) The script currently just prints to standard output, but can easily be amended to print to file 3
d) Rather than opening file 2 on each loop, it would be faster to read it in once, into a hash, and then loop the hash values ... but that shouldn't be too difficult either.
Code:
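#!/usr/bin/perl
#
use strict;
# Compare every line of file1.txt against every line of file2.txt.
open(F1,'file1.txt') or die "Cannot open file1.txt: $!";
while (my $f1line=<F1>)
{
    chomp($f1line);
    my @f1nums=split(' ',$f1line);
    # file2.txt is reopened for each line of file1.txt (see point d).
    open(F2,'file2.txt') or die "Cannot open file2.txt: $!";
    while(my $f2line=<F2>)
    {
        chomp($f2line);
        my @f2nums=split(' ',$f2line);
        my $flag=1;
        for (@f2nums)
        {
            # One word missing from the file1 line is enough to fail.
            if (!($_~~@f1nums))
            {
                $flag=0;
                last;
            }
        }
        print "$flag\t";
    }
    close(F2);
    print "\n";
}
close(F1);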
Anyway, not sure if that is in any way helpful, but it was an interesting little problem :-)
Last edited by cheddarcheese; 09-05-2011 at 07:58 AM.
Thanks for the reply and the code. A shell script is not a must for me; I used one only because it copes well with unsorted files containing tens of thousands of words. BTW, your script looks attractive. I'll see whether it works well for long files. Thanks also for the tip about hashing file2.
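For anyone following along, the idea of reading file 2 only once and using a hash for lookups could be sketched as below. This is only an illustration under assumptions: it swaps Perl for awk called from a shell function, the name `compare_files` is hypothetical, and it assumes file 2 is non-empty (the `NR == FNR` idiom relies on that).

```shell
#!/bin/bash
# compare_files FILE1 FILE2  (hypothetical name, not from the thread)
# Prints one row per FILE1 line and one 0/1 column per FILE2 line.
# Usage: compare_files file1.txt file2.txt > file3.txt
compare_files() {
    awk '
        # First file on the command line is FILE2: keep its lines in memory.
        # Assumes FILE2 is not empty, otherwise NR == FNR also holds for FILE1.
        NR == FNR { query[++nq] = $0; next }

        # Every FILE1 line: index its words in a hash, then test each query.
        {
            split("", seen)                       # empty the hash
            for (i = 1; i <= NF; i++) seen[$i] = 1
            row = ""
            for (q = 1; q <= nq; q++) {
                n = split(query[q], w, " ")       # words of this FILE2 line
                hit = 1
                for (j = 1; j <= n; j++)
                    if (!(w[j] in seen)) { hit = 0; break }
                row = row (q > 1 ? " " : "") hit
            }
            print row
        }
    ' "$2" "$1"
}
```

Because each word is looked up as a whole hash key, 1 cannot match 12, and only one process is started no matter how many lines the files have.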
Sure, no worries ... But do take note of point (b) which I mentioned, because I'm really not 100% certain how the smart-match operator (~~) performs matches such as the one I suggested. You obviously only want a "1" to match against another "1", and not a "10", "11" or "12", for example. It may work the way you want, I'm just not certain. (although that line can probably be replaced with some other regex if required). But, anyway, you get the idea I'm sure ... Good luck!
Last edited by cheddarcheese; 09-05-2011 at 09:13 AM.
An interesting exercise for which I came up with this bash solution.
Code:
#!/bin/bash
rm -f OUTPUT.TXT    # -f: no error if the file does not exist yet
while read -r LINE1 ; do
    while read -r -a SEARCHTERMS ; do
        for (( i=1;i <= ${#SEARCHTERMS[@]};i++ )); do
            # -w matches whole words only, so "1" does not match "12"
            if echo "$LINE1" | grep -qw -- "${SEARCHTERMS[i-1]}" ; then
                if [ "$i" -eq "${#SEARCHTERMS[@]}" ] ; then
                    echo -n "1 " >> OUTPUT.TXT;
                fi
            else
                echo -n "0 " >> OUTPUT.TXT;
                break;
            fi
        done
    done < FILE2
    echo "" >> OUTPUT.TXT
done < FILE1
Avoiding the use of sort should help with speed. I am also confident that a real bash expert could improve this!
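Whether grep matches substrings or whole words matters for this problem, since a search term of 1 must not be found in a line that only contains 12. A small sketch of the difference, using grep's `-w` (whole word) and `-F` (fixed string) options; the helper names and the sample line are made up for illustration:

```shell
#!/bin/bash
# Hypothetical helpers, not from the thread.
# contains_substring LINE TERM: succeeds if TERM appears anywhere in LINE.
contains_substring() { printf '%s\n' "$1" | grep -qF -- "$2"; }
# contains_word LINE TERM: succeeds only if TERM appears as a whole word.
contains_word()      { printf '%s\n' "$1" | grep -qwF -- "$2"; }

line="12 11 10"    # made-up sample; there is no standalone "1" in it

if contains_substring "$line" 1; then echo "substring: match"; else echo "substring: no match"; fi
if contains_word "$line" 1;      then echo "word: match";      else echo "word: no match";      fi
# prints:
#   substring: match
#   word: no match
```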
Well I am not sure if improved, but at least slightly different:
Code:
exec > FILE3
while read -r line
do
    while read -a arr
    do
        ans=1
        for i in ${arr[*]}
        do
            reg="\<$i\>"
            [[ "$line" =~ $reg ]] || ans=0
        done
        echo -n "$ans "
    done<FILE2
    echo
done<FILE1
@grail - Very neat! Thanks for the instructive demonstration on the use of shell binary operator with a regular expression. Obviously much cleaner than calling grep.
Perhaps retaining the break in the innermost loop would be more efficient. It only takes one missing match to cause a zero to be written to the output file, so there is no need to test all possibilities.
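As a rough sketch of that suggestion, the regex loop can be wrapped in a function with the break restored. The function name `match_rows` and its two file arguments are assumptions for illustration; the `\<` and `\>` word-boundary markers are taken from the script above and depend on the regex library bash is built against (they work with GNU libc).

```shell
#!/bin/bash
# match_rows FILE1 FILE2  (hypothetical name)
# Same 0/1 matrix as above, but stops testing a FILE2 line at the first miss.
match_rows() {
    local line ans i reg
    local -a arr
    while read -r line; do
        while read -r -a arr; do
            ans=1
            for i in "${arr[@]}"; do
                reg="\<$i\>"              # word boundaries (GNU extension)
                if ! [[ $line =~ $reg ]]; then
                    ans=0
                    break                 # one miss decides the result
                fi
            done
            echo -n "$ans "
        done < "$2"
        echo
    done < "$1"
}

# Usage: match_rows FILE1 FILE2 > FILE3
```

The saving grows with the length of the file 2 lines, because a miss on the first word skips all remaining regex tests for that line.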