[SOLVED] efficient shell script to compare contents of files

si-thk · 09-05-2011, 05:04 AM

Hi, I need to compare the lines of one file with those of another file. For example:

File1:
12 11 10 9 8 7 5 4 3
15 14 13 12 11 10 9 8 7 6 5 4 3
14 13 12 11 10 9 8 5
11 10 9 8 7
10 8 7 6 5 3

has to be compared with File2:
5 3 4
4 7 5 6
5 10

Giving an output File3:
1 0 1
1 1 1
0 0 1
0 0 0
0 0 1

I have written a shell script to do this, which works fine.

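#!/bin/sh

> file3
while read line
do
    > c.txt
    # One word of the current file1 line per line, used as the pattern list.
    echo $line | tr -s ' ' '\n' > line
    while read line1
    do
        echo $line1 | tr -s ' ' '\n' | sort -rn > line1
        # Keep the file2 words that match no file1 word.
        # -x forces whole-line matches, so a file1 word 5 does not match 15.
        fgrep -vxf line line1 > e.txt

        FILE=e.txt
        if [ -s $FILE ] ; then
            echo "0" >> c.txt
        else
            echo "1" >> c.txt
        fi
    done < file2
    cat c.txt | awk '{printf ($1 " " )}' >> file3
    echo >> file3
done < file1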

The problem with this code is that it takes quite a long time when the number of entries in file1 and file2 reaches the order of thousands. Any suggestions for more efficient execution would be greatly appreciated.

Thanks in advance.

grail · 09-05-2011, 05:12 AM

Could you perhaps first explain how the output is generated? I.e., how do 12 11 10 9 8 7 5 4 3 and 5 3 4 translate to 1 0 1?

si-thk · 09-05-2011, 06:00 AM

Thanks for the reply. Each line of file1 is compared with each line of file2. If line 1 of file2 (5 3 4) is completely embedded in line 1 of file1, then position (1,1) of file3 is 1 (meaning yes, 5 3 4 is included in 12 11 10 9 8 7 5 4 3). The next value, the 0 in 1 0 1, is the result of matching 12 11 10 9 8 7 5 4 3 against 4 7 5 6 (which fails, hence the 0 at position (1,2)). The 1 at position (1,3) of file3 says that 5 10 is in 12 11 10 9 8 7 5 4 3. That is how the line 1 0 1 in file3 comes about. The 2nd line of file3 comes from comparing the 2nd line of file1 with the 1st, 2nd and 3rd lines of file2, the 3rd line of file3 comes from comparing the 3rd line of file1 with lines 1, 2 and 3 of file2, and so on.

cheddarcheese · 09-05-2011, 07:03 AM

I'm a bit confused. You said, "...If the line 1 of file 2 (5 3 4) is completely embed in line1 of file1 then in the first place (1,1) of file3 is 1 ...."

Well, "5 3 4" is not completely embedded in line 1 of file 1 (although "5 4 3" is), and yet the position (1,1) in file 3 is still a 1. Or do you mean if 5 AND 3 AND 4 are all, separately, identified in line 1 of file 1?

si-thk · 09-05-2011, 07:27 AM

I'm sorry if my description was confusing. Yes, 5, 3 and 4 are separate words, so to speak (the words are separated by spaces). So the set {5, 3, 4} is included in the set {12, 11, 10, 9, 8, 7, 5, 4, 3}.
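
To make the rule concrete, here is a minimal bash sketch of the set-inclusion test for a single pair of lines. The function name is_subset is made up for illustration, and the sketch assumes space-separated words; it illustrates the rule and is not meant to be fast.

```shell
#!/bin/bash
# is_subset NEEDLES HAYSTACK   (hypothetical helper, for illustration only)
# Both arguments are space-separated word lists. Succeeds only if every
# word of NEEDLES also occurs in HAYSTACK (order and repetition ignored).
is_subset() {
    local missing
    # comm -23 prints the lines found only in its first input, i.e. the
    # words of NEEDLES that HAYSTACK lacks. Both inputs must be sorted.
    missing=$(LC_ALL=C comm -23 \
        <(tr -s ' ' '\n' <<< "$1" | LC_ALL=C sort -u) \
        <(tr -s ' ' '\n' <<< "$2" | LC_ALL=C sort -u))
    [ -z "$missing" ]
}

haystack="12 11 10 9 8 7 5 4 3"      # line 1 of file1
for needles in "5 3 4" "4 7 5 6" "5 10"; do
    if is_subset "$needles" "$haystack"; then printf '1 '; else printf '0 '; fi
done
echo                                  # completes the first row of file3: 1 0 1
```

Because whole words are compared, the order of the words within a line does not matter, which is why 5 3 4 counts as included even though the file1 line contains 5 4 3.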

cheddarcheese · 09-05-2011, 07:54 AM

Okay, I get it (I think).

Well, not sure if you must do it as a shell script, but if you are able to do it with Perl, then you might do something like the following. I get the same results as you do in file 3 when I run this script. A couple of things though.

a) the ~~ operator (smart-match operator) is only available in more recent versions of Perl (5.10 and later), so on an older version you would need to amend that a bit.
b) Without testing it a bit more, I'm not entirely sure whether the ~~ operator might report a "1" in file 2 as matching a "12" (for example) in file 1 (which is not what you want, I know) ... in which case the smart-match line would need to be amended somehow anyway.
c) The script currently just prints to standard output, but can easily be amended to print to file 3.
d) Rather than opening file 2 on each loop, it would be faster to read it in once, into a hash, and then loop the hash values ... but that shouldn't be too difficult either.

Code:

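#!/usr/bin/perl
#
use strict;

open(my $f1, '<', 'file1.txt') or die "Cannot open file1.txt: $!";

while (my $f1line = <$f1>)
{
    chomp($f1line);
    my @f1nums = split(' ', $f1line);

    # file2 is re-read for every line of file1 (see point d)
    open(my $f2, '<', 'file2.txt') or die "Cannot open file2.txt: $!";

    while (my $f2line = <$f2>)
    {
        chomp($f2line);
        my @f2nums = split(' ', $f2line);

        my $flag = 1;

        for (@f2nums)
        {
            # smart match: is this word equal to any element of @f1nums?
            if (!($_ ~~ @f1nums))
            {
                $flag = 0;
                last;
            }
        }

        print "$flag\t";
    }
    close($f2);
    print "\n";
}
close($f1);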

Anyway, not sure if that is in any way helpful, but it was an interesting little problem :-)
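
For comparison, the idea behind point (d), reading file 2 only once, can be sketched in bash as well. This is only a sketch under assumptions: it swaps the hash for a plain bash array, the function name subset_matrix is made up for illustration, and the words are assumed to be whitespace-separated tokens without shell wildcard characters.

```shell
#!/bin/bash
# subset_matrix FILE1 FILE2   (hypothetical helper, for illustration only)
# One output row per line of FILE1, one column per line of FILE2:
# 1 if every word of the FILE2 line occurs in the FILE1 line, else 0.
subset_matrix() {
    local -a patterns=() words=()
    local line hay pat word ok row

    # Read FILE2 a single time and keep its lines in memory.
    while read -r line; do
        patterns+=("$line")
    done < "$2"

    while read -r -a words; do
        hay=" ${words[*]} "              # pad so every word has a space on both sides
        row=""
        for pat in "${patterns[@]}"; do
            ok=1
            for word in $pat; do         # unquoted on purpose: split into words
                case "$hay" in
                    *" $word "*) ;;      # whole word present
                    *) ok=0; break ;;    # first missing word settles the column
                esac
            done
            row+="$ok "
        done
        printf '%s\n' "${row% }"
    done < "$1"
}
```

It would be called as subset_matrix file1 file2 > file3. The loops use only shell builtins, so no external process is started per comparison.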

si-thk · 09-05-2011, 08:35 AM

Thanks for the reply and the code. A shell script is not a must for me; I used one only because shell tools can handle unsorted files with tens of thousands of words quite well. BTW, your script looks attractive. I'll see whether it works well for long files. Thanks also for the tip about hashing file2.

cheddarcheese · 09-05-2011, 08:43 AM

Sure, no worries ... But do take note of point (b) which I mentioned, because I'm really not 100% certain how the smart-match operator (~~) performs matches such as the one I suggested. You obviously only want a "1" to match against another "1", and not a "10", "11" or "12", for example. It may work the way you want, I'm just not certain. (although that line can probably be replaced with some other regex if required). But, anyway, you get the idea I'm sure ... Good luck!
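
The same concern applies to any shell version of this test. As a rough illustration in bash rather than Perl, the sketch below contrasts a substring search with a whole-word search; the function names has_substring and has_word are made up for the example, and -w is the whole-word option provided by GNU and BSD grep.

```shell
#!/bin/bash
# has_substring LINE TERM: true if TERM occurs anywhere in LINE.
has_substring() {
    printf '%s\n' "$1" | grep -qF -- "$2"
}

# has_word LINE TERM: true only if TERM occurs as a whole word.
has_word() {
    printf '%s\n' "$1" | grep -qwF -- "$2"
}

line="12 11 10 9 8 7 5 4 3"
has_substring "$line" 1 && echo "substring search: 1 matches"
has_word "$line" 1      || echo "whole-word search: 1 does not match"
has_word "$line" 10     && echo "whole-word search: 10 matches"
```

Only the whole-word variant gives the behaviour wanted here, where a "1" must match another "1" and nothing else.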

si-thk · 09-05-2011, 09:39 AM

It also works correctly for partial words: no false positive for 1 against 12, 11 and so on. Thanks.

kurumi · 09-05-2011, 10:25 AM

If you have Ruby (1.9+):

Code:

#!/usr/bin/env ruby
file2=File.open("file2").readlines.map!(&:split)
File.open("file1").each do |line|
  file1 = line.split
  file2.each{|f2| print f2.all?{|x| file1.include?(x)}  ? "1" :"0"}
  puts
end

test run:

Code:

$ ruby test.rb
101
111
001
000
001

allend · 09-05-2011, 11:11 AM

An interesting exercise for which I came up with this bash solution.

Code:

#!/bin/bash

rm -f OUTPUT.TXT
while read -r LINE1 ; do
  while read -r -a SEARCHTERMS ; do
    for (( i=1; i <= ${#SEARCHTERMS[@]}; i++ )); do
      # -w matches whole words only, so "1" does not match "12"
      if echo "$LINE1" | grep -qw "${SEARCHTERMS[i-1]}" ; then
        if [ "$i" -eq "${#SEARCHTERMS[@]}" ] ; then
          echo -n "1 " >> OUTPUT.TXT
        fi
      else
        echo -n "0 " >> OUTPUT.TXT
        break
      fi
    done
  done < FILE2
  echo "" >> OUTPUT.TXT
done < FILE1

Avoiding the use of sort should help with speed. I am also confident that a real bash expert could improve this!

grail · 09-05-2011, 09:33 PM

Well I am not sure if improved, but at least slightly different:

Code:

exec > FILE3
while read -r line
do
    while read -r -a arr
    do
        ans=1
        for i in "${arr[@]}"
        do
            reg="\<$i\>"
            [[ "$line" =~ $reg ]] || ans=0
        done
        echo -n "$ans "
    done<FILE2
    echo
done<FILE1

si-thk · 09-06-2011, 01:52 AM

Thanks, everybody. It seems all 3 codes are more efficient than mine. I still have to try them on very large files. Thanks again for the help.

allend · 09-06-2011, 07:59 AM

@grail - Very neat! Thanks for the instructive demonstration of the shell's =~ binary operator with a regular expression. Obviously much cleaner than calling grep.
Perhaps retaining the break in the innermost loop would be more efficient. It only takes one missing match to cause a zero to be written to the output file, so there is no need to test all possibilities.
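
A minimal sketch of that suggestion, assuming the same file layout as in grail's loop: the function name match_rows is made up for illustration, and the \< \> word anchors are swapped for a (^| )word( |$) pattern, which is plain POSIX extended regex syntax and so does not depend on a GNU extension.

```shell
#!/bin/bash
# match_rows FILE1 FILE2   (hypothetical wrapper, for illustration only)
# Same row/column layout as the loop above, but the word loop stops
# at the first word that is missing from the FILE1 line.
match_rows() {
    local line ans word reg
    local -a arr
    while read -r line; do
        while read -r -a arr; do
            ans=1
            for word in "${arr[@]}"; do
                reg="(^| )${word}( |\$)"     # whole-word match
                if ! [[ $line =~ $reg ]]; then
                    ans=0
                    break                    # one missing word decides the column
                fi
            done
            printf '%s ' "$ans"
        done < "$2"
        echo
    done < "$1"
}
```

Calling match_rows FILE1 FILE2 > FILE3 writes the matrix. How much the break saves depends on how early a missing word tends to appear in each file2 line.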