LinuxQuestions.org

si-thk 09-05-2011 05:04 AM

efficient shell script to compare contents of files
 
Hi, I need to compare lines of one file with that of another file. For example:

File1:
12 11 10 9 8 7 5 4 3
15 14 13 12 11 10 9 8 7 6 5 4 3
14 13 12 11 10 9 8 5
11 10 9 8 7
10 8 7 6 5 3

has to be compared with File2:
5 3 4
4 7 5 6
5 10

Giving an output File3:
1 0 1
1 1 1
0 0 1
0 0 0
0 0 1

I have written a shell script to do this, which works fine.


#!/bin/sh

> file3
while read line
do
    >c.txt
    # one word of the file1 line per row
    echo $line| tr -s ' ' '\n' > line
    while read line1
    do
        echo $line1| tr -s ' ' '\n'|sort -rn > line1
        # e.txt keeps the file2 words that match none of the file1 words
        fgrep -vf line line1 > e.txt

        FILE=e.txt
        if [ -s $FILE ] ; then
            echo "0" >> c.txt
        else
            echo "1" >> c.txt
        fi
    done < file2
    cat c.txt|awk '{printf ($1 " " )}'>> file3
    echo >> file3
done < file1

The problem with this code is that it takes quite a long time when the number of entries in file1 and file2 reaches the order of thousands. Any suggestions for more efficient execution would be greatly appreciated.

Thanks in advance.

grail 09-05-2011 05:12 AM

Could you perhaps first explain how the output is generated? I.e., how do 12 11 10 9 8 7 5 4 3 and 5 3 4 translate to 1 0 1?

si-thk 09-05-2011 06:00 AM

Thanks for the reply. What is compared is the lines of file1 against the lines of file2. If the line 1 of file 2 (5 3 4) is completely embed in line1 of file1 then in the first place (1,1) of file3 is 1 (meaning yes, 5 3 4 is included in 12 11 10 9 8 7 5 4 3). The next value, the 0 in 1 0 1, is the result of matching 12 11 10 9 8 7 5 4 3 against 4 7 5 6 (which fails, hence the 0 at position (1,2)). The 1 at position (1,3) of file3 says that 5 10 is in 12 11 10 9 8 7 5 4 3. That is how the line 1 0 1 in file3 arises. The 2nd line of file3 comes from comparing the 2nd line of file1 with the 1st, 2nd and 3rd lines of file2, the 3rd line of file3 from comparing the 3rd line of file1 with lines 1, 2 and 3 of file2, and so on.

cheddarcheese 09-05-2011 07:03 AM

I'm a bit confused. You said, "...If the line 1 of file 2 (5 3 4) is completely embed in line1 of file1 then in the first place (1,1) of file3 is 1 ...."

Well, "5 3 4" is not completely embedded in line 1 of file 1 (although "5 4 3" is), and yet the position (1,1) in file 3 is still a 1. Or do you mean if 5 AND 3 AND 4 are all, separately, identified in line 1 of file 1?

si-thk 09-05-2011 07:27 AM

I'm sorry if my description was confusing. Yes, 5, 3 and 4 are separate words, so to speak (the words are separated by spaces). So the set {5, 3, 4} is included in the set {12, 11, 10, 9, 8, 7, 5, 4, 3}.
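
The set-inclusion rule described here can be sketched as a small shell function. This is only an illustration, not code from the thread: the name is_subset is hypothetical, and it assumes words are separated by single spaces and contain no glob characters.

```sh
#!/bin/sh
# is_subset NEEDLES HAYSTACK  (hypothetical helper, not from the thread)
# Succeeds if every space-separated word of NEEDLES occurs as a whole
# word in HAYSTACK. Assumes single spaces between words.
is_subset() {
    # Pad with spaces so every word is delimited on both sides.
    haystack=" $2 "
    for word in $1; do
        case $haystack in
            *" $word "*) ;;      # word present, keep checking
            *) return 1 ;;       # one missing word is enough to fail
        esac
    done
    return 0
}

is_subset "5 3 4"   "12 11 10 9 8 7 5 4 3" && echo 1 || echo 0   # prints 1
is_subset "4 7 5 6" "12 11 10 9 8 7 5 4 3" && echo 1 || echo 0   # prints 0 (6 is missing)
is_subset "5 10"    "12 11 10 9 8 7 5 4 3" && echo 1 || echo 0   # prints 1
```

The three calls reproduce the first row of file3 (1 0 1). The order of the words does not matter, which is why 5 3 4 matches a line containing 5 4 3.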

cheddarcheese 09-05-2011 07:54 AM

Okay, I get it (I think).

Well, not sure if you must do it as a shell script, but if you are able to do it with Perl, then you might do something like the following. I get the same results as you do in file 3 when I run this script. A couple of things though.

a) the ~~ operator (smart-match operator) is only available in more recent versions of Perl, so on an older version you would need to amend that a bit.
b) Without testing it a bit more, I'm not entirely sure whether the ~~ operator might report a "1" in file 2 as a match against a "12" (for example) in file 1 (which is not what you want, I know) ... in which case the smart-match line would need to be amended somehow anyway.
c) The script currently just prints to standard output, but can easily be amended to print to file 3
d) Rather than opening file 2 on each loop, it would be faster to read it in once, into a hash, and then loop the hash values ... but that shouldn't be too difficult either.

Code:

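#!/usr/bin/perl
#
use strict;

open(F1, '<', 'file1.txt') or die "Cannot open file1.txt: $!";

while (my $f1line = <F1>)
{
    chomp($f1line);
    my @f1nums = split(/ /, $f1line);

    # file2 is reopened for every line of file1 (see point d)
    open(F2, '<', 'file2.txt') or die "Cannot open file2.txt: $!";

    while (my $f2line = <F2>)
    {
        chomp($f2line);
        my @f2nums = split(/ /, $f2line);

        my $flag = 1;

        for (@f2nums)
        {
            # a single missing number is enough to reject the line
            if (!($_ ~~ @f1nums))
            {
                $flag = 0;
                last;
            }
        }

        print "$flag\t";
    }
    close(F2);
    print "\n";
}
close(F1);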

Anyway, not sure if that is in any way helpful, but it was an interesting little problem :-)
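
Point (d) above, reading file2 only once and testing membership through a hash, can also be sketched outside Perl. The following is a rough illustration using awk from a shell function, not the poster's method; the name compare_files is hypothetical, and the sketch assumes file2 is not empty.

```sh
#!/bin/sh
# compare_files FILE1 FILE2  (hypothetical helper name, not from the thread)
# Prints one row of 0/1 flags per line of FILE1.
compare_files() {
    awk '
        # First file argument (FILE2): remember every line; it is read once.
        # The NR == FNR test assumes FILE2 is not empty.
        NR == FNR { pattern[++n] = $0; next }

        # Second file argument (FILE1): index the words of the current line.
        {
            split("", have)                 # portable way to empty the array
            for (i = 1; i <= NF; i++) have[$i] = 1

            out = ""; sep = ""
            for (j = 1; j <= n; j++) {
                m = split(pattern[j], want, " ")
                ok = 1
                for (k = 1; k <= m; k++)
                    if (!(want[k] in have)) { ok = 0; break }
                out = out sep ok
                sep = " "
            }
            print out
        }
    ' "$2" "$1"
}
```

It would be called as compare_files file1 file2 > file3. Because the whole comparison runs inside one awk process, no external command is started per line, which is where most of the time goes in the original script.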

si-thk 09-05-2011 08:35 AM

Thanks for the reply and the code. A shell script is not a must for me; I used one just because shell tools handle unsorted files with tens of thousands of words very well. BTW, your script looks attractive. I'll see whether it works well for long files. Thanks also for the tip about hashing file2.

cheddarcheese 09-05-2011 08:43 AM

Sure, no worries ... But do take note of point (b) which I mentioned, because I'm really not 100% certain how the smart-match operator (~~) performs matches such as the one I suggested. You obviously only want a "1" to match against another "1", and not a "10", "11" or "12", for example. It may work the way you want, I'm just not certain. (although that line can probably be replaced with some other regex if required). But, anyway, you get the idea I'm sure ... Good luck!
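
The distinction raised in point (b), substring match versus whole-word match, can be shown in shell terms with grep. This is just a sketch for comparison; the function names has_substring and has_word are hypothetical, and -w is the whole-word option found in GNU and BSD grep.

```sh
#!/bin/sh
# Hypothetical helpers, not from the thread.
# has_substring WORD LINE: true if WORD occurs anywhere in LINE.
has_substring() { printf '%s\n' "$2" | grep -qF -- "$1"; }
# has_word WORD LINE: true only if WORD occurs as a whole word in LINE.
has_word()      { printf '%s\n' "$2" | grep -qwF -- "$1"; }

line="12 11 10 9 8 7 5 4 3"
has_substring 1 "$line" && echo "substring: 1 matches"   # false positive via 12, 11, 10
has_word 1 "$line"      || echo "whole word: 1 does not match"
has_word 10 "$line"     && echo "whole word: 10 matches"
```

Only the whole-word test gives the behaviour wanted in this thread.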

si-thk 09-05-2011 09:39 AM

It works fine for partial words too: no false positive for 1 against 12, 11 and so on. Thanks.

kurumi 09-05-2011 10:25 AM

If you have Ruby (1.9+):

Code:

#!/usr/bin/env ruby
file2=File.open("file2").readlines.map!(&:split)
File.open("file1").each do |line|
  file1 = line.split
  file2.each{|f2| print f2.all?{|x| file1.include?(x)}  ? "1" :"0"}
  puts
end

test run:
Code:

$ ruby test.rb
101
111
001
000
001


allend 09-05-2011 11:11 AM

An interesting exercise for which I came up with this bash solution.
Code:

#!/bin/bash

rm -f OUTPUT.TXT
while read LINE1  ; do
  while read -a SEARCHTERMS ; do
    for (( i=1;i <= ${#SEARCHTERMS[@]};i++ )); do
      # -w matches whole words only, so that "4" does not match inside "14"
      if echo "$LINE1" | grep -qw -- "${SEARCHTERMS[i-1]}" ; then
        if [ "$i" -eq "${#SEARCHTERMS[@]}" ] ; then
          echo -n "1 " >> OUTPUT.TXT;
        fi
      else
        echo -n "0 " >> OUTPUT.TXT;
        break;
      fi
    done
  done < FILE2
  echo "" >> OUTPUT.TXT
done < FILE1

Avoiding the use of sort should help with speed. I am also confident that a real bash expert could improve this!

grail 09-05-2011 09:33 PM

Well I am not sure if improved, but at least slightly different:
Code:

exec > FILE3
while read -r line
do
    while read -a arr
    do
        ans=1
        for i in ${arr[*]}
        do
            reg="\<$i\>"
            [[ "$line" =~ $reg ]] || ans=0
        done
        echo -n "$ans "
    done<FILE2
    echo
done<FILE1


si-thk 09-06-2011 01:52 AM

Thanks everybody. It seems all three scripts are more efficient than mine. I still have to try them on very large files. Thanks again for the help.

allend 09-06-2011 07:59 AM

@grail - Very neat! Thanks for the instructive demonstration on the use of shell binary operator with a regular expression. Obviously much cleaner than calling grep.
Perhaps retaining the break in the innermost loop would be more efficient. It only takes one missing match to cause a zero to be written to the output file, so there is no need to test all possibilities.
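
The suggestion to keep the break could look roughly like the sketch below. It is an adaptation, not grail's code: the wrapper name compare_with_break is hypothetical, and the \<word\> pattern is replaced by a plain POSIX ERE on the assumption that words are numbers separated by whitespace.

```bash
#!/bin/bash
# compare_with_break FILE1 FILE2  (hypothetical wrapper name, not from the thread)
compare_with_break() {
    local line word reg ans
    local -a arr
    while read -r line; do
        while read -r -a arr; do
            ans=1
            for word in "${arr[@]}"; do
                # Stand-in for \<word\>: the word must be delimited by
                # whitespace or by the start/end of the line. Words are
                # assumed to contain no regex metacharacters.
                reg="(^|[[:space:]])${word}(\$|[[:space:]])"
                if ! [[ $line =~ $reg ]]; then
                    ans=0
                    break    # one missing word settles the answer
                fi
            done
            printf '%s ' "$ans"
        done < "$2"
        echo
    done < "$1"
}
```

Called as compare_with_break FILE1 FILE2 > FILE3, it writes the same rows as before (each flag followed by a space), but stops testing a file2 line as soon as one of its words is missing.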


All times are GMT -5.