LinuxQuestions.org
Old 09-05-2011, 06:04 AM   #1
si-thk
LQ Newbie
 
Registered: Mar 2010
Posts: 8

Rep: Reputation: 0
efficient shell script to compare contents of files


Hi, I need to compare the lines of one file with those of another file. For example:

File1:
12 11 10 9 8 7 5 4 3
15 14 13 12 11 10 9 8 7 6 5 4 3
14 13 12 11 10 9 8 5
11 10 9 8 7
10 8 7 6 5 3

has to be compared with File2:
5 3 4
4 7 5 6
5 10

Giving an output File3:
1 0 1
1 1 1
0 0 1
0 0 0
0 0 1

I have written a shell script to do this, which works fine.


#!/bin/sh

> file3
while read line
do
    > c.txt
    echo $line | tr -s ' ' '\n' > line
    while read line1
    do
        echo $line1 | tr -s ' ' '\n' | sort -rn > line1
        fgrep -vf line line1 > e.txt

        FILE=e.txt
        if [ -s $FILE ] ; then
            echo "0" >> c.txt
        else
            echo "1" >> c.txt
        fi
    done < file2
    cat c.txt | awk '{printf ($1 " " )}' >> file3
    echo >> file3
done < file1

The problem with this code is that it takes quite a long time when the number of entries in file1 and file2 reaches the order of thousands. Any suggestions for more efficient execution would be deeply appreciated.

Thanks in advance.
 
Old 09-05-2011, 06:12 AM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,246

Rep: Reputation: 2684
Could you perhaps first explain how the output is generated? I.e., how do 12 11 10 9 8 7 5 4 3 and 5 3 4 translate to 1 0 1?
 
Old 09-05-2011, 07:00 AM   #3
si-thk
LQ Newbie
 
Registered: Mar 2010
Posts: 8

Original Poster
Rep: Reputation: 0
Thanks for the reply. The lines of file1 are compared with the lines of file2. If the line 1 of file 2 (5 3 4) is completely embed in line1 of file1 then in the first place (1,1) of file3 is 1 (meaning yes, 5 3 4 is included in 12 11 10 9 8 7 5 4 3). The next 0 in 1 0 1 is the result of matching 12 11 10 9 8 7 5 4 3 against 4 7 5 6 (which fails because 6 is missing, hence the 0 in place (1,2)). The 1 in place (1,3) of file3 says that 5 10 is in 12 11 10 9 8 7 5 4 3. That is how the line 1 0 1 in file3 comes about. The 2nd line of file3 comes from comparing the 2nd line of file1 with the 1st, 2nd and 3rd lines of file2, the 3rd line of file3 comes from comparing the 3rd line of file1 with lines 1, 2 and 3 of file2, and so on.
 
Old 09-05-2011, 08:03 AM   #4
cheddarcheese
Member
 
Registered: Aug 2011
Location: Massachusetts, USA
Distribution: Fedora; Centos; Puppy
Posts: 82

Rep: Reputation: 5
I'm a bit confused. You said, "...If the line 1 of file 2 (5 3 4) is completely embed in line1 of file1 then in the first place (1,1) of file3 is 1 ...."

Well, "5 3 4" is not completely embedded in line 1 of file 1 (although "5 4 3" is), and yet the position (1,1) in file 3 is still a 1. Or do you mean if 5 AND 3 AND 4 are all, separately, identified in line 1 of file 1?
 
Old 09-05-2011, 08:27 AM   #5
si-thk
LQ Newbie
 
Registered: Mar 2010
Posts: 8

Original Poster
Rep: Reputation: 0
I'm sorry if my description was confusing. Yes, 5, 3 and 4 are separate words (the words are separated by spaces). So the set {5, 3, 4} is included in the set {12, 11, 10, 9, 8, 7, 5, 4, 3}.
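The set-inclusion rule described here can be sketched in bash. This is only an illustration, not code from the thread: the function name `is_subset` is made up, and it assumes bash 4 or later (for associative arrays) and whitespace-separated words.

```bash
#!/bin/bash
# is_subset HAYSTACK NEEDLES   (hypothetical helper, not from the thread)
# Exit status 0 when every word of NEEDLES is also a word of HAYSTACK.
is_subset() {
    local -A seen=()          # set of words found in HAYSTACK (bash 4+)
    local -a words
    local word

    read -ra words <<< "$1"
    for word in "${words[@]}"; do
        seen[$word]=1
    done

    read -ra words <<< "$2"
    for word in "${words[@]}"; do
        # Exact key lookup, so "1" is never confused with "12".
        [[ -n ${seen[$word]} ]] || return 1
    done
    return 0
}
```

With the sample data, `is_subset "12 11 10 9 8 7 5 4 3" "5 3 4"` succeeds, while `is_subset "12 11 10 9 8 7 5 4 3" "4 7 5 6"` fails because 6 is missing.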
 
Old 09-05-2011, 08:54 AM   #6
cheddarcheese
Member
 
Registered: Aug 2011
Location: Massachusetts, USA
Distribution: Fedora; Centos; Puppy
Posts: 82

Rep: Reputation: 5
Okay, I get it (I think).

Well, not sure if you must do it as a shell script, but if you are able to do it with Perl, then you might do something like the following. I get the same results as you do in file 3 when I run this script. A couple of things though.

a) The ~~ operator (smart-match operator) is only available in more recent versions of Perl (5.10 and later), so on an older version you would need to amend that a bit.
b) Without testing it a bit more, I am not entirely sure whether the ~~ operator would treat a "1" in file 2 as a match for a "12" (for example) in file 1, which I know is not what you want. If it does, the smart-match line would need to be amended somehow anyway.
c) The script currently just prints to standard output, but it can easily be amended to print to file 3.
d) Rather than opening file 2 on each loop, it would be faster to read it in once, into a hash, and then loop over the hash values ... but that shouldn't be too difficult either.

Code:
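#!/usr/bin/perl
#
use strict;

open(F1, '<', 'file1.txt') or die "Cannot open file1.txt: $!";

while (my $f1line = <F1>)
{
    chomp($f1line);
    my @f1nums = split(/ /, $f1line);

    # file2 is re-opened for every line of file1 (see point d)
    open(F2, '<', 'file2.txt') or die "Cannot open file2.txt: $!";

    while (my $f2line = <F2>)
    {
        chomp($f2line);
        my @f2nums = split(/ /, $f2line);

        my $flag = 1;

        for (@f2nums)
        {
            # smart-match: is this word an element of @f1nums?
            if (!($_ ~~ @f1nums))
            {
                $flag = 0;
                last;
            }
        }

        print "$flag\t";
    }
    close(F2);
    print "\n";
}
close(F1);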
Anyway, not sure if that is in any way helpful, but it was an interesting little problem :-)
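Point (d), reading file 2 only once, could look roughly like the following in bash. This is a sketch under assumptions rather than a tested replacement: `compare_files` is a made-up name, it needs bash 4 or later for `mapfile` and associative arrays, and it keeps file 2 in a plain array instead of a hash.

```bash
#!/bin/bash
# compare_files FILE1 FILE2   (hypothetical helper, not from the thread)
# Prints one row of 0/1 flags per line of FILE1, one flag per line of FILE2.
compare_files() {
    local -a patterns words
    local -A seen
    local line pattern word flag row

    mapfile -t patterns < "$2"        # FILE2 is read exactly once

    while read -r line; do
        seen=()                       # word set for the current FILE1 line
        read -ra words <<< "$line"
        for word in "${words[@]}"; do
            seen[$word]=1
        done

        row=""
        for pattern in "${patterns[@]}"; do
            flag=1
            read -ra words <<< "$pattern"
            for word in "${words[@]}"; do
                # stop at the first word that is missing from the set
                [[ -n ${seen[$word]} ]] || { flag=0; break; }
            done
            row+="$flag "
        done
        echo "${row% }"               # drop the trailing space
    done < "$1"
}
```

Calling `compare_files file1 file2 > file3` would then write the matrix to file3. Everything runs inside one shell process, so no external commands are started per comparison.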

Last edited by cheddarcheese; 09-05-2011 at 08:58 AM.
 
Old 09-05-2011, 09:35 AM   #7
si-thk
LQ Newbie
 
Registered: Mar 2010
Posts: 8

Original Poster
Rep: Reputation: 0
Thanks for the reply and the code. A shell script is not a must for me; I used one only because it can handle unsorted files with tens of thousands of words very well. BTW, your script looks attractive. I'll see whether it works well for long files. Thanks also for the tip about hashing file2.
 
Old 09-05-2011, 09:43 AM   #8
cheddarcheese
Member
 
Registered: Aug 2011
Location: Massachusetts, USA
Distribution: Fedora; Centos; Puppy
Posts: 82

Rep: Reputation: 5
Sure, no worries ... But do take note of point (b) which I mentioned, because I'm really not 100% certain how the smart-match operator (~~) performs matches such as the one I suggested. You obviously only want a "1" to match against another "1", and not a "10", "11" or "12", for example. It may work the way you want, I'm just not certain. (although that line can probably be replaced with some other regex if required). But, anyway, you get the idea I'm sure ... Good luck!
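The difference between a substring match and a whole-word match, which is the concern in point (b), can be shown with two small bash tests. The names `has_substring` and `has_word` are invented for this sketch, and it assumes the words on a line are separated by single spaces.

```bash
#!/bin/bash
# Both helpers are hypothetical; they only illustrate the two kinds of match.

# Succeeds when $2 occurs anywhere inside $1, even as part of a longer word.
has_substring() {
    [[ $1 == *"$2"* ]]
}

# Succeeds only when $2 occurs in $1 as a complete space-delimited word.
# Padding both sides with a space lets the first and last words match too.
has_word() {
    [[ " $1 " == *" $2 "* ]]
}
```

For the line `12 11 10`, `has_substring "12 11 10" 1` succeeds (the unwanted match), while `has_word "12 11 10" 1` fails and `has_word "12 11 10" 11` succeeds.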

Last edited by cheddarcheese; 09-05-2011 at 10:13 AM.
 
Old 09-05-2011, 10:39 AM   #9
si-thk
LQ Newbie
 
Registered: Mar 2010
Posts: 8

Original Poster
Rep: Reputation: 0
It also works correctly for partial words: no false positive for 1 against 12, 11 and so on. Thanks.
 
Old 09-05-2011, 11:25 AM   #10
kurumi
Member
 
Registered: Apr 2010
Posts: 228

Rep: Reputation: 45
If you have Ruby(1.9+)

Code:
#!/usr/bin/env ruby
file2=File.open("file2").readlines.map!(&:split)
File.open("file1").each do |line|
  file1 = line.split
  file2.each{|f2| print f2.all?{|x| file1.include?(x)}  ? "1" :"0"}
  puts
end
test run:
Code:
$ ruby test.rb
101
111
001
000
001
 
Old 09-05-2011, 12:11 PM   #11
allend
Senior Member
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 4,429

Rep: Reputation: 1348
An interesting exercise for which I came up with this bash solution.
Code:
#!/bin/bash

rm -f OUTPUT.TXT   # -f: no error if the file does not exist yet
while read LINE1  ; do
   while read -a SEARCHTERMS ; do
    for (( i=1;i <= ${#SEARCHTERMS[@]};i++ )); do
      if echo "$LINE1" | grep -qw "${SEARCHTERMS[i-1]}" ; then   # -w: whole words only, so 1 does not match 12
        if [ "$i" -eq "${#SEARCHTERMS[@]}" ] ; then
          echo -n "1 " >> OUTPUT.TXT;
        fi
      else
        echo -n "0 " >> OUTPUT.TXT;
        break;
      fi
    done
  done < FILE2
  echo "" >> OUTPUT.TXT
done < FILE1
Avoiding the use of sort should help with speed. I am also confident that a real bash expert could improve this!

Last edited by allend; 09-05-2011 at 12:13 PM.
 
Old 09-05-2011, 10:33 PM   #12
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,246

Rep: Reputation: 2684
Well I am not sure if improved, but at least slightly different:
Code:
exec > FILE3
while read -r line
do
    while read -a arr
    do
        ans=1
        for i in ${arr[*]}
        do
            reg="\<$i\>"
            [[ "$line" =~ $reg ]] || ans=0
        done
        echo -n "$ans "
    done<FILE2
    echo
done<FILE1
 
1 members found this post helpful.
Old 09-06-2011, 02:52 AM   #13
si-thk
LQ Newbie
 
Registered: Mar 2010
Posts: 8

Original Poster
Rep: Reputation: 0
Thanks everybody. It seems all three scripts are more efficient than mine. I still have to try them on very large files. Thanks again for the help.
 
Old 09-06-2011, 08:59 AM   #14
allend
Senior Member
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 4,429

Rep: Reputation: 1348
@grail - Very neat! Thanks for the instructive demonstration on the use of shell binary operator with a regular expression. Obviously much cleaner than calling grep.
Perhaps retaining the break in the innermost loop would be more efficient. It only takes one missing match to cause a zero to be written to the output file, so there is no need to test all possibilities.
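The early exit suggested here might look like the sketch below. It follows the shape of the loop in post #12 but is not that code: `row_for_line` is a made-up name, and the `\<...\>` word boundaries are swapped for a POSIX character-class pattern, on the assumption that this is more portable across regex libraries.

```bash
#!/bin/bash
# row_for_line LINE FILE2   (hypothetical helper, not from the thread)
# Prints the row of 0/1 flags for one FILE1 line against every FILE2 line.
row_for_line() {
    local line=$1 word reg ans row=""
    local -a arr

    while read -ra arr; do
        ans=1
        for word in "${arr[@]}"; do
            # the word must be bounded by whitespace or the ends of the line
            reg="(^|[[:space:]])${word}(\$|[[:space:]])"
            if ! [[ $line =~ $reg ]]; then
                ans=0
                break             # one missing word already decides the flag
            fi
        done
        row+="$ans "
    done < "$2"

    echo "${row% }"               # drop the trailing space
}
```

With the sample FILE2, `row_for_line "12 11 10 9 8 7 5 4 3" FILE2` prints `1 0 1`. The `break` saves the most work when FILE2 lines are long and a mismatch shows up early.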
 
  


All times are GMT -5. The time now is 06:18 AM.
