efficient shell script to compare contents of files
Hi, I need to compare the lines of one file with those of another file. For example:
File1:
12 11 10 9 8 7 5 4 3
15 14 13 12 11 10 9 8 7 6 5 4 3
14 13 12 11 10 9 8 5
11 10 9 8 7
10 8 7 6 5 3
has to be compared with File2:
5 3 4
4 7 5 6
5 10
giving an output File3:
1 0 1
1 1 1
0 0 1
0 0 0
0 0 1
I have written a shell script to do this, which works fine.
Code:
#!/bin/sh
# Start with an empty output file.
> file3
while read line
do
    # c.txt collects the 0/1 results for the current line of file1.
    >c.txt
    # Split the current line of file1 into one word per line.
    echo $line| tr -s ' ' '\n' > line
    while read line1
    do
        # Split the current line of file2 into one word per line.
        echo $line1| tr -s ' ' '\n'|sort -rn > line1
        # e.txt receives the words of the file2 line that match nothing in the file1 line.
        fgrep -vf line line1 > e.txt
        FILE=e.txt
        if [ -s $FILE ] ; then
            echo "0" >> c.txt
        else
            echo "1" >> c.txt
        fi
    done < file2
    # Join the results into one space-separated row of file3.
    cat c.txt|awk '{printf ($1 " " )}'>> file3
    echo >> file3
done < file1
The problem with this code is that it takes quite a long time when the number of entries in file 1 and file 2 reaches the order of thousands. Any suggestions for more efficient execution are deeply appreciated. Thanks in advance. |
Could you perhaps first explain how the output is generated? I.e., how do 12 11 10 9 8 7 5 4 3 and 5 3 4 translate to 1 0 1?
|
Thanks for the reply. The lines of file1 are compared with the lines of file2. If line 1 of file2 (5 3 4) is completely embedded in line 1 of file1, then position (1,1) of file3 is 1 (meaning yes, 5 3 4 is included in 12 11 10 9 8 7 5 4 3). The next value, the 0 in 1 0 1, is the result of matching 12 11 10 9 8 7 5 4 3 against 4 7 5 6 (which is a no, hence 0 in position (1,2)). The 1 in position (1,3) of file3 says that 5 10 is in 12 11 10 9 8 7 5 4 3. That is how the line 1 0 1 in file3 comes about. The 2nd line of file3 comes from comparing the 2nd line of file1 with the 1st, 2nd and 3rd lines of file2, the 3rd line of file3 comes from comparing the 3rd line of file1 with lines 1, 2 and 3 of file2, and so on.
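The rule described above can be sketched as a small POSIX shell function. This is only a minimal illustration, not any poster's script: the function name contains_all is hypothetical, and it assumes words are separated by spaces (not tabs) and contain no glob characters.

```shell
#!/bin/sh
# contains_all HAYSTACK NEEDLES   (hypothetical helper name)
# Succeeds (exit status 0) when every space-separated word in NEEDLES
# also appears as a whole word in HAYSTACK.
contains_all() {
    haystack=" $1 "                  # pad so every word has a space on both sides
    for word in $2; do
        case "$haystack" in
            *" $word "*) ;;          # word present, keep checking
            *) return 1 ;;           # one missing word is enough to fail
        esac
    done
    return 0
}

# Row 1 of the example: line 1 of file1 against the three lines of file2.
row=""
for needles in "5 3 4" "4 7 5 6" "5 10"; do
    if contains_all "12 11 10 9 8 7 5 4 3" "$needles"; then
        row="$row 1"
    else
        row="$row 0"
    fi
done
echo "${row# }"                      # prints: 1 0 1
```

The middle value is 0 because 6 does not occur in the first line of file1, even though 4, 7 and 5 do.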
|
I'm a bit confused. You said that if line 1 of file 2 (5 3 4) is completely embedded in line 1 of file 1, then position (1,1) of file 3 is 1.
Well, "5 3 4" is not completely embedded in line 1 of file 1 (although "5 4 3" is), and yet the position (1,1) in file 3 is still a 1. Or do you mean if 5 AND 3 AND 4 are all, separately, identified in line 1 of file 1? |
I'm sorry if my description was confusing. Yes, 5, 3 and 4 are separate words, so to speak (the words are separated by spaces). So the set {5, 3, 4} is included in the set {12, 11, 10, 9, 8, 7, 5, 4, 3}.
|
Okay, I get it (I think).
Well, I'm not sure if you must do it as a shell script, but if you are able to use Perl, then you might do something like the following. I get the same results as you do in file 3 when I run this script. A couple of things, though:
a) The ~~ operator (smart-match operator) is only available in more recent versions of Perl (5.10 and later), so on an older version you would need to amend that a bit.
b) Without testing it a bit more, I'm not entirely sure whether the ~~ operator might treat a "1" in file 2 as a positive match against a "12" (for example) in file 1 (which is not what you want, I know), in which case the smart-match line would need to be amended somehow anyway.
c) The script currently just prints to standard output, but can easily be amended to print to file 3.
d) Rather than opening file 2 on each loop, it would be faster to read it in once, into a hash, and then loop over the hash values, but that shouldn't be too difficult either.
Code:
#!/usr/bin/perl |
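Point (d) above, reading file 2 only once, can also be sketched without Perl. The following is a hedged illustration using awk from a shell function, not the Perl script from this post: the function name compare_files is hypothetical, and it assumes file2 is not empty (otherwise the NR == FNR test would also be true for file1).

```shell
#!/bin/sh
# compare_files FILE1 FILE2   (hypothetical helper name)
# Prints one row per line of FILE1, with one 0/1 value per line of FILE2.
compare_files() {
    awk '
        # First file on the command line (file2): store each line once.
        NR == FNR { needles[++n] = $0; next }

        # Every later line comes from file1.
        {
            split("", have)                         # portable way to empty an array
            for (i = 1; i <= NF; i++) have[$i] = 1  # set of words on this line

            row = ""
            for (j = 1; j <= n; j++) {
                m = split(needles[j], w, " ")
                ok = 1
                for (k = 1; k <= m; k++)
                    if (!(w[k] in have)) { ok = 0; break }
                row = row (j > 1 ? " " : "") ok
            }
            print row
        }
    ' "$2" "$1"
}

# Usage, assuming file1 and file2 exist in the current directory:
#   compare_files file1 file2 > file3
```

Because the words of each file1 line go into an array used as a set, the lookup is an exact word comparison, so a "1" cannot match a "12". Unlike the original script, this sketch does not leave a trailing space at the end of each output row.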
Thanks for the reply and the code. A shell script is not a must for me; I used one just because shell tools can handle unsorted files with tens of thousands of words very well. BTW, your script looks attractive. I'll see whether it works well for long files. Thanks also for the tip about hashing file2.
|
Sure, no worries ... But do take note of point (b) which I mentioned, because I'm really not 100% certain how the smart-match operator (~~) performs matches such as the one I suggested. You obviously only want a "1" to match against another "1", and not a "10", "11" or "12", for example. It may work the way you want, I'm just not certain. (although that line can probably be replaced with some other regex if required). But, anyway, you get the idea I'm sure ... Good luck!
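The concern about a "1" matching a "12" is the difference between a substring test and a whole-word test. A minimal shell sketch of that difference follows; the sample values are illustrative, and it says nothing about how Perl's ~~ operator itself behaves.

```shell
#!/bin/sh
# Words from one line of file1; the values are illustrative.
line="12 11 10"

# Substring test: "1" is found inside "12", which is a false positive.
case "$line" in
    *1*) substring_hit=1 ;;
    *)   substring_hit=0 ;;
esac

# Whole-word test: pad both sides with a space so that only a complete,
# space-delimited word can match.
case " $line " in
    *" 1 "*) word_hit=1 ;;
    *)       word_hit=0 ;;
esac

echo "substring: $substring_hit, whole word: $word_hit"   # substring: 1, whole word: 0

# The same distinction with grep on one-word-per-line input:
# -F takes the pattern literally, -x requires the whole line to match.
printf '%s\n' $line | grep -Fq  1 && echo "grep -F matches"
printf '%s\n' $line | grep -Fxq 1 || echo "grep -Fx does not match"
```

Any solution that compares whole words (a hash lookup, grep -x on one word per line, or space-padded patterns) avoids the false positive.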
|
It also works fine for words that are part of other words: no false positive for 1 against 12, 11 and so on. Thanks.
|
If you have Ruby (1.9+):
Code:
#!/usr/bin/env ruby
Code:
$ ruby test.rb |
An interesting exercise for which I came up with this bash solution.
Code:
#!/bin/bash |
Well I am not sure if improved, but at least slightly different:
Code:
exec > FILE3 |
Thanks, everybody. It seems all 3 scripts are more efficient than mine. I'll have to try them on very large files. Thanks again for the help.
|
@grail - Very neat! Thanks for the instructive demonstration on the use of shell binary operator with a regular expression. Obviously much cleaner than calling grep.
Perhaps retaining the break in the innermost loop would be more efficient. It only takes one missing match to cause a zero to be written to the output file, so there is no need to test all possibilities. |
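The effect of keeping the break can be illustrated by counting how many word checks are made with and without it. This is a hedged sketch only: the function name count_checks and its "break"/"full" mode argument are hypothetical and are not taken from grail's script.

```shell
#!/bin/sh
# count_checks HAYSTACK NEEDLES MODE   (hypothetical helper name)
# MODE is "break" (stop at the first missing word) or "full" (test every word).
# Prints "<result> <number of word checks>".
count_checks() {
    haystack=" $1 "
    checks=0
    result=1
    for word in $2; do
        checks=$((checks + 1))
        case "$haystack" in
            *" $word "*) ;;                      # word present, keep going
            *) result=0                          # one miss decides the result
               [ "$3" = break ] && break ;;      # leave the for loop early
        esac
    done
    echo "$result $checks"
}

count_checks "11 10 9 8 7" "5 3 4 6" full     # prints: 0 4
count_checks "11 10 9 8 7" "5 3 4 6" break    # prints: 0 1
```

Both calls report 0, but the version with the break stops after the first missing word. The saving grows with the length of the file2 lines and with how often a word is missing early.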