Comparing Similarity of Two Files:

ali2011 · 12-22-2011, 12:10 AM

I have two files as follows:

Code:

a.txt

2 4 6 7 8 91
3 6 87 44 3 122 8 15

b.txt

2 4 6 66 9 19 91
3 5 77 5 3 15

Each file has around 19988 lines. Each line in a.txt starts and ends with the same numbers as its corresponding line in b.txt (line #1 in a.txt and line #1 in b.txt both begin with 2 and end with 91, and so on for all lines). Corresponding lines can have different lengths (line #1 of a.txt has length 5, but line #1 of b.txt has length 6). The length is the number of numbers minus 1.

Now, what I want to know is how similar corresponding lines are to each other, e.g.:

Line #1 from a.txt: 2 4 6 7 8 91
Line #1 from b.txt: 2 4 6 66 9 19 91

From left to right, (2 to 4) and (4 to 6) are the only two jumps shared by both lines, so the jump-similarity degree is 2. Also, how many numbers are shared by both lines? Only (2, 4, 6, 91), so the node-similarity degree is 4 - 2 = 2, since the start and end are always the same, as I mentioned earlier. I'd appreciate your help on this!

sundialsvcs · 12-22-2011, 04:35 AM

In some (any...) real programming language (don't try to use Bash scripting for this ... please!) you will first parse each line into a vector:

vec = [ 2, 4, 6, 7, 8, 91 ];

and perhaps you also need to produce a "sorted and de-duped" version of that (not applicable here)

and if necessary you could further develop a vector of vectors:

jump_vec = [ [undef, 2], [2, 4], [4, 6], [6, 7], [7, 8], [8, 91], [91, undef] ];
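As a rough illustration, the vec and jump_vec structures above might look like this in Python. This is only a sketch under some assumptions: the function names (parse_line, jumps, jump_similarity, node_similarity) are made up for the example, the [undef, ...] sentinel pairs are left out, and shared jumps and shared numbers are counted with set semantics (repeats within a line count once), which the original question does not spell out.

```python
def parse_line(line):
    """Turn a whitespace-separated line such as '2 4 6 7 8 91' into a list of ints."""
    return [int(tok) for tok in line.split()]


def jumps(vec):
    """Return the consecutive pairs of vec, e.g. [2, 4, 6] -> [(2, 4), (4, 6)]."""
    return list(zip(vec, vec[1:]))


def jump_similarity(vec_a, vec_b):
    """Count the distinct jumps that occur in both lines (assumed set semantics)."""
    return len(set(jumps(vec_a)) & set(jumps(vec_b)))


def node_similarity(vec_a, vec_b):
    """Count the distinct numbers shared by both lines, not counting the
    start and end values, which the question says always match."""
    shared = set(vec_a) & set(vec_b)
    return len(shared - {vec_a[0], vec_a[-1]})


a = parse_line("2 4 6 7 8 91")
b = parse_line("2 4 6 66 9 19 91")
print(jump_similarity(a, b), node_similarity(a, b))  # prints: 2 2
```

For line #1 of the two sample files this gives the same values worked out by hand in the question (jump-similarity 2, node-similarity 2). Whether repeated numbers should be counted once or several times is something the specification would have to settle.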

You should also first do careful research to see if you are, in fact, solving a problem that has already been solved before, such that you do not actually need to write new code to do any part of it (other than, say, the text parsing, which is trivial with regular expressions).

Then, bring to bear the real programming-language of your choice that has good support for vectors. Perl, Python, Ruby ... not Bash (which isn't a programming language anyway, and please don't start a tangent on this) and not C/C++ (which would be overkill). You want to find and use just as much already built and tested code as you can find ... over here, for instance.

ali2011 · 12-22-2011, 05:06 AM

Unfortunately, my experience is limited to socket programming and Matlab. I have very little knowledge of the languages you mentioned.

theNbomr · 12-22-2011, 09:39 AM

You seem to need to develop your program as two fundamental parts: one that implements the comparison of two records and produces some measure of similarity according to your requirements, and an iteration component that reads one record from each file and calls the comparison routine, passing the two records to it on each iteration. Shell scripting is probably sub-optimal for this but, depending on the complexity of your comparison algorithm, it is probably doable. You should be able to focus your design on these two elements more or less independently; the divide and conquer principle.
No one here is likely to fully understand your requirements for the record comparison algorithm without a significantly more detailed description. You need to do this anyway, as part of your design process. Developing a rigorous specification should help you understand the probable method/algorithm that will ultimately be used. On the matter of the outer layer that iterates over all records in the files, that should be easily done with standard shell looping constructs and file IO. Shell commands/keywords like while and for are going to be part of the looping code. Getting data records from files will probably use read. If you choose to implement the code in some other language, the basic structure should probably be the same.
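To make the two-part structure concrete, here is a minimal sketch in Python rather than shell. Everything in it is an assumption for illustration: compare_records is a placeholder comparison that simply counts shared consecutive pairs, and compare_streams and compare_files are hypothetical names for the outer loop. The point is only that the iteration part does not need to know how the comparison works.

```python
from itertools import zip_longest


def compare_records(rec_a, rec_b):
    """Placeholder comparison routine: number of distinct consecutive
    pairs that appear in both records. Swap in the real measure here."""
    pairs_a = set(zip(rec_a, rec_a[1:]))
    pairs_b = set(zip(rec_b, rec_b[1:]))
    return len(pairs_a & pairs_b)


def compare_streams(lines_a, lines_b):
    """Iteration component: read one record from each input in lockstep
    and hand the pair to compare_records. Accepts any iterables of text
    lines, including open file objects."""
    results = []
    for line_no, (line_a, line_b) in enumerate(zip_longest(lines_a, lines_b), start=1):
        if line_a is None or line_b is None:
            raise ValueError(f"inputs differ in length at line {line_no}")
        rec_a = [int(tok) for tok in line_a.split()]
        rec_b = [int(tok) for tok in line_b.split()]
        results.append(compare_records(rec_a, rec_b))
    return results


def compare_files(path_a, path_b):
    """Convenience wrapper that opens both files and compares them line by line."""
    with open(path_a) as file_a, open(path_b) as file_b:
        return compare_streams(file_a, file_b)
```

With the sample files from the first post, compare_files("a.txt", "b.txt") would return one score per line pair. zip_longest is used instead of zip so that a mismatch in the number of lines is reported rather than silently ignored.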

Start writing some code, and when you bump into roadblocks, post the relevant fragments for specific help.

--- rod.

anomie · 12-22-2011, 11:25 AM

Assuming I am understanding the problem correctly, you might check out this utility (and its approach): http://ssdeep.sourceforge.net/