LinuxQuestions.org
Old 12-22-2011, 12:10 AM   #1
ali2011
Member
 
Registered: Nov 2011
Location: USA, CA
Distribution: Ubuntu+Fedora
Posts: 80

Rep: Reputation: Disabled
Comparing Similarity of Two Files:


I have two files as follows:

Code:
a.txt

2 4 6 7 8 91
3 6 87 44 3 122 8 15

b.txt

2 4 6 66 9 19 91
3 5 77 5 3 15
Each file has around 19988 lines. Each line in a.txt starts and ends with the same numbers as its corresponding line in b.txt (line #1 in a.txt and line #1 in b.txt both begin with 2 and end with 91, and so on for all lines). Lines can have different lengths, even corresponding ones (line #1 in a.txt has length 5, but line #1 in b.txt has length 6). The length is the number of numbers minus 1.

Now, what I want to figure out is how similar corresponding lines are to each other, e.g.:

Line #1 from a.txt: 2 4 6 7 8 91
Line #1 from b.txt: 2 4 6 66 9 19 91

Reading from left to right, (2 to 4) and (4 to 6) are the only two jumps shared by both lines, so the jump-similarity degree is 2. Also, how many numbers are shared by both lines? Only (2, 4, 6, 91), so the node-similarity degree is 4 - 2 = 2, since the start and end are always the same, as I mentioned earlier. I'd appreciate your help with this!

Last edited by ali2011; 12-22-2011 at 12:12 AM.
 
Old 12-22-2011, 04:35 AM   #2
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,671
Blog Entries: 4

Rep: Reputation: 3945
In some (any...) real programming language (don't try to use Bash scripting for this ... please!) you will first parse each line into a vector:

vec = [ 2, 4, 6, 7, 8, 91 ];

and perhaps you also need to produce a "sorted and de-duped" version of that (not applicable here)

and if necessary you could further develop a vector of vectors:

jump_vec = [ [undef, 2], [2, 4], [4, 6], [6, 7], [7, 8], [8, 91], [91, undef] ];
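
For illustration only, here is a minimal Python sketch of the parsing steps described above, using line #1 of a.txt from post #1. The helper names parse_line and jumps are hypothetical, and unlike the jump_vec shown above, this sketch leaves out the undef sentinel pairs at both ends, on the assumption that they are not needed for counting shared jumps.

```python
from typing import List, Tuple


def parse_line(line: str) -> List[int]:
    """Split a whitespace-separated line into a list of integers."""
    return [int(token) for token in line.split()]


def jumps(vec: List[int]) -> List[Tuple[int, int]]:
    """Pair each number with its successor: [2, 4, 6] -> [(2, 4), (4, 6)]."""
    return list(zip(vec, vec[1:]))


vec = parse_line("2 4 6 7 8 91")
jump_vec = jumps(vec)            # consecutive pairs, no undef sentinels
unique_sorted = sorted(set(vec))  # the "sorted and de-duped" version

print(vec)
print(jump_vec)
print(unique_sorted)
```

Tuples are used for the pairs so that they can later be placed in a set and intersected with the pairs from the other file.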

You should also first do careful research to see if you are, in fact, solving a problem that has already been solved before, such that you do not actually need to write new code to do any part of it (other than, say, the text parsing, which is trivial with regular expressions).

Then, bring to bear the real programming-language of your choice that has good support for vectors. Perl, Python, Ruby ... not Bash (which isn't a programming language anyway, and please don't start a tangent on this) and not C/C++ (which would be overkill). You want to find and use just as much already built and tested code as you can find ... over here, for instance.

Last edited by sundialsvcs; 12-22-2011 at 04:39 AM.
 
Old 12-22-2011, 05:06 AM   #3
ali2011
Member
 
Registered: Nov 2011
Location: USA, CA
Distribution: Ubuntu+Fedora
Posts: 80

Original Poster
Rep: Reputation: Disabled
Unfortunately, my experience is limited to socket programming and Matlab. I have very little knowledge of the languages you mentioned.
 
Old 12-22-2011, 09:39 AM   #4
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908
You seem to need to develop your program as two fundamental parts: one that implements the comparison of two records and produces some measure of similarity according to your requirements, and an iteration component that reads one record from each file and calls the comparison routine, passing the two records to it on each iteration. Shell scripting is probably sub-optimal for this but, depending on the complexity of your comparison algorithm, it is probably doable. You should be able to focus your design on these two elements more or less independently; the divide and conquer principle.
No one here is likely to fully understand your requirements for the record comparison algorithm without a significantly more detailed description. You need to do this anyway, as part of your design process. Developing a rigorous specification should help you understand the probable method/algorithm that will ultimately be used. On the matter of the outer layer that iterates over all records in the files, that should be easily done with standard shell looping constructs and file IO. Shell commands/keywords like while and for are going to be part of the looping code. Getting data records from files will probably use read. If you choose to implement the code in some other language, the basic structure should probably be the same.
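
As a rough sketch of that two-part structure (in Python rather than shell, and only under one reading of the requirements in post #1): compare_records is the comparison routine and compare_files is the outer iteration. Both names are hypothetical. The sketch assumes that a jump is a pair of adjacent numbers, that repeated jumps and repeated numbers count once, and that the shared first and last numbers are excluded from the node count.

```python
from typing import Iterator, List, Tuple


def parse_line(line: str) -> List[int]:
    """Split a whitespace-separated line into a list of integers."""
    return [int(token) for token in line.split()]


def compare_records(a: List[int], b: List[int]) -> Tuple[int, int]:
    """Return (jump_similarity, node_similarity) for two parsed lines."""
    if not a or not b:
        return 0, 0
    # Assumption: a jump is an ordered pair of adjacent numbers.
    jumps_a = set(zip(a, a[1:]))
    jumps_b = set(zip(b, b[1:]))
    jump_similarity = len(jumps_a & jumps_b)
    # Assumption: the start and end values are discounted, as in post #1.
    shared_nodes = (set(a) & set(b)) - {a[0], a[-1]}
    return jump_similarity, len(shared_nodes)


def compare_files(path_a: str, path_b: str) -> Iterator[Tuple[int, int, int]]:
    """Yield (line_number, jump_similarity, node_similarity) per line pair."""
    with open(path_a) as file_a, open(path_b) as file_b:
        # zip() reads the files in lockstep and stops at the shorter one.
        for number, (line_a, line_b) in enumerate(zip(file_a, file_b), start=1):
            yield (number, *compare_records(parse_line(line_a), parse_line(line_b)))


if __name__ == "__main__":
    # Line #1 of a.txt and b.txt from post #1.
    print(compare_records([2, 4, 6, 7, 8, 91], [2, 4, 6, 66, 9, 19, 91]))
```

For the full files, something like "for row in compare_files('a.txt', 'b.txt'): print(*row)" would print one result per line pair. Keeping the comparison in its own function means the similarity definition can be changed later without touching the file loop.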

Start writing some code, and when you bump into roadblocks, post the relevant fragments for specific help.

--- rod.
 
Old 12-22-2011, 11:25 AM   #5
anomie
Senior Member
 
Registered: Nov 2004
Location: Texas
Distribution: RHEL, Scientific Linux, Debian, Fedora
Posts: 3,935
Blog Entries: 5

Rep: Reputation: Disabled
Assuming I am understanding the problem correctly, you might check out this utility (and its approach): http://ssdeep.sourceforge.net/
 
  

