LinuxQuestions.org
Old 04-16-2009, 08:16 PM   #1
johnsfine
LQ Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,286

Rep: Reputation: 1197
Diff for files with resequenced chunks


When I diff two text files (which I do quite often, for a variety of reasons), usually significant chunks have moved rather than changed. I've used many different diff programs, but never one that understands the difference between moving and changing text.

I'd like to have a diff program for general use that somehow detects moved chunks and reports them differently from changed chunks. I don't have any great idea how to do either, nor even how to define the difference between moved and changed when both occur. I'm just hoping someone has figured it out and put it in some diff program.

But at the moment, I'm trying to compare several pairs of very large files in which almost everything is in resequenced, moderately large chunks. I want to ignore all the resequencing and find the largest of the differences that aren't resequencing. I don't know of any diff tool that is even helpful: they just see the two files as almost totally different, with lots of tiny matching bits (where a few short lines in a row have common contents). Even those matching bits aren't true relocated chunks.

If I were to code a program to do that myself, I might:

1) Create a "node" object representing each position in a file at which a line starts.

2) Sort all those nodes into lexical sequence (if the lines at two nodes differ, the nodes sort by that content; if they are the same, compare the next line, and so on). That is a simple (though slow) comparison operator for an ordinary sort operation.

3) Compare the sorted node sequence of one file with the sorted node sequence of the other. I'm not sure of exact details, but in that sequence, almost every position in one file could be easily paired one to one with its best match in the other file.

4) Drop all the nodes (which would be the vast majority for the data I want to compare) that either are good matches (many characters long, spanning newlines) or have their whole first line within an earlier good match. Probably this is easiest by sorting back to the original node sequence carrying that pairing data along.

5) Group and report (back in original sequence) all the line starts that were not dropped.
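To make steps 1–3 concrete, here is the kind of thing I have in mind as a rough Python sketch. It simplifies in two ways: it caps the lexical comparison at a few lines per node instead of comparing to the first difference, and it only records exact matches of that many lines, not "best" partial matches:

```python
# Sketch of steps 1-3: treat each line start as a "node", sort nodes
# lexically by the lines that follow them, then pair each node in
# file A with a matching node in file B via binary search.
import bisect

def node_key(lines, i, depth=3):
    # Lexical key for the node at line i: this line, then the next,
    # etc. (capped at `depth` lines to keep the sketch cheap).
    return tuple(lines[i:i + depth])

def pair_nodes(lines_a, lines_b, depth=3):
    # Step 2: sort file B's nodes into lexical sequence.
    keys_b = [node_key(lines_b, j, depth) for j in range(len(lines_b))]
    order_b = sorted(range(len(lines_b)), key=keys_b.__getitem__)
    sorted_keys_b = [keys_b[j] for j in order_b]

    # Step 3: pair each node in A with an exactly matching node in B.
    pairs = {}
    for i in range(len(lines_a)):
        k = node_key(lines_a, i, depth)
        j = bisect.bisect_left(sorted_keys_b, k)
        if j < len(sorted_keys_b) and sorted_keys_b[j] == k:
            pairs[i] = order_b[j]  # line i of A matches this line of B
    return pairs
```

Steps 4 and 5 would then walk `pairs` back in original line order, dropping paired nodes and grouping the leftover unmatched runs for reporting.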

If I haven't lost you, finally some questions:

A) What programs already exist that do some decent part of this job? I'd rather not sidetrack from what I'm actually trying to do into building a tool for the comparison.

B) The approach I described above is a first idea for a rather complex problem. Do you know a better approach? All those comparisons between semi-random substrings of a multi-hundred-MB data set will cause massive cache misses and run very slowly.

C) If I do write the program, what about presentation of results? A typical GUI diff program (WinMerge, kdiff3, etc.) does a good job of displaying a file with difference points highlighted, letting you browse through it, jump to differences, and view them in context.

I'd like to do roughly the same (implying I wouldn't really drop the nodes I said to "drop" above). On one side, go to an unmatched point and view it in context, including matched chunks above or below. On the other side, line up the context of one of those matched chunks.

I certainly don't want to write all that presentation code. Is there some open source GUI tool that does that kind of presentation with clear enough source code that it would be easy to replace the processing code but keep the presentation code?
 
Old 04-16-2009, 10:43 PM   #2
PTrenholme
Senior Member
 
Registered: Dec 2004
Location: Olympia, WA, USA
Distribution: Fedora, (K)Ubuntu
Posts: 4,187

Rep: Reputation: 354
I don't know of any general program to do what you want, but last November another user posted a question about finding all the matched "sections" in a set of files. We developed a gawk program to do the matching which you might be able to modify to solve your problem.
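The core of the matching idea was something like the following sketch (written here in Python rather than gawk for brevity; the names are mine, not from that thread): index every line of one file by position, then greedily extend runs of consecutive identical lines.

```python
# Sketch of "find matched sections": for each line in file A, look up
# where it occurs in file B, extend the run of consecutive matching
# lines as far as possible, and keep runs of at least min_len lines.
from collections import defaultdict

def matched_sections(a_lines, b_lines, min_len=2):
    where_in_b = defaultdict(list)
    for j, line in enumerate(b_lines):
        where_in_b[line].append(j)

    sections = []  # (start_in_a, start_in_b, length)
    i = 0
    while i < len(a_lines):
        best, best_j = 0, -1
        for j in where_in_b.get(a_lines[i], []):
            n = 0
            while (i + n < len(a_lines) and j + n < len(b_lines)
                   and a_lines[i + n] == b_lines[j + n]):
                n += 1
            if n > best:
                best, best_j = n, j
        if best >= min_len:
            sections.append((i, best_j, best))
            i += best  # skip past the matched run
        else:
            i += 1
    return sections
```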
 
  

