LinuxQuestions.org


johnsfine 04-16-2009 08:16 PM

Diff for files with resequenced chunks
 
When I diff two text files (which I do quite often, for a variety of reasons), usually significant chunks have moved rather than changed. I've used many different diff programs, but never one that understands the difference between moving and changing text.

I'd like to have a diff program for general use that somehow detects and somehow reports moved chunks differently than changed chunks. I don't have any great idea how to do either, nor even how to define the difference between moved and changed when both are occurring. I'm just hoping someone has figured it out and put that in some diff program.

But at the moment, I'm trying to compare several pairs of very large files in which almost everything is in resequenced, moderately large chunks. I want to ignore all the resequencing and find the largest of the differences that aren't resequencing. I don't know of any diff tool that is even helpful. They just see the two files as almost totally different with lots of tiny matching bits (where a few short lines in a row have common contents). Even those matching bits aren't true relocated chunks.

If I were to code a program to do that myself, I might:

1) Create a "node" object representing each position in a file at which a line starts.

2) Sort all those nodes into lexical sequence. (If the line contents differ, the nodes sort by that content; if they are the same, then by the next line, and so on.) That needs only a simple (though slow) comparison operator for an ordinary sort operation.

3) Compare the sorted node sequence of one file with the sorted node sequence of the other. I'm not sure of the exact details, but in that sequence, almost every position in one file could easily be paired one-to-one with its best match in the other file.

4) Drop all the nodes (which would be the vast majority for the data I want to compare) that either are good matches (many characters long, spanning newlines) or have their whole first line within an earlier good match. This is probably easiest to do by sorting back into the original node sequence, carrying that pairing data along.

5) Group and report (back in original sequence) all the line starts that were not dropped.
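
For illustration only, here is a minimal Python sketch of steps 1 to 5, not an existing tool. The function names (`suffix_order`, `common_run`, `best_matches`, `unmatched_groups`) and the `min_lines` threshold are hypothetical, and a "good match" is assumed to mean a run of at least `min_lines` identical lines rather than a character count. It copies each line suffix to build sort keys, so it is only practical for small inputs; multi-hundred-MB files would need a proper suffix array or hashed lines.

```python
from bisect import bisect_left


def suffix_order(lines):
    # Steps 1-2: one node per line start, sorted by that line,
    # then by the next line, and so on (list comparison does this).
    return sorted(range(len(lines)), key=lambda i: lines[i:])


def common_run(a, i, b, j):
    # Number of consecutive identical lines starting at a[i] and b[j].
    n = 0
    while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
        n += 1
    return n


def best_matches(a, b):
    # Step 3: for each line start in a, find the line start in b that
    # shares the longest run. In sorted order the best candidate is
    # always one of the two neighbours of the insertion point.
    order_b = suffix_order(b)
    keys_b = [b[j:] for j in order_b]  # quadratic memory: sketch only
    result = []
    for i in range(len(a)):
        pos = bisect_left(keys_b, a[i:])
        best = (0, None)
        for k in (pos - 1, pos):
            if 0 <= k < len(order_b):
                run = common_run(a, i, b, order_b[k])
                if run > best[0]:
                    best = (run, order_b[k])
        result.append(best)
    return result


def unmatched_groups(a, b, min_lines=3):
    # Step 4: drop line starts that begin a good match or lie inside
    # an earlier good match. Step 5: group the survivors into
    # (first, last) index ranges in original order.
    covered_until = 0
    groups = []
    for i, (run, _) in enumerate(best_matches(a, b)):
        if run >= min_lines:
            covered_until = max(covered_until, i + run)
        if i < covered_until:
            continue
        if groups and groups[-1][1] == i - 1:
            groups[-1] = (groups[-1][0], i)
        else:
            groups.append((i, i))
    return groups


if __name__ == "__main__":
    old = ["y1", "y2", "y3", "ORIGINAL", "x1", "x2", "x3"]
    new = ["x1", "x2", "x3", "CHANGED", "y1", "y2", "y3"]
    # Line ranges of `new` that are not part of any moved chunk.
    print(unmatched_groups(new, old))
```

Running the same function with the arguments swapped would give the unmatched ranges of the other file.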

If I haven't lost you, finally some questions:

A) What programs already exist that do some decent part of this job? I'd rather not get sidetracked from what I'm actually trying to do into building a tool for the comparison.

B) The approach I described above is a first idea for what is a rather complex problem. Do you know a better approach? All those comparisons between semi-random substrings of a multi-hundred-MB data set will cause massive cache misses and run very slowly.

C) If I do write the program, what about presentation of results? A typical GUI diff program (WinMerge, KDiff3, etc.) does a good job of displaying a file with difference points highlighted, letting you browse through it, jump to differences, and view them in context.

I'd like to do roughly the same (which implies not really dropping the nodes I said to "drop" above). On one side, go to an unmatched point and view it in context, including the matched chunks above or below it. On the other side, line up the context of one of those matched chunks.

I certainly don't want to write all that presentation code. Is there some open source GUI tool that does that kind of presentation with clear enough source code that it would be easy to replace the processing code but keep the presentation code?

PTrenholme 04-16-2009 10:43 PM

I don't know of any general program to do what you want, but last November another user posted a question about finding all the matched "sections" in a set of files. We developed a gawk program to do the matching which you might be able to modify to solve your problem.


All times are GMT -5.