Way of ignoring out-of-order lines in diff?

Geneset · 06-04-2009, 09:51 AM

Hi Folks.

I'm working on diffing HTML pages and MySQL DB dumps and I'm coming across a (minor) issue.

I can use the regular expression engine to ignore certain "words" that I know to have changed (dates, version numbers, hostnames, etc.), but (for a reason that is being hunted separately) occasionally data is reported on the webpage out of order.

This behaviour is not a fault, and I know that I can do

Code:

diff <(sort filenameA ) <(sort filenameB)

And that solves the issue for a single file.

But I'm dealing with snapshots consisting of hundreds of pages, and have been recursively diffing each directory successfully, solving all comparison problems but this one, and leaving me with one file that lists the differences between the directories for each file.

I could do something along the lines of

Code:

diff <(find dirnameA -type f | xargs sort) <(find dirnameB -type f | xargs sort)

But then the diff just looks like one long file and I lose the delineation between files.

After inspecting the man pages I can't find an (obvious) way to ignore out-of-order lines.

Anyone have any bright ideas on either:

A) a regex I could use in diff to compare each line to the other lines in the current file.
B) a way of post-processing the resulting diff file to excise the offending swaps.

Regards
G

MensaWater · 06-04-2009, 09:56 AM

Have you tried sdiff? It tries to put the differences side by side. I often find it more useful than diff though not quite a perfect tool.

Geneset · 06-04-2009, 10:01 AM

I thought of that also, and it was my initial line of attack on this issue, but without the regex capabilities of diff, the perceived error rate greatly increases: all of the changes we already knew about get displayed, instead of the observed differences being reduced to things that we DON'T know about.
Thank you for your feedback, jl.
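
For reference, the single-file sorted diff from the first post can be run once per file, which keeps the delineation between files. The following is only a sketch under some assumptions: bash is available (for process substitution), both snapshot directories use the same relative paths, and find supports -print0. The function name sorted_tree_diff is made up for illustration. Any extra arguments are passed straight to diff, so existing regex options can still be supplied.

```shell
#!/usr/bin/env bash
# Sketch: diff two directory trees file by file, ignoring line order.
# Assumption: both trees share the same relative file paths.
# Files that exist only in DIR_B are not reported by this sketch.

# Usage: sorted_tree_diff DIR_A DIR_B [extra diff options...]
sorted_tree_diff() {
    local dir_a=$1 dir_b=$2
    shift 2
    local rel out
    # Walk regular files under DIR_A; NUL separators keep odd names intact.
    while IFS= read -r -d '' rel; do
        rel=${rel#./}
        if [ ! -f "$dir_b/$rel" ]; then
            printf 'Only in %s: %s\n' "$dir_a" "$rel"
            continue
        fi
        # Sort both sides first so lines that merely moved cancel out.
        # diff exits non-zero when the sorted contents differ.
        if ! out=$(diff "$@" <(sort "$dir_a/$rel") <(sort "$dir_b/$rel")); then
            # One header per file keeps the boundaries visible in the report.
            printf '=== %s ===\n%s\n' "$rel" "$out"
        fi
    done < <(cd "$dir_a" && find . -type f -print0)
}
```

Called as sorted_tree_diff dirnameA dirnameB > report.txt, this writes one section per differing file. Note that sorting throws away position information, so the line numbers in each section refer to the sorted copies, not the original pages.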