[SOLVED] diff ignore files in only one directory but not in another

zhjim · 10-24-2013, 05:11 AM

Hi folks,

When comparing two directories with diff, I would like to exclude certain files that match a specific suffix in one directory, but not ignore them in another. I can't get it to work. This is my case:

Code:

root@VM-box-2:~/abgleich# diff -rq -x '.git' system-skripte system-skripte_neu
Only in system-skripte_neu: all
Only in system-skripte: ftp_user.sh
Only in system-skripte_neu: less
Only in system-skripte_neu/mailer: all
Only in system-skripte/mailer: domain_delete
Files system-skripte/mailreport.sh and system-skripte_neu/mailreport.sh differ
Only in system-skripte_neu: more

That's my starting point. Now I want to exclude the file all, but only in the directory mailer. So I add -x '*mailer/all', but it still shows up.

Code:

root@VM-box-2:~/abgleich# diff -rq -x '.git' -x '*mailer/all' system-skripte system-skripte_neu
Only in system-skripte_neu: all
Only in system-skripte: ftp_user.sh
Only in system-skripte_neu: less
Only in system-skripte_neu/mailer: all
Only in system-skripte/mailer: domain_delete
Files system-skripte/mailreport.sh and system-skripte_neu/mailreport.sh differ
Only in system-skripte_neu: more

Using only -x 'all' removes both entries.

Even giving the full path 'system_skripte_neu/mailer/all' does not work, nor do combinations with '*' and '.'. Is it even possible to give diff an exclusion pattern that contains a path?
Version in use: diff 3.2 on Debian 7.2

Any ideas how I can do this? Or should I resort to rsync in dry-run mode?
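
For context: GNU diff matches the -x pattern against the base name of each file, not against its path, which is why a pattern such as '*mailer/all' never matches anything. One possible workaround, sketched below, is to let diff report everything and filter the report afterwards. The function name diff_filtered is made up for this example, and the grep patterns assume the English "Only in DIR: NAME" and "Files A and B differ" messages printed by diff -q.

```shell
#!/bin/sh
# Sketch: compare two trees, but hide "all" only when it sits in a mailer/ directory.
# diff_filtered is a hypothetical helper name, not part of diff.
diff_filtered() {
    # LC_ALL=C keeps diff's messages untranslated so the patterns below can match.
    LC_ALL=C diff -rq -x '.git' "$1" "$2" |
        grep -v \
            -e '^Only in .*/mailer: all$' \
            -e '^Files .*/mailer/all and .*/mailer/all differ$'
}
```

Called as diff_filtered system-skripte system-skripte_neu, this should keep the top-level "all" entry and drop only the one under mailer. Note that the exit status is then grep's, not diff's.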

rtmistler · 10-24-2013, 11:57 AM

This is something very basic that I've used for years to compare my latest working code against my last release directory for a certain project. I've evolved it over time and obviously I have a variety of files in my directories. I also compare in the A->B direction and then in the B->A direction to determine if I've added or deleted files as I've progressed.

The main point that may be useful for you is that I build list variables of the file types that match my search criteria. So if you can select a subset of files with a grep criterion, thereby excluding the ones you wish to ignore, you can then use those lists in the for loop to perform the diff comparisons.

I'm also not the most elegant script writer; I get done what I need to get done and move on, and I don't refine stuff like that until I find a flaw. The only one I recall on this script that bugged me was that my development directories were named without spaces, but when I made them an official release the directory names contained spaces and caused problems, so I recall fixing that with added quotation marks.

Code:

#!/bin/sh
# Script to diff two development directories

#set -xv

if [ $# -ne 2 ]; then
    echo "Usage: diff_files.sh dir1 dir2";
    exit 1;   # exit codes are 0-255; "exit -1" is rejected by some shells (e.g. dash)
fi

# Use a private variable; overwriting HOME would change what ~ means for child programs
BASE=$PWD
DIR1="$BASE/$1"
DIR2="$BASE/$2"
cd "$DIR1"
C1LIST=`find . -name "*.c"`
H1LIST=`find . -name "*.h"`
CPP1LIST=`find . -name "*.cpp"`
PNG1LIST=`find . -name "*.png"`
MAKEFILE1LIST=`find . -name Makefile`
PRO1LIST=`find . -name "*.pro"`
QRC1LIST=`find . -name "*.qrc"`
SH1LIST=`find . -name "*.sh"`
WAV1LIST=`find . -name "*.wav"`
CFG1LIST=`find config -name "*"`

cd "$DIR2"
C2LIST=`find . -name "*.c"`
H2LIST=`find . -name "*.h"`
CPP2LIST=`find . -name "*.cpp"`
PNG2LIST=`find . -name "*.png"`
MAKEFILE2LIST=`find . -name Makefile`
PRO2LIST=`find . -name "*.pro"`
QRC2LIST=`find . -name "*.qrc"`
SH2LIST=`find . -name "*.sh"`
WAV2LIST=`find . -name "*.wav"`
CFG2LIST=`find config -name "*"`

echo ""; echo ">>>>  STARTING COMPARISONS  <<<<"; echo "";

cd "$DIR1"
DIFF1CNT=0
for i in $C1LIST $H1LIST $CPP1LIST $PNG1LIST $MAKEFILE1LIST $PRO1LIST $QRC1LIST $SH1LIST $WAV1LIST $CFG1LIST
do
    if [ ! -e "$DIR2/$i" ]; then
        echo "      File $i does NOT exist in $2";
    else
        diff -q "$i" "$DIR2/$i";
        if [ $? -ne 0 ]; then
            # POSIX arithmetic; "let" is a bash builtin and fails under a plain /bin/sh
            DIFF1CNT=$((DIFF1CNT + 1));
        fi
    fi
done

echo ""; echo ">>>>  REVERSING DIRECTION  <<<<"; echo "";

cd "$DIR2"
DIFF2CNT=0
for i in $C2LIST $H2LIST $CPP2LIST $PNG2LIST $MAKEFILE2LIST $PRO2LIST $QRC2LIST $SH2LIST $WAV2LIST $CFG2LIST
do
    if [ ! -e "$DIR1/$i" ]; then
        echo "      File $i does NOT exist in $1";
    else
        diff -q "$i" "$DIR1/$i";
        if [ $? -ne 0 ]; then
            # POSIX arithmetic; "let" is a bash builtin and fails under a plain /bin/sh
            DIFF2CNT=$((DIFF2CNT + 1));
        fi
    fi
done

if [ $DIFF1CNT -ne $DIFF2CNT ]; then
    echo ""; echo "There was a DIFFERENT AMOUNT of file compare counts between FW and REV comparisons"; echo "";
else
    echo ""; echo "There was NO DIFFERENCE in file compare counts between FW and REV comparisons"; echo "";
fi

echo ">>>>  DONE  <<<<"; echo "";

exit 0

zhjim · 10-25-2013, 02:14 AM

Thanks for the script, rtmistler, but I want a blacklist, whereas you are working with a whitelist. I also found some other options that walk the directory tree with find and then grep each path against a specific pattern to decide whether the file should be examined or not. As I do not really need to see the differences inside the files, only whether they differ, I used rsync to get the job done.
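
The find-based blacklist idea mentioned above could look roughly like this. It is only a sketch: compare_blacklist and the two excluded paths are invented for the example, it checks one direction only (run it again with the arguments swapped for the reverse), and it assumes file names contain no newlines.

```shell
#!/bin/sh
# Sketch: list files that are missing or different in the second tree,
# skipping a blacklist of paths. compare_blacklist is a hypothetical name.
compare_blacklist() {
    src=$1
    dst=$2
    # find prints paths relative to the first tree (./mailer/all, ...),
    # so the ! -path tests can exclude a file in one directory only.
    (cd "$src" && find . -type f ! -path './.git/*' ! -path './mailer/all') |
    while IFS= read -r f; do
        if [ ! -e "$dst/$f" ]; then
            echo "Only in $src: $f"
        elif ! cmp -s "$src/$f" "$dst/$f"; then
            echo "Differ: $f"
        fi
    done
}
```

Because the exclusion works on the whole relative path, a top-level file named all is still reported while mailer/all is skipped.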

The problem here is that one has to interpret the output in some ways and be absolutely sure about the direction of the compare. I also had to dig into the options a lot to get only the things I was interested in. Here's the line I use:

Code:

rsync -avrIcO --del dir1/ dir2/

-r (recursive) is already implied by -a (archive); I added it just to make sure. -a bundles a lot of options for synchronizing things like permissions, timestamps, ownership and the like. -I (--ignore-times) is not part of -a: it makes rsync check files even if size and timestamp are the same. -v is for verbose and creates the output. One could use -i for a complete log, but that would need to be parsed. -O keeps directories that differ only in their timestamp out of the output. I guessed that -I would take care of this too, but -O is what does it. The real deal was the -c option. By default rsync only does a quick check on size and modification time; with -c it compares a checksum of each file, so only the files that really differ are printed by -v. Finally, --del shows the files that exist only in the second directory (as "deleting ..." lines). Files that exist only in dir1 just show up in the normal output.
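
One caveat worth spelling out: without -n (--dry-run) an rsync line like the one above really copies and deletes files. A reduced, report-only variant might look like the sketch below; rsync_compare is a hypothetical wrapper name, and the option set is trimmed to what a pure comparison needs.

```shell
#!/bin/sh
# Sketch: report differences between two trees without changing either one.
# rsync_compare is a hypothetical wrapper name.
rsync_compare() {
    # -r  recurse into directories
    # -c  decide by checksum instead of size and modification time
    # -n  dry run: nothing is copied or deleted
    # -v  print the names of the files that would be transferred
    # --delete  report files that exist only in the second tree ("deleting ...")
    # ${1%/}/ makes sure each argument ends in exactly one slash.
    rsync -rcnv --delete "${1%/}/" "${2%/}/"
}
```

As with the original line, the direction matters: files that are new or changed in the first tree are listed by name, files found only in the second tree appear as "deleting" lines.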

Sample run:

Code:

root@VM-box-2:~/abgleich# ./compare.sh ./system-skripte/ ./system-skripte_neu/
sending incremental file list
deleting more
deleting less
deleting all
ftp_user.sh
mailreport.sh
deleting mailer/all
mailer/domain_delete
neuverz/

If you don't know anything about the directory, I guess the output is not of much use, apart from the "deleting filename" lines. One now has to examine every listed file and see what's up with it. Maybe go for a diff at this point to ease the pain.

rtmistler · 10-25-2013, 07:13 AM

That sounds a lot like maintaining a repository for an SCCS. Perhaps RCS could help you instead of doing all that manually.

In either case, that's what I'm thinking about here. You'll need to maintain a diary of the trees of every directory you wish to use this script for, and within the record for each tree you'll need an entry for each directory holding the relevant information, be that size, last modification date, extension, and so forth.

My concern there would be that I'd be spending so much time with housekeeping that I'd never finish. The comparisons and producing the output are minimal, it's the maintenance of your digest which will be the problem.

A reason why my script works for me is that my conditions are bounded by the conventions under which I code a particular project. Perhaps if you can bound the structure and content of directories, you can make the task easier.

One way I'd attack this would be to go to the top of each directory tree and run a consistently formatted listing command, such as:

Code:

ls -lRrt >> this-directory-digest.log

This way I'd get a full listing of all files, symbolic links, and directories, including their privileges, the directory name under which they reside, the attributes for each file; ownership, group, privileges, plus size and modification date. Then I'd process that file into my intended search and compare arrays. You know, divide and conquer. Derive my white and black lists from that search and sort action; and once all things were organized, fire off the diff actions.
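
A rough sketch of that digest idea follows. It swaps ls -lRrt for find plus cksum, on the assumption that content is what matters here and timestamps are not; make_digest is a made-up name.

```shell
#!/bin/sh
# Sketch: write one "CRC SIZE PATH" line per regular file of a tree.
# make_digest is a hypothetical helper name.
make_digest() {
    # cksum prints checksum, size and path; sorting on the third field
    # orders the lines by path so that two digests line up for diff.
    (cd "$1" && find . -type f -exec cksum {} + | LC_ALL=C sort -k 3)
}

# Possible usage:
#   make_digest dir1 > dir1.digest
#   make_digest dir2 > dir2.digest
#   diff dir1.digest dir2.digest
```

Lines marked < or > in the final diff then point at files that are missing on one side or differ in content, and grep -v on the digest files is an easy place to apply a blacklist first.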

Best of luck!

zhjim · 10-25-2013, 09:58 AM

The whole story is actually about a website that got changed on disk: it was copied over, then files were moved and some paths inside the PHP files were changed. The repository of this website was changed as well, and now I have to find out what kind of change occurred where.
So the first step is to get the differences between the original (inactive) and the new (active) part of the website, incorporate those changes into the repository, and then see what else differs between the repository and the new website. I guess when things need to be done quickly you can't do them right. Plan is another four-letter word.

selfprogrammed · 10-25-2013, 01:51 PM

I do this often, and I use

Code:

diff -r -U4 dir1 dir2 > dir12.diff

Then I go in and start editing the diff, cutting and pasting parts of it to separate files.
I use jed with 2 to 10 files open at one time, cutting from one and picking another to put it into.
I end up with one file I can use as a patch, and other files that I can use to go back and hand patch. I can even copy lines from the diff and paste them in to source, with only a delete of the first character needed.

This is the method I use to review a debugging copy of the source code for changes and to apply selected code from that session to creating patch sets that will go into the SVN.
Each patch then gets applied, compiled, and tested to verify it is a complete patch,
then it gets committed.

You can also use grep on the diff output to find selected lines, or exclude selected lines, to make a derivative diff listing. The problem is that automated mangling of the diff will usually miss something, so just looking it over in an editor is always safer. It is so easy to delete whole sections of the diff that it is not worth the time to debug special tools.
If you have a large number of identical edits to do, then consider:
- can use replace to edit file names to something that will redirect patch elsewhere
- can use sed to mass edit the diff
- can write a python script to make mass changes
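
As an illustration of the sed option, here is a minimal sketch that rewrites a directory prefix in the header lines of a recursive unified diff, so that patch would be steered to a different tree. retarget_diff is an invented name, and the OLD argument is used as a plain regular expression, so directory names with special characters would need escaping.

```shell
#!/bin/sh
# Sketch: retarget_diff OLD NEW < in.diff > out.diff
# Replaces the directory prefix OLD/ with NEW/ in diff header lines only.
# retarget_diff is a hypothetical helper name.
retarget_diff() {
    # Only lines starting with "diff ", "--- " or "+++ " are touched.
    # Caveat: a hunk line that itself starts with "--- " or "+++ "
    # would be edited too, so the result should still be looked over.
    sed -e "/^diff /s# $1/# $2/#g" \
        -e "/^--- /s# $1/# $2/#" \
        -e "/^+++ /s# $1/# $2/#"
}
```

For example, retarget_diff dir2 work < dir12.diff > work.diff would point the new-file side of every header at work/ while leaving the hunks alone.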

Spect73 · 10-25-2013, 07:22 PM

Quote:

Originally Posted by zhjim

The whole story is actually about a website that got changed on disk: it was copied over, then files were moved and some paths inside the PHP files were changed. The repository of this website was changed as well, and now I have to find out what kind of change occurred where.
So the first step is to get the differences between the original (inactive) and the new (active) part of the website, incorporate those changes into the repository, and then see what else differs between the repository and the new website. I guess when things need to be done quickly you can't do them right. Plan is another four-letter word.

Not sure I understand enough to make an intelligent reply. I track a site, www.bitsavers.org, to find out what has been added, deleted or moved around. The only way I was able to do it was to write a C program that could query the site, get the table of contents for each directory, and compare this information against the contents of my local mirror.

Cordially

zhjim · 11-07-2013, 05:20 AM

Thanks for all the input. I got along with the rsync line and did the sync by hand. I just used the output of the rsync command and then diff'ed every file that was not deleted or added.
The main problem was that I had to do it twice: first to get all the differences between the two directories on the server and get them into the repository, and then to check out the repository and compare it to the now synced directory on the server.

We had to move a web page to a new directory and, because of this, change some paths inside the code of the page. Instead of first doing it in the newly created repository, we did it on the live system. In the meantime there were also changes to the code inside the repository. So we had 3 locations (2 directories, 1 repository) and all had some changes of their own. Actually, not only the paths were changed; there were also some bug fixes of sorts. It was all just "okay, we do it now and care later". Anyway, it's done and working now.