shell script to recursively "compare" all files in a directory...

silex_88 · 05-12-2007, 12:03 AM

Hi, the command I'm using has the format:

arrow --query=FILE1 --compare=FILE2

I want FILE1 and FILE2 to go through all files in a certain directory recursively.

My attempt was

arrow --query=`find -type f DIR` --compare=`find -type f DIR`

However, I get a "too many arguments" error message.

jschiwal · 05-12-2007, 12:40 AM

I am not familiar with an arrow command.
Rather than cycling through all of the files and comparing each one to the remaining files in the list, I would run "md5sum" or "sum" on all of the files and then look for duplicate checksum values.

You can do this for files in various subdirectories as well.

Code:

find . -maxdepth 1 -type f -exec md5sum '{}' \; >md5sumlist
cut -d' ' -f1 md5sumlist | sort | uniq -d >acopylist
grep -f acopylist md5sumlist

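I think you can shorten this using the "-w32 -D" options of GNU uniq, which compare only the first 32 characters of each line (the MD5 hash) and print every duplicated line. Then the grep command is not necessary.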

Code:

find . -maxdepth 1 -type f -exec md5sum '{}' \; | sort | uniq -w32 -D
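
For the subdirectory case, here is a minimal sketch of the same idea without "-maxdepth 1". The function name list_duplicates is made up for illustration, and it assumes GNU md5sum and GNU uniq are installed.

```shell
#!/bin/bash
# list_duplicates DIR: print the checksum lines of all files under DIR
# (searched recursively) whose MD5 hash occurs more than once.
# list_duplicates is a hypothetical helper name, not part of any tool.
list_duplicates() {
    local dir=${1:-.}
    # "-exec ... {} +" hands many files to each md5sum call
    # instead of starting one md5sum process per file.
    find "$dir" -type f -exec md5sum {} + | sort | uniq -w32 -D
}
```

Call it as "list_duplicates DIR". Each output line is a hash followed by a filename, and files with the same content end up next to each other.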

----

Quote:

arrow --query=`find -type f DIR` --compare=`find -type f DIR`

1) The find command doesn't look right. The directory to base the search on should come first, as in "find DIR -type f".
2) --query=FILE1 --compare=FILE2 implies that the argument to query should be a single file, not every file in the directory. Even if it allowed a number of files, such as
arrow --compare=FILE --query="FILE1 FILE2 ..."
there could still be a problem if the number of files in the directory is too large.
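
To see what the command actually receives, here is a small sketch. count_args is a made-up helper that only prints how many arguments it was given, and the temporary directory is just for the demonstration.

```shell
#!/bin/bash
# count_args prints the number of arguments it received
# (hypothetical helper, used only for this demonstration).
count_args() { echo "$#"; }

demo_dir=$(mktemp -d)
touch "$demo_dir/a" "$demo_dir/b" "$demo_dir/c"

# The unquoted substitution is split into one word per file, so the
# command gets "--query=<first file>" plus every other filename as a
# separate extra argument: one argument per file, 3 in this case.
count_args --query=$(find "$demo_dir" -type f)

rm -rf "$demo_dir"
```

So with the backtick version, arrow never sees a single FILE1. It sees one --query option followed by a long list of stray filenames.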

silex_88 · 05-12-2007, 12:47 AM

Hi! Thanks for the prompt reply. I'm afraid I wasn't very clear: the arrow command I'm executing is custom, and I'm just looking for a way to give it all the files in a directory as parameters. In pseudo code, I want to do

Code:

for each FILE1 in DIR #recursive
do
  for each FILE2 in DIR #recursive
  do
    if FILE1 != FILE2
    do
      arrow --query=FILE1 --compare=FILE2 >> output.txt
    end
  end
end

where DIR is the same directory for both loops. However, I don't know how to write that as a shell script. Maybe it's easier in Python? If so, please show me how.

jschiwal · 05-12-2007, 04:24 AM

Code:

# Compare every regular file in $DIR against every other regular file.
# Quoting the variables keeps filenames with spaces intact.

for file1 in "$DIR"/*; do
   for file2 in "$DIR"/*; do
      if [ -f "$file1" ] && [ -f "$file2" ] && [ "$file1" != "$file2" ]; then
         arrow --query="$file1" --compare="$file2" >> output.txt
      fi
   done
done

The "[ -f ... ]" test checks that the name is a regular file; it could be a directory. Note that $DIR/* only covers the top level of DIR, not its subdirectories.

This isn't a good way of doing it. Every pair gets compared twice: once as (file1, file2) and again as (file2, file1). The $file2 loop should start with the next file after $file1. It would be better to have an array containing the filenames. In the outer loop, loop through i = 1 .. n. In the inner loop, loop through i+1 .. n.

Your way runs the inner loop body n squared times, and arrow n(n-1) times. This way runs arrow (n-1)n / 2 times.

Look at this example for inspiration:

Code:

# Change IFS so that filenames with spaces don't get split up
ifs=$IFS
IFS='
'
# Fill an array with the files in the current directory
files=($(find . -maxdepth 1 -type f ))

# Restore old IFS value
IFS=$ifs

# Display the index of the last file
LAST=$((${#files[@]}-1)) 
echo $LAST

# Display the contents of the array
for (( i=0; i<=$LAST; i++ )); do echo $i : "${files[$i]}"; done

See the "Arrays" section of the bash manual ("info bash") for more details.
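
Putting the array and the two index loops together, a minimal sketch could look like this. It needs bash (for arrays and process substitution). The function name compare_pairs is made up, and since arrow is your own command, the sketch only echoes the command lines as a dry run.

```shell
#!/bin/bash
# compare_pairs DIR: print one arrow command line for every unordered
# pair of regular files under DIR (searched recursively).
# compare_pairs is a hypothetical name; arrow is the custom command
# from this thread and is not executed here.
compare_pairs() {
    local dir=$1 f i j n
    local files=()

    # Read NUL-delimited names so that spaces in filenames are safe.
    while IFS= read -r -d '' f; do
        files+=("$f")
    done < <(find "$dir" -type f -print0)

    n=${#files[@]}
    for (( i = 0; i < n; i++ )); do
        # Start at i+1 so each pair is visited only once.
        for (( j = i + 1; j < n; j++ )); do
            # echo makes this a dry run; remove it to really run arrow.
            echo arrow --query="${files[i]}" --compare="${files[j]}"
        done
    done
}
```

Run it as "compare_pairs DIR" to check the generated commands first. Once the echo is removed, "compare_pairs DIR >> output.txt" collects the results. If arrow gives different results when query and compare are swapped, start the inner loop at j = 0 and skip the case j == i instead.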