
KevinAlaska 05-25-2007 03:23 PM

I need a command that can delete the following...
Hi everyone and thank you for reading my post.

I have been importing all my photos from about 5 different backups. I am consolidating all my photos so I don't miss a single photo. The problem with this is that there are lots of duplicate files being imported. The files are imported into folders by date taken (i.e. /home/myname/Photos/<year>/<month>/<day>/<photos>, so for example /home/myname/Photos/2007/05/25/<photos>).

So I have about 30,000 photos in there, and about 60 percent are probably duplicates. F-Stop renames duplicates like the following when they are imported: if 'photo123.jpg' already exists, the new copy is renamed with -1 at the end, 'photo123-1.jpg', then the next one would be 'photo123-2.jpg', etc.

I had a command given to me for this, but I can't find it for the life of me. The good news is my real desk's desktop is now clean from the process of looking for it.

Well, I hope I have not forgotten anything here. Thank you for all the help.


Kevin in Alaska

Emerson 05-25-2007 03:34 PM

GImageView can find duplicates. There is also Dupefinder for QT and CLI. And I'm sure there are many more. :)

KevinAlaska 05-25-2007 03:51 PM

Thank you for the info...

I am very new to Linux and not very keen on installing stuff that isn't listed in the "Adept installer" or by "Automatix2", which is also installed.

I am currently running the Kubuntu Feisty i386 build. Do you know if there is anything already installed in this distribution that just needs to be activated, or that can be downloaded via the programs listed above?

Also, I am not sure what Qt is, and I would imagine CLI is 'command line interface'?

Thank you again.

Kevin in Alaska

pljvaldez 05-25-2007 04:26 PM

jschiwal 05-25-2007 05:05 PM

If they are all under the same base directory, but different subdirectories, I would use the find command with the -exec md5sum '{}' \; option to calculate the md5sum of the files. A less reliable but potentially faster way could be to use normal checksums instead of calculating the md5sums.

find photodir/ -type f -iname "*.jpg" -exec md5sum '{}' \; >photolist
sort photolist >sorted-photo-list
uniq -w32 -D sorted-photo-list >dupelist

The -w32 option for uniq limits the test to the md5sum column. The -D option lists non-unique entries. Entries with the same md5sum are identical. Their names and locations may differ.
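To make that concrete, here is a toy run with made-up md5sums and paths (purely illustrative, not your real data):

```shell
# Made-up md5sums/paths purely to illustrate uniq -w32 -D;
# the first two lines share a sum, the third does not
printf '%s\n' \
  '0123456789abcdef0123456789abcdef  2007/05/25/photo123.jpg' \
  '0123456789abcdef0123456789abcdef  2007/05/26/photo123-1.jpg' \
  'fedcba9876543210fedcba9876543210  2007/05/25/photo999.jpg' \
  > toy-list
uniq -w32 -D toy-list
# prints only the two lines that share the first md5sum
```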

You could further process the list if you wanted to group the lists.

# Note: assumes that there are not tens of thousands of duplicates.  That
# would overflow bash with too many arguments in the for loop.
# Get a list of the unique md5sums in the dupelist by themselves
cut -d' ' -f1 dupelist | uniq >m5dupes

# cycle through the list and output all of the dupes, adding a separator line between groups
for md5item in $(cat m5dupes); do
    grep "$md5item" dupelist
    echo '------------'
done

You might want to scan through the coreutils info pages. There are a number of utilities that come in very handy for handling text files and lists. uniq, sort, comm, grep and sed work together very nicely. I haven't really learned awk programming yet, because piping these commands together often solves the problem, but the "Gawk: Effective AWK Programming" info manual is a good addition to the coreutils manual.
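Since awk came up: here is a one-liner sketch (my own, untested against real data) that does the same grouping as the loop above in a single pass, with no bash argument-length worry, reading the dupelist file built earlier:

```shell
# One pass over dupelist: print a separator line whenever the
# md5sum column ($1) changes from the previous line
awk 'NR > 1 && $1 != prev { print "------------" } { print; prev = $1 }' dupelist
```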

One command I find very handy at work is "comm". It compares two sorted lists and prints three columns: 1) lines unique to file1, 2) lines unique to file2, 3) lines common to both. You can suppress any column you want. sed is often used to massage the items in a list, such as removing trailing spaces, before using grep or comm.
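For example, with two small made-up lists of photo names:

```shell
# Two sorted lists of photo names (made up for illustration)
printf 'a.jpg\nb.jpg\nc.jpg\n' > list1
printf 'b.jpg\nc.jpg\nd.jpg\n' > list2
comm -12 list1 list2   # -1 -2 suppress columns 1 and 2: only names in both
comm -23 list1 list2   # only names unique to list1
```

The digits after the dash name the columns to suppress, so -12, -13, and -23 each leave exactly one column showing.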


Note: I was in a hurry and haven't tested these lines of code. So some testing may be needed before you use them.

chadwick 05-25-2007 11:12 PM

Here's how I'd do it.

1) First make a backup to make sure I don't hit the wrong key by accident and delete the wrong ones:
cd photodir/..
mkdir backup/
cp -r photodir/ backup/

where photodir/ would be replaced with /home/myname/Photos in your case.

2) Double check to make sure it worked

3) Then since the file names all use the same format and you can select out the ones you want to get rid of by the hyphen plus an extra character, use wildcards:

cd photodir/..
rm -f photodir/????/??/??/photo???-?.jpg

or if the naming isn't that consistent you could do

rm -f photodir/????/??/??/*-?.jpg
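One habit worth adding here (my suggestion, not part of the original recipe): collect what the pattern would match into a file first, and only delete after inspecting it. A sketch, assuming the same photodir layout:

```shell
# Dry run: list what the wildcard pattern would delete, and count it
find photodir -type f -name '*-?.jpg' > would-delete.txt
wc -l < would-delete.txt
# after inspecting would-delete.txt looks right:
# xargs -d '\n' rm -f < would-delete.txt   # GNU xargs; -d '\n' copes with spaces in names
```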

4) Double check to make sure everything's okay before you start doing anything to the backup.

5) Remember you can never be too careful when using rm -f

Believe me, I nonetheless wish I could have come up with something like jschiwal's. jschiwal's doesn't assume, for example, that there's no file you want to keep that has a hyphen followed by one character before .jpg. It has the added safety of being certain that two files are identical before deleting one of them, but it is harder to understand, so perhaps easier to make a mistake with.
