I used photorec to try to recover some files that I had accidentally deleted (shift+delete) from my home directory. I ended up with nearly 50,000 files found by photorec that I need to sift through to identify the few dozen I care about. I was able to narrow down the list of likely candidates to under 8,000 files based on the file extensions that photorec assigned. Then I eliminated files that were 100% duplicates from that population and got the list down to 5,800 files.
The files are text, so I may be able to weed the list down by a couple thousand more through careful use of grep, but I'm sure I won't find all the files I'm interested in that way. Even if I could, a bigger problem is that there may be dozens or even hundreds of revisions of a file, and I only want the last version of it. In one case, I found 373 revisions of a target file. I was able to find the last version of it only because I recently edited it, and I remembered a unique string that only occurred in that final version. I won't be so lucky in other cases.
So I'm looking for ideas on how to approach this data reduction problem. I was thinking that if I could identify files that are similar enough to each other, differing only in small detail, then they might fall into groups that I could view or process further in hopes of isolating the best version to save.
For implementation, either piped commands, bash scripts, or a C++ program are within the realm of possibility for me, although I'm not an ace with any of them. But before I can start on an implementation, I need help with the approach.
The basic goal is to group files by degree of similarity. If I can do that, I should be able to quickly reduce the number of files significantly just by viewing a few files from each group. I welcome all ideas on how to accomplish this. Ideas about other ways of approaching the problem would be welcome too.
In case anyone is interested, I found on a blog a masterful set of piped commands that I used at the command line to identify and delete duplicate files. The person who came up with it clearly has superior command-line fu! It's way beyond my meager capabilities. Unfortunately, I don't think it will be easy to adapt it to identifying near-duplicate files and grouping them.
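The exact pipeline isn't reproduced here, but the general idea is simply to hash every file and keep one copy per digest. For anyone curious, a rough Python sketch of the same idea (not the blog's one-liner) would be:
Code:
#!/usr/bin/env python
# Rough sketch of the duplicate-finding idea: group files by MD5 digest.
# Not the blog's pipeline, just the same approach spelled out.
import hashlib, os, sys
from collections import defaultdict

groups = defaultdict(list)
for root, dirs, names in os.walk(sys.argv[1]):
    for name in names:
        path = os.path.join(root, name)
        with open(path, 'rb') as f:
            groups[hashlib.md5(f.read()).hexdigest()].append(path)

for digest, paths in groups.items():
    if len(paths) > 1:
        # identical content recovered more than once; keep paths[0],
        # the rest are candidates for deletion
        print('%s: %d copies' % (digest, len(paths)))
        for p in paths[1:]:
            print('    duplicate: %s' % p)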
Quote:
(...) The basic goal is to group files by degree of similarity.
Interesting question. Apart from the realm of humongous one-liners, there are actually tools to test text similarity, like Sherlock, Similarity-utils, Simhash and SIM (you probably don't want SimSearch or phash), but I can't say how useful they are as I haven't tested them. I may well need some of them myself.
Thanks! I don't have md5deep on my system, but I'm going to install it. It looks quite interesting.
And thank you most especially for the links to Sherlock and the other utilities. They sound like they might be just what I need. I'll install them and check them out.
I found a paper written in 1993 by Udi Manber, http://webglimpse.net/pubs/TR93-33.pdf, that discusses an approach to solving this sort of problem. Sherlock and perhaps some of the other programs seem like they may be based on this type of approach.
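From a quick read, the core idea in the paper is to take fingerprints of small chunks of each file and call two files similar when they share a large fraction of their fingerprints. A toy sketch of that kind of fingerprint overlap (my own simplification, not the paper's exact algorithm) would look like this:
Code:
# Toy fingerprint-overlap similarity in the spirit of the paper: hash
# overlapping word shingles and compare the resulting sets.
# A simplification for illustration, not Manber's actual scheme.
import hashlib

def fingerprints(text, k=4):
    words = text.split()
    n = max(len(words) - k + 1, 1)
    shingles = (' '.join(words[i:i+k]) for i in range(n))
    return set(hashlib.md5(s.encode('utf-8')).hexdigest()[:8] for s in shingles)

def resemblance(text_a, text_b):
    # Jaccard overlap of the two fingerprint sets, between 0.0 and 1.0
    fa, fb = fingerprints(text_a), fingerprints(text_b)
    if not fa or not fb:
        return 0.0
    return float(len(fa & fb)) / len(fa | fb)

# e.g. resemblance(open('a.txt').read(), open('b.txt').read())
A per-file fingerprint set like this is what lets such tools avoid comparing every pair of files character by character.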
bigearsbilly, does your method find similar files, or exact duplicates? I can find exact duplicates easily using the command line one-liner I posted previously.
What I need to do is compare and group files that are different from each other, but similar enough that they have a high probability of being the same file with only minor edits differentiating them.
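For a quick-and-dirty score of "same file with minor edits", even the standard Python difflib module can be pressed into service on a pair of candidates (just a rough illustration, not what Sherlock does):
Code:
# Quick-and-dirty similarity score between two text files using only the
# standard library (a rough illustration, not Sherlock's algorithm).
import difflib, sys

def similarity(path_a, path_b):
    a = open(path_a).read()
    b = open(path_b).read()
    # ratio() is 1.0 for identical texts and close to 0.0 for unrelated ones
    return difflib.SequenceMatcher(None, a, b).ratio()

if __name__ == '__main__':
    print(similarity(sys.argv[1], sys.argv[2]))
The catch is cost: scoring every pair of 5,800 files means millions of comparisons, which is why grouping the files first (by size, or by a cheap hash) matters so much.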
I'm using the Sherlock program that unSpawn pointed me to. It is working pretty well for me, although it is still a laborious effort to sift through the dozens of similar files it identifies to try to figure out which one is the best version to save.
Here is my approach (if you want to try something different):
1. Cluster files by size, as two files whose sizes differ by 1 MB are hardly similar. Let's call this size threshold sz_th. So we form clusters of files of similar size.
2. Divide each cluster into subclusters using some relatively cheap "approximate hash" approach (e.g. simhash). Denote the maximum distance between files in one subcluster (in the simhash metric) by sim_th.
3. Finally, perform a pairwise similarity comparison within each subcluster using, e.g., Levenshtein.ratio(), and display only the pairs with similarity >= ratio_th. This step may take a long time if big binary files are involved.
In other words, reduce the number of candidates as much as possible before the expensive comparison.
Here is the Python code:
Code:
#!/usr/bin/env python
# Group candidate files by size, then by simhash, then confirm pairs with
# Levenshtein ratio.
from hashes.simhash import simhash
import Levenshtein
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import fcluster
from fastcluster import linkage
from pprint import pprint
import sys, glob, os


def make_clusters(cl, files):
    # Turn fcluster()'s flat labels into lists of File objects,
    # dropping singleton clusters.
    clusters = { i:[] for i in xrange(1, max(cl)+1) }
    for i, n in enumerate(cl):
        clusters[n] += [ files[i] ]
    return [ c for c in clusters.values() if len(c) > 1 ]


class File:
    simhash = None
    size = None

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        s = '<' + self.name
        if self.size:
            s += '; sz=' + str(self.size)
        if self.simhash:
            s += '; hash=' + str(self.simhash)
        s += '>'
        return s

    def get_size(self):
        if not self.size:
            self.size = os.stat(self.name).st_size
        return self.size

    def get_simhash(self):
        if not self.simhash:
            s = self.get_str()
            self.simhash = simhash(s)
            del s
        return self.simhash

    def get_str(self):
        # may unzip etc here
        return open(self.name).read()


def cluster_on_size(files, th=1e4):
    # Complete-linkage clustering on file size (L1 distance, threshold th bytes).
    if len(files) < 2: return []
    sz = [ [f.get_size()] for f in files ]
    dm = pdist(sz, 'minkowski', 1)
    link = linkage(dm, preserve_input=False, method='complete')
    cl = fcluster(link, t=th, criterion='distance')
    return make_clusters(cl, files)


def cluster_on_simhash(files, th=0.2):
    # Complete-linkage clustering on simhash distance within a size cluster.
    if len(files) < 2: return []

    def sh_dist(i, j):
        h1 = files[int(i)].get_simhash()
        h2 = files[int(j)].get_simhash()
        # dist b/w similar files is small:
        return 1 - h1.similarity(h2)

    sh = [ [i] for i in range(len(files)) ]
    dm = pdist(sh, sh_dist)
    link = linkage(dm, method='complete')
    cl = fcluster(link, t=th, criterion='distance')
    return make_clusters(cl, files)


def main(argv):
    if len(argv) == 1:
        print "Usage:", argv[0], "glob1 ..."
        return 1
    sz_th = 1e3     # max size difference (bytes) within a size cluster
    sim_th = 0.2    # max simhash distance within a subcluster
    ratio_th = 0.8  # min Levenshtein ratio to report a pair
    files = [ File(fn) for g in argv[1:] for fn in glob.glob(g) if os.path.isfile(fn) ]
    print "Processing", len(files), 'files'
    sz_clusters = cluster_on_size(files, sz_th)
    print "{} size-clusters formed (sz_th={} bytes, sim_th={}%, ratio_th={}%)".\
        format(len(sz_clusters), sz_th, sim_th*100, ratio_th*100)
    res = []
    for n, cl in enumerate(sz_clusters):
        print "cluster {} ({} files)".format(n, len(cl))
        res += cluster_on_simhash(cl, sim_th)
    # sort clusters in ascending order of file sizes
    res = sorted(res, key=lambda c: c[0].size)
    pprint(res)
    print 'Compute pairwise similarity. This may take... eternity'
    for c in res:
        for n, a in enumerate(c):
            for b in c[n+1:]:
                #print 'checking', a.name, b.name
                r = Levenshtein.ratio(a.get_str(), b.get_str())
                if r >= ratio_th:
                    print '{:0.3f} {} {}'.format(r, a.name, b.name)


if __name__ == "__main__":
    sys.exit(main(sys.argv))
Dependencies: scipy, fastcluster, python-Levenshtein, python-hashes. To use this script, install the dependencies (I used virtualenv + `pip install package-name`), make the script executable with `chmod +x filename.py`, and run it with the files of interest as arguments. You may also need to tweak the sz_th, sim_th and ratio_th parameters.
Of course it is slower than Sherlock, but you get the ability to play with different metrics, and a nonzero probability that the program finishes in the near future.
Good luck.
DuplicateFilesDeleter did a great job for me and my employee. It works very well, removes duplicate files quickly, and is a very results-oriented solution.