Old 09-07-2013, 05:08 PM   #1
Z038
Member
 
Registered: Jan 2006
Location: Dallas
Distribution: Slackware
Posts: 910

Rep: Reputation: 174
Identifying similar files


I used photorec to try to recover some files that I had accidentally deleted (shift+delete) from my home directory. I ended up with nearly 50,000 files found by photorec that I need to sift through to identify the few dozen I care about. I was able to narrow down the list of likely candidates to under 8,000 files based on the file extensions that photorec assigned. Then I eliminated files that were 100% duplicates from that population and got the list down to 5,800 files.

The files are text, so I may be able to weed the list down by a couple thousand more through careful use of grep, but I'm sure I won't find all the files I'm interested in that way. Even if I could, a bigger problem is that there may be dozens or even hundreds of revisions of a file, and I only want the last version of it. In one case, I found 373 revisions of a target file. I was able to find the last version of it only because I recently edited it, and I remembered a unique string that only occurred in that final version. I won't be so lucky in other cases.

So I'm looking for ideas on how to approach this data reduction problem. I was thinking that if I could identify files that are similar enough to each other, differing only in minor details, then they might fall into groups that I could view or process further in hopes of isolating the best version to save.

For implementation, piped commands, bash scripts, or a C++ program are all within the realm of possibility for me, although I'm not an ace with any of them. But before I can start on an implementation, I need help with the approach.

The basic goal is to group files by degree of similarity. If I can do that, I should be able to quickly reduce the number of files significantly just by viewing a few files from each group. I welcome all ideas on how to accomplish this. Ideas about other ways of approaching the problem would be welcome too.

In case anyone is interested, I found on a blog here a masterful set of piped commands that I used at the command line to identify and delete duplicate files. The person who came up with this clearly has superior command-line fu! It's way beyond my meager capabilities. Unfortunately, I don't think it will be easy to adapt it to identifying near-duplicate files and grouping them.

Report on duplicate files:

Code:
find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
Delete duplicate files. Remove the "p" flag from the last xargs command to skip the confirmation prompt for each deletion.

Code:
find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 \
| xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate \
| cut -f3-100 -d ' ' | tr '\n.' '\t.' | sed 's/\t\t/\n/g' | cut -f2-100 | tr '\t' '\n' \
| perl -i -pe 's/([ (){}-])/\\$1/g' | perl -i -pe 's/'\''/\\'\''/g' | xargs -pr rm -v
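For anyone who finds the one-liners hard to follow, the same size-then-md5 idea can be sketched in plain Python. This is only a rough sketch that I have not run against the photorec output; the top-level directory is a placeholder.

Code:
#!/usr/bin/env python
# Rough sketch of the same idea as the one-liners above: group non-empty
# files by size, then by md5 within each size group, and print groups of
# identical files separated by blank lines.  'top' is a placeholder.
import hashlib, os
from collections import defaultdict

top = '.'

by_size = defaultdict(list)
for root, dirs, names in os.walk(top):
    for name in names:
        path = os.path.join(root, name)
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            by_size[os.path.getsize(path)].append(path)

for size, paths in by_size.items():
    if len(paths) < 2:
        continue                     # unique size, cannot have a duplicate
    by_hash = defaultdict(list)
    for path in paths:
        with open(path, 'rb') as f:
            by_hash[hashlib.md5(f.read()).hexdigest()].append(path)
    for dupes in by_hash.values():
        if len(dupes) > 1:           # same size and same md5: duplicates
            print('\n'.join(dupes) + '\n')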
 
Old 09-07-2013, 06:50 PM   #2
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,415
Blog Entries: 55

Rep: Reputation: 3600
This report on duplicate files:
Code:
find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 \
| xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
could be replaced by this:
Code:
md5deep -r /some/path | sort | uniq -w32 --all-repeated=separate

Quote:
Originally Posted by Z038
(..)The basic goal is to group files by degree of similarity.
Interesting question. Apart from the realm of humongous one-liners, there are actually tools to test text similarity, like Sherlock, Similarity-utils, Simhash and SIM (you probably don't want SimSearch or phash), but I can't say how useful they are as I haven't tested them. I may well need some of them myself.
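If none of those are packaged for your distro, a very rough first pass can be done with nothing but Python's standard library difflib. This is only a sketch (untested, and quadratic in the number of files, so feed it small candidate sets; the 0.8 threshold is an arbitrary placeholder):

Code:
#!/usr/bin/env python
# Sketch: pairwise text similarity using only the standard library.
# SequenceMatcher.ratio() is 1.0 for identical texts, 0.0 for unrelated ones.
import difflib, sys

threshold = 0.8                       # arbitrary cut-off, tune to taste
paths = sys.argv[1:]
texts = dict((p, open(p).read()) for p in paths)

for i, a in enumerate(paths):
    for b in paths[i + 1:]:
        ratio = difflib.SequenceMatcher(None, texts[a], texts[b]).ratio()
        if ratio >= threshold:
            print('%.3f %s %s' % (ratio, a, b))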
 
1 member found this post helpful.
Old 09-07-2013, 09:11 PM   #3
Z038
Member
 
Registered: Jan 2006
Location: Dallas
Distribution: Slackware
Posts: 910

Original Poster
Rep: Reputation: 174
Thanks! I don't have md5deep on my system, but I'm going to install it. It looks quite interesting.

And thank you most especially for the links to Sherlock and the other utilities. They sound like they might be just what I need. I'll install them and check them out.

I found a paper written in 1993 by Udi Manber, http://webglimpse.net/pubs/TR93-33.pdf, that discusses an approach to solving this sort of problem. Sherlock and perhaps some of the other programs seem like they may be based on this type of approach.
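As far as I can tell, the core idea in the paper is to fingerprint each file with checksums of many short substrings and then compare the fingerprint sets. A toy illustration of that idea, using word shingles and the Jaccard coefficient instead of the paper's anchor-based fingerprints, might look roughly like this (just a sketch, not tested at any real scale):

Code:
#!/usr/bin/env python
# Toy illustration of fingerprint-based similarity: each file becomes the
# set of hashes of its overlapping k-word shingles, and two files are
# compared by the Jaccard coefficient of those sets.  k is a placeholder.
import sys

def shingles(path, k=5):
    words = open(path).read().split()
    return set(hash(' '.join(words[i:i + k]))
               for i in range(max(len(words) - k + 1, 1)))

def jaccard(a, b):
    if not a or not b:
        return 0.0
    return float(len(a & b)) / len(a | b)

if __name__ == '__main__':
    f1, f2 = sys.argv[1], sys.argv[2]
    print('%.3f' % jaccard(shingles(f1), shingles(f2)))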
 
Old 09-09-2013, 09:29 AM   #4
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239
you can use cksum too, or md5sum.

I've got a perl script to do it if you like.
 
Old 09-10-2013, 10:49 AM   #5
Z038
Member
 
Registered: Jan 2006
Location: Dallas
Distribution: Slackware
Posts: 910

Original Poster
Rep: Reputation: 174
bigearsbilly, does your method find similar files, or exact duplicates? I can find exact duplicates easily using the command line one-liner I posted previously.

What I need to do is compare and group files that are different from each other, but similar enough that they have a high probability of being the same file with only minor edits differentiating them.

I'm using the Sherlock program that unSpawn pointed me to. It is working pretty well for me, although it is still a laborious effort to sift through the dozens of similar files it identifies to try to figure out which one is the best version to save.
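One thing that might speed up the sifting is to diff each candidate in a group against the largest file in the group, so only the differences have to be read. A minimal sketch (untested; it assumes the group of candidate file names is already known, and "the largest file is the most complete version" is only a heuristic):

Code:
#!/usr/bin/env python
# Sketch: given a group of near-duplicate text files on the command line,
# show a unified diff of every candidate against the largest file, so only
# the differences need to be read.  "Largest = most complete" is a heuristic.
import difflib, os, sys

paths = sys.argv[1:]
reference = max(paths, key=os.path.getsize)
ref_lines = open(reference).readlines()

for path in paths:
    if path == reference:
        continue
    diff = difflib.unified_diff(ref_lines, open(path).readlines(),
                                fromfile=reference, tofile=path)
    sys.stdout.writelines(diff)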
 
Old 09-11-2013, 04:03 AM   #6
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239
Aah no, mine finds exact duplicates and outputs a script to optionally remove them.
I use it for mp3s.

You will probably need the human eyeball.
 
Old 09-11-2013, 11:12 AM   #7
firstfire
Member
 
Registered: Mar 2006
Location: Ekaterinburg, Russia
Distribution: Debian, Ubuntu
Posts: 709

Rep: Reputation: 428
Hi.

Here is my approach (if you want to try something different):
  1. Cluster files by size, since two files whose sizes differ by a megabyte are hardly similar. Let's call this size threshold sz_th. This gives clusters of files of similar size.
  2. Divide each cluster into subclusters using some relatively cheap 'approximate hash' approach (e.g. simhash). Denote the maximum distance between files in one subcluster (in the simhash metric) by sim_th.
  3. Finally, perform pairwise similarity comparison within those subclusters using, e.g., Levenshtein.ratio(), and display only pairs with similarity >= ratio_th. This step may take a long time if big binary files are involved.
In other words, reduce the number of variants as much as possible before doing the expensive comparison.

Here is the Python code:
Code:
#!/usr/bin/env python

from hashes.simhash import simhash
import Levenshtein
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import fcluster
from fastcluster import linkage
from pprint import pprint
import sys, glob, os

def make_clusters(cl, files):
    clusters = { i:[] for i in xrange(1, max(cl)+1)}
    for i,n in enumerate(cl):
        clusters[n] += [ files[i] ]
    return [ c for c in clusters.values() if len(c)>1 ]

class File:
    simhash = None
    size = None
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        s = '<' + self.name
        if self.size:
            s += '; sz='+ str(self.size)
        if self.simhash:
            s += '; hash='+ str(self.simhash)
        s += '>'
        return s
    def get_size(self):
        if not self.size:
            self.size = os.stat(self.name).st_size
        return self.size
    def get_simhash(self):
        if not self.simhash:
            s = self.get_str()
            self.simhash = simhash(s)
            del s
        return self.simhash
    def get_str(self):
        # may unzip etc here
        return open(self.name).read()

def cluster_on_size(files, th=1e4):
    if len(files) < 2: return []
    sz = [ [f.get_size()] for f in files]
    dm = pdist( sz, 'minkowski', 1)
    link = linkage(dm, preserve_input=False, method='complete')
    cl = fcluster(link, t=th, criterion='distance')
    return make_clusters(cl, files)

def cluster_on_simhash(files, th=0.2):
    if len(files) < 2: return []

    def sh_dist(i, j):
        h1 = files[int(i)].get_simhash()
        h2 = files[int(j)].get_simhash()
        # dist b/w similar files is small:
        return 1 - h1.similarity(h2)

    sh = [ [i] for i in range( len(files) ) ]
    dm = pdist( sh, sh_dist )

    link = linkage( dm, method='complete')
    cl = fcluster(link, t=th, criterion='distance')
    return make_clusters(cl, files)

def main(argv):
    if len(argv) == 1:
        print "Usage:", argv[0], "glob1 ..."
        return 1
    
    sz_th  = 1e3
    sim_th = 0.2
    ratio_th = 0.8

    files = [ File(fn) for g in argv[1:] for fn in glob.glob(g) if os.path.isfile(fn)]
    print "Processing", len(files), 'files'
    sz_clusters = cluster_on_size(files, sz_th)
    print "{} size-clusters formed (sz_th={} bytes, sim_th={}%, ratio_th={}%)".\
            format(len(sz_clusters), sz_th, sim_th*100, ratio_th*100)

    res = []
    for n, cl in enumerate(sz_clusters):
        print "cluster {} ({} files)".format(n, len(cl) )
        res += cluster_on_simhash(cl, sim_th)

    # sort clusters in ascending order of file sizes
    res = sorted(res, key=lambda c: c[0].size)

    pprint(res)

    print 'Compute pairwise similarity. This may take... eternity'

    for c in res:
        for n,a in enumerate(c):
            for b in c[n+1:]:
                #print 'checking', a.name, b.name
                r = Levenshtein.ratio(a.get_str(), b.get_str())
                if r >= ratio_th:
                    print '{:0.3f} {} {}'.format(r, a.name, b.name)


if __name__ == "__main__":
    sys.exit(main(sys.argv))
Dependencies: scipy, fastcluster, python-Levenshtein, python-hashes. To use this script, install the dependencies (I used virtualenv + `pip install package-name'), make the script executable using `chmod +x filename.py', and run it with the files of interest as arguments. You may also need to tweak the sz_th, sim_th, and ratio_th parameters.

Of course it is slower than Sherlock, but you get the ability to play with different metrics and a nonzero probability that the program finishes in the near future.

Good luck.

Last edited by firstfire; 09-11-2013 at 11:18 AM. Reason: Fix a typo.
 
2 members found this post helpful.
Old 09-11-2013, 05:30 PM   #8
Z038
Member
 
Registered: Jan 2006
Location: Dallas
Distribution: Slackware
Posts: 910

Original Poster
Rep: Reputation: 174
firstfire, that looks like an excellent approach. I will give your code a try. Thanks very much for posting it.
 
Old 02-14-2014, 06:33 AM   #9
simon67
LQ Newbie
 
Registered: Feb 2014
Posts: 1

Rep: Reputation: Disabled
Duplicate files

I used DuplicateFilesDeleter to find and remove duplicate files.
 
Old 02-14-2014, 07:34 AM   #10
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,840

Rep: Reputation: 7308
see here: https://www.linuxquestions.org/quest...-files-591513/
 
Old 11-21-2014, 03:10 AM   #11
alexias
LQ Newbie
 
Registered: Nov 2014
Posts: 1

Rep: Reputation: Disabled
DuplicateFilesDeleter did a great job for me and my employee. It works very well, removes duplicate files quickly, and is a very good, results-oriented solution.
 
  

