I used photorec to try to recover some files that I had accidentally deleted (shift+delete) from my home directory. I ended up with nearly 50,000 files found by photorec that I need to sift through to identify the few dozen I care about. I was able to narrow down the list of likely candidates to under 8,000 files based on the file extensions that photorec assigned. Then I eliminated files that were 100% duplicates from that population and got the list down to 5,800 files.
The files are text, so I may be able to weed the list down by a couple thousand more through careful use of grep, but I'm sure I won't find all the files I'm interested in that way. Even if I could, a bigger problem is that there may be dozens or even hundreds of revisions of a file, and I only want the last version of it. In one case, I found 373 revisions of a target file. I was able to find the last version of it only because I recently edited it, and I remembered a unique string that only occurred in that final version. I won't be so lucky in other cases.
So I'm looking for ideas on how to approach this data reduction problem. I was thinking that if I could identify files that are similar enough to each other, differing only in small detail, then they might fall into groups that I could view or process further in hopes of isolating the best version to save.
For implementation, either piped commands, bash scripts, or a C++ program are within the realm of possibility for me, although I'm not an ace with any of them. But before I can start on an implementation, I need help with the approach.
The basic goal is to group files by degree of similarity. If I can do that, I should be able to quickly reduce the number of files significantly just by viewing a few files from each group. I welcome all ideas on how to accomplish this. Ideas about other ways of approaching the problem would be welcome too.
In case anyone is interested, I found on a blog a masterful set of piped commands that I used at the command line to identify and delete duplicate files. The person who came up with it clearly has superior command-line fu! It's way beyond my meager capabilities. Unfortunately, I don't think it will be easy to adapt it to identifying near-duplicate files and grouping them.
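The exact pipeline isn't reproduced here, but the general idea is simply to hash every file and keep one copy per digest. For anyone curious, a rough Python sketch of the same idea (not the blog's one-liner) would be:
Code:
#!/usr/bin/env python
# Rough sketch of the duplicate-finding idea: group files by MD5 digest.
# Not the blog's pipeline, just the same approach spelled out.
import hashlib, os, sys
from collections import defaultdict

groups = defaultdict(list)
for root, dirs, names in os.walk(sys.argv[1]):
    for name in names:
        path = os.path.join(root, name)
        with open(path, 'rb') as f:
            groups[hashlib.md5(f.read()).hexdigest()].append(path)

for digest, paths in groups.items():
    if len(paths) > 1:
        # identical content recovered more than once; keep paths[0],
        # the rest are candidates for deletion
        print('%s: %d copies' % (digest, len(paths)))
        for p in paths[1:]:
            print('    duplicate: %s' % p)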
Quote:
(...) The basic goal is to group files by degree of similarity.
Interesting question. Apart from the realm of humongous one-liners, there are actually tools to test text similarity, like Sherlock, Similarity-utils, Simhash and SIM (you probably don't want SimSearch or phash), but I can't say how useful they are as I haven't tested them. I may well need some of them myself.
Thanks! I don't have md5deep on my system, but I'm going to install it. It looks quite interesting.
And thank you most especially for the links to Sherlock and the other utilities. They sound like they might be just what I need. I'll install them and check them out.
I found a paper written in 1993 by Udi Manber, http://webglimpse.net/pubs/TR93-33.pdf, that discusses an approach to solving this sort of problem. Sherlock and perhaps some of the other programs seem like they may be based on this type of approach.
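From a quick read, the core idea in the paper is to take fingerprints of small chunks of each file and call two files similar when they share a large fraction of their fingerprints. A toy sketch of that kind of fingerprint overlap (my own simplification, not the paper's exact algorithm) would look like this:
Code:
# Toy fingerprint-overlap similarity in the spirit of the paper: hash
# overlapping word shingles and compare the resulting sets.
# A simplification for illustration, not Manber's actual scheme.
import hashlib

def fingerprints(text, k=4):
    words = text.split()
    n = max(len(words) - k + 1, 1)
    shingles = (' '.join(words[i:i+k]) for i in range(n))
    return set(hashlib.md5(s.encode('utf-8')).hexdigest()[:8] for s in shingles)

def resemblance(text_a, text_b):
    # Jaccard overlap of the two fingerprint sets, between 0.0 and 1.0
    fa, fb = fingerprints(text_a), fingerprints(text_b)
    if not fa or not fb:
        return 0.0
    return float(len(fa & fb)) / len(fa | fb)

# e.g. resemblance(open('a.txt').read(), open('b.txt').read())
A per-file fingerprint set like this is what lets such tools avoid comparing every pair of files character by character.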
bigearsbilly, does your method find similar files, or exact duplicates? I can find exact duplicates easily using the command line one-liner I posted previously.
What I need to do is compare and group files that are different from each other, but similar enough that they have a high probability of being the same file with only minor edits differentiating them.
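For a quick-and-dirty score of "same file with minor edits", even the standard Python difflib module can be pressed into service on a pair of candidates (just a rough illustration, not what Sherlock does):
Code:
# Quick-and-dirty similarity score between two text files using only the
# standard library (a rough illustration, not Sherlock's algorithm).
import difflib, sys

def similarity(path_a, path_b):
    a = open(path_a).read()
    b = open(path_b).read()
    # ratio() is 1.0 for identical texts and close to 0.0 for unrelated ones
    return difflib.SequenceMatcher(None, a, b).ratio()

if __name__ == '__main__':
    print(similarity(sys.argv[1], sys.argv[2]))
The catch is cost: scoring every pair of 5,800 files means millions of comparisons, which is why grouping the files first (by size, or by a cheap hash) matters so much.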
I'm using the Sherlock program that unSpawn pointed me to. It is working pretty well for me, although it is still a laborious effort to sift through the dozens of similar files it identifies to try to figure out which one is the best version to save.
Here is my approach (if you want to try something different):
1. Cluster files by size, as two files whose sizes differ by 1 MB are hardly similar. Let's call this size threshold sz_th. So we form clusters of files of similar size.
2. Divide each cluster into subclusters using some relatively cheap "approximate hash" approach (e.g. simhash). Denote the maximum distance between files in one subcluster (in the simhash metric) by sim_th.
3. Finally, perform a pairwise similarity comparison within each subcluster using, e.g., Levenshtein.ratio(), and display only the pairs with similarity >= ratio_th. This step may take a long time if big binary files are involved.
In other words, reduce the number of candidates as much as possible before the expensive comparison.
Here is the Python code:
Code:
#!/usr/bin/env python
# Group candidate files by size, then by simhash, then confirm pairs with
# Levenshtein ratio.
from hashes.simhash import simhash
import Levenshtein
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import fcluster
from fastcluster import linkage
from pprint import pprint
import sys, glob, os


def make_clusters(cl, files):
    # Turn fcluster()'s flat labels into lists of File objects,
    # dropping singleton clusters.
    clusters = { i:[] for i in xrange(1, max(cl)+1) }
    for i, n in enumerate(cl):
        clusters[n] += [ files[i] ]
    return [ c for c in clusters.values() if len(c) > 1 ]


class File:
    simhash = None
    size = None

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        s = '<' + self.name
        if self.size:
            s += '; sz=' + str(self.size)
        if self.simhash:
            s += '; hash=' + str(self.simhash)
        s += '>'
        return s

    def get_size(self):
        if not self.size:
            self.size = os.stat(self.name).st_size
        return self.size

    def get_simhash(self):
        if not self.simhash:
            s = self.get_str()
            self.simhash = simhash(s)
            del s
        return self.simhash

    def get_str(self):
        # may unzip etc here
        return open(self.name).read()


def cluster_on_size(files, th=1e4):
    # Complete-linkage clustering on file size (L1 distance, threshold th bytes).
    if len(files) < 2: return []
    sz = [ [f.get_size()] for f in files ]
    dm = pdist(sz, 'minkowski', 1)
    link = linkage(dm, preserve_input=False, method='complete')
    cl = fcluster(link, t=th, criterion='distance')
    return make_clusters(cl, files)


def cluster_on_simhash(files, th=0.2):
    # Complete-linkage clustering on simhash distance within a size cluster.
    if len(files) < 2: return []

    def sh_dist(i, j):
        h1 = files[int(i)].get_simhash()
        h2 = files[int(j)].get_simhash()
        # dist b/w similar files is small:
        return 1 - h1.similarity(h2)

    sh = [ [i] for i in range(len(files)) ]
    dm = pdist(sh, sh_dist)
    link = linkage(dm, method='complete')
    cl = fcluster(link, t=th, criterion='distance')
    return make_clusters(cl, files)


def main(argv):
    if len(argv) == 1:
        print "Usage:", argv[0], "glob1 ..."
        return 1
    sz_th = 1e3     # max size difference (bytes) within a size cluster
    sim_th = 0.2    # max simhash distance within a subcluster
    ratio_th = 0.8  # min Levenshtein ratio to report a pair
    files = [ File(fn) for g in argv[1:] for fn in glob.glob(g) if os.path.isfile(fn) ]
    print "Processing", len(files), 'files'
    sz_clusters = cluster_on_size(files, sz_th)
    print "{} size-clusters formed (sz_th={} bytes, sim_th={}%, ratio_th={}%)".\
        format(len(sz_clusters), sz_th, sim_th*100, ratio_th*100)
    res = []
    for n, cl in enumerate(sz_clusters):
        print "cluster {} ({} files)".format(n, len(cl))
        res += cluster_on_simhash(cl, sim_th)
    # sort clusters in ascending order of file sizes
    res = sorted(res, key=lambda c: c[0].size)
    pprint(res)
    print 'Compute pairwise similarity. This may take... eternity'
    for c in res:
        for n, a in enumerate(c):
            for b in c[n+1:]:
                #print 'checking', a.name, b.name
                r = Levenshtein.ratio(a.get_str(), b.get_str())
                if r >= ratio_th:
                    print '{:0.3f} {} {}'.format(r, a.name, b.name)


if __name__ == "__main__":
    sys.exit(main(sys.argv))
Dependencies: scipy, fastcluster, python-Levenshtein, python-hashes. To use this script, install the dependencies (I used virtualenv + `pip install package-name`), make the script executable with `chmod +x filename.py`, and run it with the files of interest as arguments. You may also need to tweak the sz_th, sim_th and ratio_th parameters.
Of course it is slower than Sherlock, but you get the ability to play with different metrics, and a nonzero probability that the program finishes in the near future.
Good luck.
DuplicateFilesDeleter did a great job for me and my employee. It works very well, removes duplicate files quickly, and is a very results-oriented solution.