Old 07-05-2023, 03:11 PM   #16
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,226

Rep: Reputation: 5320

Well, what’s the use case here? If it’s to reclaim disk space, then the answer would be no: hard links to the same file don’t take up extra space, so you don’t care about them.
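
To illustrate the hard-link point: paths that share the same (st_dev, st_ino) pair are hard links to the same data on disk, so a dedupe pass can detect and skip them up front. A minimal sketch (the helper name is made up for illustration):
Code:
import os
from collections import defaultdict

def group_hard_links(paths):
    """Group paths by (device, inode); any group with more than one
    path is a set of hard links to the same underlying file."""
    by_inode = defaultdict(list)
    for p in paths:
        st = os.stat(p)
        by_inode[(st.st_dev, st.st_ino)].append(p)
    return [g for g in by_inode.values() if len(g) > 1]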
 
Old 07-05-2023, 03:35 PM   #17
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by EdGr View Post
I did that in C. Speed is required.
Ed
did that include implementing a hashing algorithm? if so, i'd agree that speed is essential.

my intent in Python is to use the existing Python "hashlib" implementation. it hashed my home directory (9.54+ GB) in under 15 seconds using the SHA1 algorithm (i had the entire tree cached in RAM at that point, and this was a measure of wall-clock time).
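
A minimal sketch of that kind of measurement, assuming a plain os.walk over the tree (the actual script isn't shown in the thread):
Code:
import hashlib
import os
import time

def hash_tree(top, algo=hashlib.sha1, chunk_size=1 << 20):
    """Hash every regular file under top; return total bytes hashed."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip fifos, sockets, broken symlinks
            h = algo()
            try:
                with open(path, 'rb') as f:
                    for chunk in iter(lambda: f.read(chunk_size), b''):
                        h.update(chunk)
                        total += len(chunk)
            except OSError:
                continue  # unreadable or vanished; skip it
    return total

start = time.monotonic()
nbytes = hash_tree(os.path.expanduser('~'))
print("hashed %.2f GB in %.1f s wall clock"
      % (nbytes / 1e9, time.monotonic() - start))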

even for things i expect to implement in C, i often find Python a nice way to prototype.

speed is important for a frequently used tool. in this case, a hash implementation in C, assembly, microcode, or a hardware gate array can certainly help. i believe Python's hashlib wraps the reference C implementations of these algorithms, so the hashing itself already runs at C speed.
 
Old 07-05-2023, 03:44 PM   #18
teckk
LQ Guru
 
Registered: Oct 2004
Distribution: Arch
Posts: 5,138
Blog Entries: 6

Rep: Reputation: 1827
Post #7 is an example that does exactly what you are asking for. It spits out duplicate files in two directories based on their hash, and it will also find duplicate files in a single directory if you give it one argument.

It's fairly fast because it first checks the starting chunk of each file and only goes further on a match, so it doesn't have to read the whole file each time.

Make a dict keyed on file size in bytes, then a dict keyed on (hash of the first 1K, size in bytes), then a dict of the full-file hashes:
Code:
import hashlib
import os
from collections import defaultdict

def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes_by_size = defaultdict(list)  # size_in_bytes -> [paths]
    hashes_on_1k = defaultdict(list)    # (hash1k, size_in_bytes) -> [paths]
    hashes_full = {}                    # full-file hash -> first path seen
Get all files that have the same size; they are the collision candidates:
Code:
for path in paths:
    for dirpath, dirnames, filenames in os.walk(path):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
If the path is a symlink (a soft link), this dereferences it, changing the value to the actual target file:
Code:
full_path = os.path.realpath(full_path)
file_size = os.path.getsize(full_path)
hashes_by_size[file_size].append(full_path)
For all files with the same size, hash their first 1024 bytes only:
Code:
for size_in_bytes, files in hashes_by_size.items():
    if len(files) < 2:
        continue  # file size is unique, no need to spend CPU cycles on it
    for filename in files:
        small_hash = get_hash(filename, first_chunk_only=True)
        hashes_on_1k[(small_hash, size_in_bytes)].append(filename)
For all files that collide on the first-1024-byte hash, hash the full file; full-hash collisions are duplicates:
Code:
for __, files_list in hashes_on_1k.items():
    if len(files_list) < 2:
        continue    # Hash of first 1k file bytes is unique, no need to spend CPU cycles on it
If they match, say so
Code:
for filename in files_list:
    try:
        full_hash = get_hash(filename, first_chunk_only=False)
    except OSError:
        continue  # file unreadable or vanished; skip it
    duplicate = hashes_full.get(full_hash)
    if duplicate:
        print("Duplicate found: {} and {}".format(filename, duplicate))
    else:
        hashes_full[full_hash] = filename
etc.

Alter it to suit your needs.
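
The fragments call a get_hash() helper that isn't shown in this excerpt. A minimal version consistent with how it's used above (the name and signature come from the calls; the body is an assumption) would be:
Code:
import hashlib

def get_hash(filename, first_chunk_only=False, hash=hashlib.sha1):
    hashobj = hash()
    with open(filename, 'rb') as f:
        if first_chunk_only:
            hashobj.update(f.read(1024))   # hash just the first 1 KiB
        else:
            for chunk in iter(lambda: f.read(1024), b''):
                hashobj.update(chunk)      # hash the whole file
    return hashobj.digest()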
 
Old 07-05-2023, 03:52 PM   #19
teckk
LQ Guru
 
Registered: Oct 2004
Distribution: Arch
Posts: 5,138
Blog Entries: 6

Rep: Reputation: 1827
Here is another example that keeps a log file.
Code:
#!/usr/bin/python

#Find duplicate files in directory tree.

import sys
import os
import hashlib

def chunk_reader(fobj, chunk_size=1024):
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes = {}
    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                hashobj = hash()
                try:
                    with open(full_path, 'rb') as fobj:
                        for chunk in chunk_reader(fobj):
                            hashobj.update(chunk)
                    file_id = (hashobj.digest(), os.path.getsize(full_path))
                    duplicate = hashes.get(file_id, None)
                    with open('Duplog.txt', 'a') as f:
                        if duplicate:
                            f.write(full_path + ' <-AND-> ' + duplicate + '\n')
                            print("Duplicate found: %s <-AND-> %s" % (full_path, duplicate))
                        else:
                            hashes[file_id] = full_path
                except OSError:
                    continue

if sys.argv[1:]:
    check_for_duplicates(sys.argv[1:])
else:
    print("Please pass the paths to check as parameters to the script")
Either script could also be altered to spit out the files that do not match.
 
Old 07-05-2023, 04:00 PM   #20
EdGr
Member
 
Registered: Dec 2010
Location: California, USA
Distribution: I run my own OS
Posts: 998

Rep: Reputation: 470
Quote:
Originally Posted by Skaperen View Post
did that include implementation of a hashing algorithm? if so, i'd agree, speed is essential.
Yes, I wrote my own hash, which runs a lot faster than md5sum or sha1sum for files cached in RAM (the scans directory is ~9GB).

Code:
% time fileinfo -c scans 1> /dev/null

real	0m2.408s
user	0m1.547s
sys	0m0.861s

% time sha1sum `find scans -type f` 1> /dev/null

real	0m18.818s
user	0m17.947s
sys	0m0.872s

% time md5sum `find scans -type f` 1> /dev/null

real	0m15.041s
user	0m14.071s
sys	0m0.970s
Ed
 
Old 07-05-2023, 05:35 PM   #21
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by EdGr View Post
The algorithm is similar to md5sum, sort, and uniq. The code is not free.
Ed
what is the advantage of that algorithm?
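
For reference, the classic shell version of that md5sum-sort-uniq approach (a sketch of the general idea, not Ed's actual tool): `uniq -w32 -D` keeps only lines whose first 32 characters (the md5 digest) repeat.
Code:
# hash every file, sort by digest, and print only repeated digests
find scans -type f -print0 | xargs -0 md5sum | sort | uniq -w32 -D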
 
Old 07-05-2023, 05:59 PM   #22
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by teckk View Post
Post that would you. If it's not too big. I want to see what approach you took.
it's big, so i'll give this link: http://ipal.net/try/list_dup_files_not_linked.py

it's not finished and could end up with big changes.
 
Old 07-05-2023, 06:11 PM   #23
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by Skaperen View Post
it's big, so i'll give this link: http://ipal.net/try/list_dup_files_not_linked.py

it's not finished and could end up with big changes.
if you want to try to run this, you might need my file tree walker generator at http://ipal.net/python/ftrgen.py

Last edited by Skaperen; 07-05-2023 at 06:13 PM.
 
Old 07-05-2023, 07:50 PM   #24
EdGr
Member
 
Registered: Dec 2010
Location: California, USA
Distribution: I run my own OS
Posts: 998

Rep: Reputation: 470
Quote:
Originally Posted by Skaperen View Post
what is the advantage of that algorithm?
The files don't need to be compared. If the 128-bit checksums match, the files are assumed to be identical.

The probability of a false positive is far below the hardware soft error rate, and the time required to find a collision by brute force is far beyond the age of the universe.
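
As a rough sanity check on that claim (my arithmetic, not part of the original post): by the birthday bound, the chance of any accidental collision among n files with a 128-bit hash is roughly n(n-1)/2 divided by 2^128. Even a billion files gives about 10^-21:
Code:
# birthday-bound estimate of an accidental 128-bit hash collision
n = 10**9                        # a billion files
p = n * (n - 1) / 2 / 2**128     # expected number of colliding pairs
print("collision probability ~ %.3g" % p)   # ~ 1.5e-21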
Ed

Last edited by EdGr; 07-05-2023 at 07:55 PM.
 
Old 07-05-2023, 09:13 PM   #25
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,226

Rep: Reputation: 5320
Quote:
Originally Posted by Skaperen View Post
if you want to try to run this, you might need my file tree walker generator at http://ipal.net/python/ftrgen.py
I'm not sure why you wrote that when os.walk exists.
 
Old 07-06-2023, 01:12 PM   #26
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by EdGr View Post
The files don't need to be compared. If the 128-bit checksums match, the files are assumed to be identical.

The probability of a false positive is far below the hardware soft error rate, and the time required is far above the age of the universe.
Ed
sorry, i meant: what is the advantage of your particular algorithm over existing ones like md5 and sha1?
 
Old 07-06-2023, 01:18 PM   #27
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by dugan View Post
I'm not sure why you wrote that when os.walk exists.
i just knew that if i posted sources, someone would slide off topic. i should have known it would be you. if you really want to know why, you should ask on a new thread and refer me to it. if you don't do that, i'll assume you don't really care. my bet is that you don't really care.
 
Old 07-06-2023, 02:36 PM   #28
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,226

Rep: Reputation: 5320
I have not gotten off topic. Get a grip.

And your knee-jerk refusal to answer was an incredibly obvious way to say “I didn’t know about it” while trying to save face.

Now, do you want to do this right (which is exactly what my question was on-topic and relevant to), or do you want to talk around it in ways that preserve your ego and avoid acknowledging that you did anything wrong?

Last edited by dugan; 07-06-2023 at 02:55 PM.
 
Old 07-06-2023, 02:59 PM   #29
EdGr
Member
 
Registered: Dec 2010
Location: California, USA
Distribution: I run my own OS
Posts: 998

Rep: Reputation: 470
Quote:
Originally Posted by Skaperen View Post
sorry, i meant what is the advantage of your particular algorithm invention over others like md5 and sha1?
I don't need a secure hash.

md5 and sha1 have broken security anyway.

BTW, this thread has caused me to re-examine hash algorithms.
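
Since the scripts earlier in the thread take the hash constructor as a parameter, trying a different hashlib algorithm is a one-line change. A sketch reusing check_for_duplicates() from post #19 with blake2b (in hashlib since Python 3.6):
Code:
import hashlib
import sys

# same duplicate scan, but with blake2b instead of sha1;
# for dedupe work the hash choice mostly affects speed, not correctness
check_for_duplicates(sys.argv[1:], hash=hashlib.blake2b)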
Ed
 
Old 07-07-2023, 01:07 AM   #30
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,863

Rep: Reputation: 7311
I'm totally lost. What is the goal now?
 