hashing a tree of files

Skaperen · 07-04-2023, 07:14 PM

using Python3, i want to hash the contents of a tree of files to verify that all the files in one tree are the same as all the files in another tree even if the files are in random order, as long as the names are correct with the correct content matching the names. what can you suggest?

TB0ne · 07-04-2023, 07:23 PM

Quote:

Originally Posted by Skaperen

using Python3, i want to hash the contents of a tree of files to verify that all the files in one tree are the same as all the files in another tree even if the files are in random order, as long as the names are correct with the correct content matching the names. what can you suggest?

I'd sort the files by name/path, MD5 each one, and store the pairs in an array. Do the same for the other tree, then compare the two arrays. It depends on the number of files, of course: a few hundred shouldn't be a huge problem, but if you're talking thousands, it could take a bit, or may require a temp file. Loads of ways to approach it.
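A minimal sketch of that idea in Python 3, under a few assumptions: the helper names (`file_digest`, `tree_listing`, `trees_match`) are made up for illustration, SHA-256 is swapped in for MD5, and paths are taken relative to each tree's root so the two trees can live anywhere. Symlinks and unreadable files are not handled.

```python
import hashlib
import os


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_listing(root):
    """Return a sorted list of (relative_path, digest) pairs for root."""
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            pairs.append((os.path.relpath(full, root), file_digest(full)))
    # Sorting makes the result independent of the order os.walk returns.
    pairs.sort()
    return pairs


def trees_match(root_a, root_b):
    """True if both trees hold the same relative paths with the same content."""
    return tree_listing(root_a) == tree_listing(root_b)
```

Because each listing is sorted before the comparison, the order in which the files are found on disk does not matter.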

chrism01 · 07-04-2023, 08:21 PM

I'd use hashes in Perl for a couple of reasons:

1. No sorting required.

2. It's easier to deal with the two lists not being identical, i.e. files in list A that are not in list B and vice versa.
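Python's dict is the equivalent of a Perl hash, so the same two advantages carry over. The following is only a sketch: `digest_map` and `compare_trees` are hypothetical names, and SHA-256 is an arbitrary choice of digest.

```python
import hashlib
import os


def digest_map(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(full, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            result[os.path.relpath(full, root)] = h.hexdigest()
    return result


def compare_trees(root_a, root_b):
    """Return (only_in_a, only_in_b, differing) as sorted lists of paths."""
    a, b = digest_map(root_a), digest_map(root_b)
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    differ = sorted(p for p in a.keys() & b.keys() if a[p] != b[p])
    return only_a, only_b, differ
```

The sorting here is only for tidy output; the comparison itself relies on set operations over the dict keys.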

NevemTeve · 07-04-2023, 09:49 PM

You can compare the content of two files without calculating any hash. Hint: cmp(1)
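Since the question asks for Python 3, the closest standard-library counterpart to cmp(1) is probably `filecmp.cmp`. A small sketch (the wrapper name `same_content` is made up here):

```python
import filecmp


def same_content(path_a, path_b):
    """Compare two files by content, roughly like cmp(1).

    shallow=False makes filecmp compare the file contents instead of
    trusting matching os.stat() signatures (type, size, mtime).
    """
    return filecmp.cmp(path_a, path_b, shallow=False)
```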

dugan · 07-04-2023, 10:35 PM

Or “diff -b”

I’d iterate over the tree, look for each file in “the other tree” with the same filename, and do the comparison. The code for that is fairly obvious.

One good speed optimization is to not even look at the file contents unless the files have the same sizes. Reading sizes is a very fast operation.
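For what it's worth, here is roughly what that could look like in Python 3. It is a sketch with made-up names (`same_bytes`, `first_mismatch`), and it only walks one tree, so files that exist only in the other tree go unnoticed unless it is run in both directions.

```python
import os


def same_bytes(path_a, path_b, chunk_size=65536):
    """Compare two files chunk by chunk, stopping at the first difference."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files ended at the same point
                return True


def first_mismatch(root_a, root_b):
    """Return the first relative path under root_a that is missing or
    different under root_b, or None if every file matches."""
    for dirpath, _dirnames, filenames in os.walk(root_a):
        for name in filenames:
            path_a = os.path.join(dirpath, name)
            rel = os.path.relpath(path_a, root_a)
            path_b = os.path.join(root_b, rel)
            if not os.path.isfile(path_b):
                return rel
            # Cheap check first: different sizes means different content.
            if os.path.getsize(path_a) != os.path.getsize(path_b):
                return rel
            if not same_bytes(path_a, path_b):
                return rel
    return None
```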

pan64 · 07-05-2023, 02:13 AM

dupfinder?

teckk · 07-05-2023, 10:20 AM

Quote:

using Python3, i want to hash the contents of a tree of files to verify that all the files in one tree are the same

findDup.py

Code:

#!/usr/bin/env python3

from collections import defaultdict
import hashlib
import os
import sys

def chunk_reader(fobj, chunk_size=1024):
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

def get_hash(filename, first_chunk_only=False, hash=hashlib.sha1):
    hashobj = hash()
    # The with block closes the file even if reading raises an error.
    with open(filename, 'rb') as file_object:
        if first_chunk_only:
            # Hash only the first 1 KiB as a cheap pre-filter.
            hashobj.update(file_object.read(1024))
        else:
            for chunk in chunk_reader(file_object):
                hashobj.update(chunk)
    return hashobj.digest()

def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes_by_size = defaultdict(list)
    hashes_on_1k = defaultdict(list)
    hashes_full = {}

    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                try:
                    full_path = os.path.realpath(full_path)
                    file_size = os.path.getsize(full_path)
                    hashes_by_size[file_size].append(full_path)
                except (OSError,):
                    continue

    for size_in_bytes, files in hashes_by_size.items():
        if len(files) < 2:
            continue

        for filename in files:
            try:
                small_hash = get_hash(filename, first_chunk_only=True, hash=hash)
                hashes_on_1k[(small_hash, size_in_bytes)].append(filename)
            except (OSError,):
                continue

    for __, files_list in hashes_on_1k.items():
        if len(files_list) < 2:
            continue

        for filename in files_list:
            try:
                full_hash = get_hash(filename, first_chunk_only=False, hash=hash)
                duplicate = hashes_full.get(full_hash)
                if duplicate:
                    print("Duplicate found: {} and {}".format(filename, duplicate))
                else:
                    hashes_full[full_hash] = filename
            except (OSError,):
                continue
                
if __name__ == "__main__":
    if sys.argv[1:]:
        check_for_duplicates(sys.argv[1:])
    else:
        print("Please pass the paths to check as parameters to the script")

And that would be run with the paths as arguments:

Code:

python ./findDup.py /path/to/dir1 /path/to/dir2

EdGr · 07-05-2023, 11:29 AM

I did that in C. Speed is required.
Ed

teckk · 07-05-2023, 11:47 AM

Post that, would you? If it's not too big. I want to see what approach you took.

EdGr · 07-05-2023, 11:55 AM

The algorithm is similar to md5sum, sort, and uniq. The code is not free.
Ed

Skaperen · 07-05-2023, 02:42 PM

Quote:

Originally Posted by NevemTeve

You can compare the content of two files without calculating any hash. Hint: cmp(1)

how do i use cmp(1) to compare every file (in a directory or tree or whatever) to all the others? the reason i would use a hash is so i don't have to read a file again every time i need to compare it to another file.

dugan · 07-05-2023, 02:49 PM

With loops. Duh.

And you can have your code memoize the comparison results. That means that you store them so that you know not to calculate them twice.

That said, your idea of using hashes would work well and you should just go ahead and code it.

Skaperen · 07-05-2023, 02:55 PM

Quote:

Originally Posted by dugan

Or “diff -b”

I’d iterate over the tree, look for each file in “the other tree” with the same filename, and do the comparison. The code for that is fairly obvious.

One good speed optimization is to not even look at the file contents unless the files have the same sizes. Reading sizes is a very fast operation.

the reason i would use a hash is to be able to "compare" the file to many other files. once i have a digest of its contents, comparing digests is nearly as trivial as comparing sizes. comparing two digests completes as "unequal" as soon as the first part (potentially 32 bits or 64 bits) is unequal. ISTM that at this point, comparing sizes is a waste of time, though if i were using C i might find a way to integrate such comparisons together.

i am curious how “diff -b” adds to any of this.
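The "hash once, compare many times" idea can be sketched like this in Python 3: each file is read exactly once, and every later comparison is a dict lookup on the digest. The name `group_by_digest` and the choice of SHA-256 are assumptions for illustration only.

```python
import hashlib
import os
from collections import defaultdict


def group_by_digest(root):
    """Map each content digest to the list of paths that have that content."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(full, "rb") as f:  # each file is read only once
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            groups[h.digest()].append(full)
    return groups
```

Any digest whose list holds more than one path marks a set of files with identical content.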

NevemTeve · 07-05-2023, 02:58 PM

As a start, show the current state of your program.

Skaperen · 07-05-2023, 03:08 PM

Quote:

Originally Posted by pan64

dupfinder?

all i find online is a Windows version. my project is for Linux. i do not find a Linux version in the online search results (first 25 results).

i would be curious if it considers a file hard linked under 2 or more names to be duplicates or not. what i have put together, so far, does not. it does this by recording inodes as file "names", although it also records all links and shows (hard) links of each inode in the results.
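A rough sketch of the inode bookkeeping described above, not the actual project code: it groups paths by `(st_dev, st_ino)`, so several hard links to one file end up in a single entry. The function name `links_by_inode` is made up, and only regular files are considered.

```python
import os
import stat
from collections import defaultdict


def links_by_inode(root):
    """Group regular files under root by (device, inode).

    Paths that are hard links to the same file share one key, so they
    can be treated as one file with several names, not as duplicates.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)  # lstat: do not follow symlinks
            if stat.S_ISREG(st.st_mode):
                groups[(st.st_dev, st.st_ino)].append(full)
    return groups
```

Hashing one path per key would then avoid reading hard-linked content more than once.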