There are a lot of ways to solve this. Recently I have started using rsync with --dry-run and verbose output. The reason is that it has tons of options, it's very fast, and it's well tested and documented. Think about devices, soft/hard links, sparse files, file ownership/permissions/ACLs, and so on. It works over ssh, with --bwlimit if I want it. And do I really need to compare or hash the files when the size and date are the same? It's easy to play with all the options, like --checksum and --checksum-choice. But it's not written in Python.
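If you do still want to drive it from Python, here is a minimal sketch under the assumption that letting rsync do the comparing is acceptable; SRC and DST are hypothetical paths, not anything from this thread.
Code:
import subprocess

# Hypothetical trees to compare; adjust to your setup.
SRC = "/data/tree-a/"
DST = "/data/tree-b/"

# -a preserves links/perms/times, -i itemizes each difference,
# --dry-run guarantees nothing is actually copied.
result = subprocess.run(
    ["rsync", "-ai", "--dry-run", SRC, DST],
    capture_output=True, text=True)

for line in result.stdout.splitlines():
    print(line)  # each itemized line is an entry that differs
Add --checksum (and --checksum-choice on newer rsync versions) to the argument list to compare file contents instead of size and date.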
And your knee-jerk refusal to answer was an incredibly obvious way to say “I didn’t know about it” while trying to save face.
Now, do you want to do this right (which my question was absolutely on-topic and relevant to), or do you want to talk around it in ways that preserve your ego and avoid acknowledging that you did anything wrong?
It is exploring the concept of hashing an entire tree to do file compares relative to another tree. Comparing trees is not exactly what I did, but it is close, and I think the concept of comparing trees is simple enough to cover hashing an entire tree.
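To make that concrete, here is a minimal sketch of one way to hash an entire tree, assuming two trees count as equal when every relative path and its file contents match; tree_hash() is a hypothetical helper, not anyone's posted code.
Code:
import hashlib
import os

def tree_hash(root):
    # One digest over the whole tree: feed in each relative path
    # followed by that file's bytes, in a deterministic order.
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # keep the walk order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
    return h.hexdigest()

# Two trees compare equal iff their digests match:
# print(tree_hash("/data/tree-a") == tree_hash("/data/tree-b"))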
Which specific question do you want to get a specific answer to?
OTOH, someone might try to use any of this in some context that needs strong security.
Quote:
Originally Posted by EdGr
md5 and sha1 have broken security anyway.
They do, indeed. But for what I'm doing, I don't need security, unless someone is trying to feed me false results, and if they can plant files to attempt that, then I have other issues.
Quote:
Originally Posted by EdGr
BTW, this thread has caused me to re-examine hash algorithms.
My goal is to see how people who have probably worked with hashing deal with multiple hashes and with groupings of hashes. It's a learning goal.
Sure, we can do that.
Here's mine. My hash is just the file size. As I've said, it's the fastest possible hash. Why? Because it's just filesystem metadata: it takes the same amount of time to "calculate" (i.e. read) no matter how large the file is.
That hash is good enough because the point is to come up with a short list to eyeball, which you'd be doing anyway even if you were using a slower hash that produces fewer collisions. The script takes a few minutes to go through a couple of TB.
And no, it would not improve the program even to add the option of using a slower hash. If a directory is full of large files, calculating those hashes would take more time than just eyeballing the short list the script generates.
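As a minimal sketch of that size-as-hash idea (files_list is assumed to be the list of paths gathered earlier, as in the snippet below):
Code:
import os
from collections import defaultdict

# Group paths by st_size: pure metadata, no file reads at all.
by_size = defaultdict(list)
for path in files_list:
    try:
        by_size[os.stat(path).st_size].append(path)
    except OSError:
        pass  # unreadable entry; skip it

# Any size seen more than once goes on the short list to eyeball.
for size, paths in by_size.items():
    if len(paths) > 1:
        print("{} bytes: {}".format(size, ", ".join(paths)))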
Code:
hashes_full = {}  # full-file hash -> first path seen with that hash
for filename in files_list:
    try:
        # get_hash() is assumed to be defined earlier in the script
        full_hash = get_hash(filename, first_chunk_only=False)
    except OSError:
        continue  # unreadable file; skip it
    duplicate = hashes_full.get(full_hash)
    if duplicate:
        print("Duplicate found: {} and {}".format(filename, duplicate))
    else:
        hashes_full[full_hash] = filename
etc.
Alter it to suit your needs.
What if filename and duplicate are paths to different inodes, and each inode has its own set of hardlinked paths? Should I repeat the message for each of duplicate's paths?
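One way to frame that question, as a minimal sketch: collapse hardlinked names into inode groups first, then report duplicates between groups only. candidate_paths is a hypothetical list of paths that already hashed equal.
Code:
import os
from collections import defaultdict

# Paths sharing (st_dev, st_ino) are hardlinks to the same data,
# so they form one group rather than a set of duplicates.
by_inode = defaultdict(list)
for path in candidate_paths:
    st = os.stat(path)
    by_inode[(st.st_dev, st.st_ino)].append(path)

# Only distinct inode groups are true duplicate copies; print one
# representative per group instead of repeating every hardlink name.
groups = list(by_inode.values())
for i, group in enumerate(groups):
    for other in groups[i + 1:]:
        print("Duplicate data: {} and {}".format(group[0], other[0]))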
So your algorithm trades away security I don't need (for comparing arbitrary files on the same system, or on systems of the same owner, assuming those systems are secure) in exchange for speed, so the extra CPU time for a secure hash is not spent. I can see the advantage of that. But I need such an algorithm to be free and commonly available (which means it needs to be free and open source). For now, MD5, SHA1, etc. will suffice.
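On availability, one hedged note: everything in hashlib.algorithms_guaranteed ships with CPython itself, which is free and open source, so MD5 and SHA1 are available anywhere Python runs.
Code:
import hashlib

# Algorithms guaranteed to exist on every CPython build:
print(sorted(hashlib.algorithms_guaranteed))

# e.g. both of these work with no third-party packages:
print(hashlib.md5(b"some file contents").hexdigest())
print(hashlib.sha1(b"some file contents").hexdigest())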