There are a lot of ways to solve this. Recently I have started using rsync with --dry-run and verbose output. The reason is that it has tons of options, it's very fast, and it's well tested and documented. Think about devices, soft/hard links, sparse files, file ownership/permissions/ACLs, and so on. It works over ssh, with --bwlimit if I want it. And do I really need to compare or hash the files when the size and date are the same? It's easy to play with all the options, like --checksum and --checksum-choice. But it's not written in Python.
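If you do still want to drive it from Python, here is a minimal sketch under the assumption that letting rsync do the comparing is acceptable; SRC and DST are hypothetical paths, not anything from this thread.
Code:
import subprocess

# Hypothetical trees to compare; adjust to your setup.
SRC = "/data/tree-a/"
DST = "/data/tree-b/"

# -a preserves links/perms/times, -i itemizes each difference,
# --dry-run guarantees nothing is actually copied.
result = subprocess.run(
    ["rsync", "-ai", "--dry-run", SRC, DST],
    capture_output=True, text=True)

for line in result.stdout.splitlines():
    print(line)  # each itemized line is an entry that differs
Add --checksum (and --checksum-choice on newer rsync versions) to the argument list to compare file contents instead of size and date.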
And your knee-jerk refusal to answer was an incredibly obvious way to say “I didn’t know about it” while trying to save face.
Now, do you want to do this right (which my question was absolutely on-topic and relevant to), or do you want to talk around it in ways that preserve your ego and avoid acknowledging that you did anything wrong?
It is exploring the concept of hashing an entire tree to do file compares relative to another tree. Comparing trees is not exactly what I did, but it is close, and I think the concept of comparing trees is simple enough to cover hashing an entire tree.
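To make that concrete, here is a minimal sketch of one way to hash an entire tree, assuming two trees count as equal when every relative path and its file contents match; tree_hash() is a hypothetical helper, not anyone's posted code.
Code:
import hashlib
import os

def tree_hash(root):
    # One digest over the whole tree: feed in each relative path
    # followed by that file's bytes, in a deterministic order.
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # keep the walk order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
    return h.hexdigest()

# Two trees compare equal iff their digests match:
# print(tree_hash("/data/tree-a") == tree_hash("/data/tree-b"))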
Which specific question do you want to get a specific answer to?
OTOH, someone might try to use any of this in some context that needs strong security.
Quote:
Originally Posted by EdGr
md5 and sha1 have broken security anyway.
They do, indeed. But for what I'm doing, I don't need security, unless someone is trying to feed me false results, and if they can plant files to attempt that, then I have other issues.
Quote:
Originally Posted by EdGr
BTW, this thread has caused me to re-examine hash algorithms.
My goal is to see how people who have probably worked with hashing deal with multiple hashes and with groupings of hashes. It's a learning goal.
Sure, we can do that.
Here's mine. My hash is just the file size. As I've said, it's the fastest possible hash. Why? Because it's just filesystem metadata: it takes the same amount of time to "calculate" (i.e. read) no matter how large the file is.
That hash is good enough because the point is to come up with a short list to eyeball, which you'd be doing anyway even if you were using a slower hash that produces fewer collisions. The script takes a few minutes to go through a couple of TB.
And no, it would not improve the program even to add the option of using a slower hash. If a directory is full of large files, calculating those hashes would take more time than just eyeballing the short list the script generates.
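As a minimal sketch of that size-as-hash idea (files_list is assumed to be the list of paths gathered earlier, as in the snippet below):
Code:
import os
from collections import defaultdict

# Group paths by st_size: pure metadata, no file reads at all.
by_size = defaultdict(list)
for path in files_list:
    try:
        by_size[os.stat(path).st_size].append(path)
    except OSError:
        pass  # unreadable entry; skip it

# Any size seen more than once goes on the short list to eyeball.
for size, paths in by_size.items():
    if len(paths) > 1:
        print("{} bytes: {}".format(size, ", ".join(paths)))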
Code:
hashes_full = {}  # full-file hash -> first path seen with that hash
for filename in files_list:
    try:
        # get_hash() is assumed to be defined earlier in the script
        full_hash = get_hash(filename, first_chunk_only=False)
    except OSError:
        continue  # unreadable file; skip it
    duplicate = hashes_full.get(full_hash)
    if duplicate:
        print("Duplicate found: {} and {}".format(filename, duplicate))
    else:
        hashes_full[full_hash] = filename
etc.
Alter it to suit your needs.
What if filename and duplicate are paths to different inodes, and each inode has its own set of hardlinked paths? Should I repeat the message for each of duplicate's paths?
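One way to frame that question, as a minimal sketch: collapse hardlinked names into inode groups first, then report duplicates between groups only. candidate_paths is a hypothetical list of paths that already hashed equal.
Code:
import os
from collections import defaultdict

# Paths sharing (st_dev, st_ino) are hardlinks to the same data,
# so they form one group rather than a set of duplicates.
by_inode = defaultdict(list)
for path in candidate_paths:
    st = os.stat(path)
    by_inode[(st.st_dev, st.st_ino)].append(path)

# Only distinct inode groups are true duplicate copies; print one
# representative per group instead of repeating every hardlink name.
groups = list(by_inode.values())
for i, group in enumerate(groups):
    for other in groups[i + 1:]:
        print("Duplicate data: {} and {}".format(group[0], other[0]))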
So your algorithm trades away security I don't need (for comparing arbitrary files on the same system, or on systems of the same owner, assuming those systems are secure) in exchange for speed, so the extra CPU time for a secure hash is not spent. I can see the advantage of that. But I need such an algorithm to be free and commonly available (which means it needs to be free and open source). For now, MD5, SHA1, etc. will suffice.
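On availability, one hedged note: everything in hashlib.algorithms_guaranteed ships with CPython itself, which is free and open source, so MD5 and SHA1 are available anywhere Python runs.
Code:
import hashlib

# Algorithms guaranteed to exist on every CPython build:
print(sorted(hashlib.algorithms_guaranteed))

# e.g. both of these work with no third-party packages:
print(hashlib.md5(b"some file contents").hexdigest())
print(hashlib.sha1(b"some file contents").hexdigest())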