Programming
This forum is for all programming questions. The question does not have to be directly related to Linux, and any language is fair game.
Did that include implementation of a hashing algorithm? If so, I'd agree, speed is essential.
My intent in Python is to use the existing Python "hashlib" implementation. It hashed my home directory (9.54+ GB) in under 15 seconds using the SHA-1 algorithm (the entire tree was cached in RAM at that point, so this was a measure of wall-clock time).
Even for things I expect to implement in C, I often find Python a nice way to prototype.
Speed is important for a frequently used tool. In this case, a hash algorithm in C, assembly, microcode, or a hardware gate array can certainly help. I believe I was running Python's bindings to the reference C implementations.
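As a rough sketch of the hashlib approach described above (function names and the chunk size are illustrative assumptions, not from the original post), walking a tree and SHA-1 hashing every regular file might look like this:

```python
import hashlib
import os

def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of one file, read in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def hash_tree(root):
    """Yield (path, sha1) for every regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                yield path, sha1_of_file(path)
```

Since hashlib's digests are implemented in C, a loop like this is mostly I/O-bound once the files are cached in RAM, which matches the timings reported in the thread.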
Post #7 is an example that does exactly what you are asking for: it spits out duplicate files across two directories based on their hash. It will also find duplicate files within a single directory if you give it one argument.
It's fairly fast because it first checks a starting chunk of each file and only reads further if those match, so it doesn't have to read every file in full each time.
It builds a dict keyed on file size in bytes, then one keyed on the hash of the first 1 KB, and finally a dict keyed on the full-file hash.
did that include implementation of a hashing algorithm? if so, i'd agree, speed is essential.
Yes, I wrote my own hash, which runs a lot faster than md5sum or sha1sum for files cached in RAM (the scans tree is ~9 GB).
Code:
% time fileinfo -c scans 1> /dev/null
real 0m2.408s
user 0m1.547s
sys 0m0.861s
% time sha1sum `find scans -type f` 1> /dev/null
real 0m18.818s
user 0m17.947s
sys 0m0.872s
% time md5sum `find scans -type f` 1> /dev/null
real 0m15.041s
user 0m14.071s
sys 0m0.970s
I'm not sure why you wrote that when os.walk exists.
I just knew that if I posted sources, someone would slide off topic. I should have known it would be you. If you really want to know why, you should ask on a new thread and refer me to it. If you don't do that, I'll assume you don't really care. My bet is that you don't really care.
And your knee-jerk refusal to answer was an incredibly obvious way to say “I didn’t know about it” while trying to save face.
Now, do you want to do this right, which my question was absolutely on-topic and relevant to, or do you want to talk around it in ways that preserve your ego and avoid acknowledging that you did anything wrong?