Old 07-07-2023, 06:57 AM   #31
Guttorm
Senior Member
 
Registered: Dec 2003
Location: Trondheim, Norway
Distribution: Debian and Ubuntu
Posts: 1,453


Hi

There are a lot of ways to solve this. Recently I have started using rsync with --dry-run and verbose output. The reason is that it has tons of options, it's very fast, and it's well tested and documented. Think about devices, soft/hard links, sparse files, file ownership/permissions/ACLs and so on. It works over ssh with --bwlimit if I want. And do I really need to compare or hash the files when the size and date are the same? It's easy to play with all the options, like --checksum and --checksum-choice. But it's not written in Python.
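
A minimal sketch of that approach in Python (my illustration, not Guttorm's setup; the tree paths are hypothetical) is to drive rsync in dry-run mode and read its itemized output:

Code:
# Compare two trees with rsync without copying anything.
# -a: archive mode (permissions, ownership, links, times)
# -n: dry run; -i: itemize each difference
import subprocess

result = subprocess.run(
    ["rsync", "-ani", "--checksum", "tree_a/", "tree_b/"],
    capture_output=True, text=True, check=True,
)
# Each output line describes a file that differs or is missing in tree_b.
for line in result.stdout.splitlines():
    print(line)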
 
Old 07-07-2023, 07:47 AM   #32
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,226

Quote:
Originally Posted by Skaperen View Post
I just knew that if I posted sources, someone would slide off topic.
I can't get over how strange a comment this is. It is not "off-topic" to talk about the code that you wrote to solve the problem being discussed.

Last edited by dugan; 07-07-2023 at 08:03 AM.
 
Old 07-07-2023, 01:10 PM   #33
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Quote:
Originally Posted by dugan View Post
I have not gotten off topic. Get a grip.

And your knee-jerk refusal to answer was an incredibly obvious way to say “I didn’t know about it” while trying to save face.

Now, do you want to do this right, which my question was absolutely on-topic and relevant to, or do you want to talk around it in ways that preserve your ego and avoid acknowledging that you did anything wrong?
It is exploring the concept of hashing an entire tree to do file compares relative to another tree. Comparing trees is not exactly what I did, but it is close, and I think comparing trees is a simple enough setting in which to deal with hashing an entire tree.
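
As an illustration of the concept (a sketch of mine, not code from this thread): hash each file, then hash the sorted (relative path, digest) pairs, so two trees can be compared by a single digest.

Code:
import hashlib
import os

def tree_hash(root):
    # Hash every file under root, then combine the per-file digests
    # into one digest for the whole tree.
    entries = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            entries.append((os.path.relpath(path, root), h.hexdigest()))
    top = hashlib.sha256()
    for rel, digest in sorted(entries):
        top.update(rel.encode() + b"\0" + digest.encode() + b"\n")
    return top.hexdigest()

Two trees with identical contents and layout then yield equal tree_hash() values.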

Which specific question do you want a specific answer to?
 
Old 07-07-2023, 01:23 PM   #34
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Quote:
Originally Posted by EdGr View Post
I don't need a secure hash.
For this project, I might not need one either.

OTOH, someone might try to use any of this in some context that needs strong security.

Quote:
Originally Posted by EdGr View Post
MD5 and SHA1 have broken security anyway.
They do, indeed. But for what I'm doing, I don't need security unless someone is trying to give me false results, and if they can plant files to attempt that, then I have other issues.

Quote:
Originally Posted by EdGr View Post
BTW, this thread has caused me to re-examine hash algorithms.
But you invented a high-speed, proprietary one?
 
Old 07-07-2023, 01:27 PM   #35
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Quote:
Originally Posted by pan64 View Post
I'm totally lost. What is the goal now?
My goal is to see what people who have probably worked with hashing do to deal with multiple hashes and groupings of hashes. It's a learning goal.
 
Old 07-07-2023, 01:35 PM   #36
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Quote:
Originally Posted by dugan View Post
I can't get over how strange a comment this is. It is not "off-topic" to talk about the code that you wrote to solve the problem being discussed.
Yes, it is.

The original topic does not change unless the OP explicitly expands it. Another poster could expand it too, given a good explanation why.

You should start a new thread. Or are you fearing a lawsuit from one of those social media titans?
 
Old 07-07-2023, 02:08 PM   #37
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,226

Quote:
Originally Posted by Skaperen View Post
My goal is to see what people who have probably worked with hashing do to deal with multiple hashes and groupings of hashes. It's a learning goal.
Sure, we can do that.

Here's mine. My hash is just the file size. As I've said, it's the fastest possible hash. Why? Because it's just filesystem metadata. It takes the same amount of time to "calculate" (i.e., read) no matter how large the file is.

https://gist.github.com/duganchen/1e917c11fce44267b4c4

That hash is good enough because the point is to come up with a shortlist to eyeball, which you'd be doing anyway even with a slower hash that produces fewer collisions. The script takes a few minutes to go through a couple of TB.

The structure it uses to group hashes is:

Code:
{
    hash1: [file1, file2, file3],
    hash2: [file3, file4]
}
And no, it would not improve the program to even add the option of using a slower hash. If a directory is full of large files, then it would take more time to calculate those hashes than to just eyeball the shortlist that it generates.
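
A minimal sketch of the same idea (my illustration, not dugan's gist itself) that builds that structure with file size as the hash:

Code:
import os
from collections import defaultdict

def group_by_size(root):
    # size -> [paths]; reading st_size touches only filesystem
    # metadata, never the file contents.
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.stat(path).st_size
            except OSError:
                continue  # vanished or unreadable file; skip it
            groups[size].append(path)
    # Only sizes shared by two or more files form the shortlist.
    return {s: paths for s, paths in groups.items() if len(paths) > 1}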

Last edited by dugan; 07-07-2023 at 02:19 PM.
 
Old 07-07-2023, 03:57 PM   #38
EdGr
Member
 
Registered: Dec 2010
Location: California, USA
Distribution: I run my own OS
Posts: 998

Quote:
Originally Posted by Skaperen View Post
But you invented a high-speed, proprietary one?
The techniques are well-known.
Ed
 
Old 07-08-2023, 03:37 AM   #39
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,864

Quote:
Originally Posted by Skaperen View Post
My goal is to see what people who have probably worked with hashing do to deal with multiple hashes and groupings of hashes. It's a learning goal.
You will probably find this useful: https://www.samba.org/~tridge/phd_thesis.pdf
See chapter 3, on the rsync algorithm.
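
For a flavor of what chapter 3 describes, here is a minimal illustration (mine, not the thesis code) of the weak rolling checksum: the pair (a, b) over a fixed window can be updated in O(1) as the window slides one byte, which is what makes the rsync algorithm cheap. The real algorithm also reduces a and b modulo 2^16, omitted here for clarity.

Code:
def rolling_checksums(data: bytes, window: int):
    # a = plain sum over the window; b = position-weighted sum.
    a = sum(data[:window])
    b = sum((window - i) * data[i] for i in range(window))
    yield a, b
    for i in range(window, len(data)):
        # Slide the window by one byte in constant time.
        a += data[i] - data[i - window]
        b += a - window * data[i - window]
        yield a, b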
 
Old 07-09-2023, 05:21 PM   #40
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Quote:
Originally Posted by teckk View Post

If they match, say so
Code:
for filename in files_list:
    try:
        full_hash = get_hash(filename, first_chunk_only=False)
    except OSError:
        continue  # unreadable file; skip it
    duplicate = hashes_full.get(full_hash)
    if duplicate:
        print("Duplicate found: {} and {}".format(filename, duplicate))
    else:
        hashes_full[full_hash] = filename
etc.

Alter it to suit your needs.
What if filename and duplicate are paths to different inodes, and each inode has its own set of hardlinked paths? Should I repeat the message for each of duplicate's paths?
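
One way to avoid repeated messages (a sketch of mine, not from teckk's post) is to collapse hardlinked paths first: paths with the same (st_dev, st_ino) refer to one inode and need to be hashed and reported only once.

Code:
import os
from collections import defaultdict

def group_by_inode(paths):
    # (device, inode) -> [paths]; hardlinked paths share one entry.
    inodes = defaultdict(list)
    for path in paths:
        st = os.stat(path)
        inodes[(st.st_dev, st.st_ino)].append(path)
    return inodes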
 
Old 07-09-2023, 05:34 PM   #41
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Quote:
Originally Posted by EdGr View Post
I don't need a secure hash.
So your algorithm trades away security I do not need, when comparing arbitrary files on the same system or on systems of the same owner (assuming those systems are secure), in exchange for speed, so the extra CPU for a secure hash is not spent. I can see the advantage of that. But I need such an algorithm to be free and commonly available (which means free and open source). For now, MD5, SHA1, etc. will suffice.
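
Those are all available through Python's standard hashlib, so a sketch along these lines covers the "free and commonly available" requirement (the function and its defaults are my example, not an agreed interface):

Code:
import hashlib

def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    # Any algorithm name hashlib knows works here: "md5", "sha1",
    # "sha256", ... Reads the file in chunks to bound memory use.
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()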
 
Old 07-09-2023, 09:56 PM   #42
EdGr
Member
 
Registered: Dec 2010
Location: California, USA
Distribution: I run my own OS
Posts: 998

That's fine. You know your requirements.
Ed
 
  

