LinuxQuestions.org
Share your knowledge at the LQ Wiki.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 07-04-2023, 07:14 PM   #1
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684
Blog Entries: 31

Rep: Reputation: 176Reputation: 176
hashing a tree of files


using Python3, i want to hash the contents of a tree of files to verify that all the files in one tree are the same as all the files in another tree even if the files are in random order, as long as the names are correct with the correct content matching the names. what can you suggest?
 
Old 07-04-2023, 07:23 PM   #2
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 26,637

Rep: Reputation: 7965Reputation: 7965Reputation: 7965Reputation: 7965Reputation: 7965Reputation: 7965Reputation: 7965Reputation: 7965Reputation: 7965Reputation: 7965Reputation: 7965
Quote:
Originally Posted by Skaperen View Post
using Python3, i want to hash the contents of a tree of files to verify that all the files in one tree are the same as all the files in another tree even if the files are in random order, as long as the names are correct with the correct content matching the names. what can you suggest?
I'd sort the files by name/path, MD5 each one, and store the pair in an array. Do same for other tree, then compare the two arrays. Depending on the number of files, of course; a few hundred should be a huge problem, but if you're talking thousands, it could take a bit, or may require a temp file. Loads of ways to approach.
 
Old 07-04-2023, 08:21 PM   #3
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751
I'd use hashes in Perl for a couple of reasons:

1. no sorting reqd.
2. easier to deal with both lists not being identical ie files in list A not in list B and vice versa.
 
1 members found this post helpful.
Old 07-04-2023, 09:49 PM   #4
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,866
Blog Entries: 1

Rep: Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869
You can compare the content of two files without calculating any hash. Hint: cmp(1)
 
Old 07-04-2023, 10:35 PM   #5
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,226

Rep: Reputation: 5320Reputation: 5320Reputation: 5320Reputation: 5320Reputation: 5320Reputation: 5320Reputation: 5320Reputation: 5320Reputation: 5320Reputation: 5320Reputation: 5320
Or “diff -b”

I’d iterate over the” tree, looks for each file in “the other tree” with the same filename, and do the comparison. The code for that is fairly obvious.

One good speed optimization is to not even look at the file contents unless the files have the same sizes. Reading sizes is a very fast operation.

Last edited by dugan; 07-04-2023 at 10:40 PM.
 
Old 07-05-2023, 02:13 AM   #6
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,863

Rep: Reputation: 7311Reputation: 7311Reputation: 7311Reputation: 7311Reputation: 7311Reputation: 7311Reputation: 7311Reputation: 7311Reputation: 7311Reputation: 7311Reputation: 7311
dupfinder?
 
Old 07-05-2023, 10:20 AM   #7
teckk
LQ Guru
 
Registered: Oct 2004
Distribution: Arch
Posts: 5,138
Blog Entries: 6

Rep: Reputation: 1827Reputation: 1827Reputation: 1827Reputation: 1827Reputation: 1827Reputation: 1827Reputation: 1827Reputation: 1827Reputation: 1827Reputation: 1827Reputation: 1827
Quote:
using Python3, i want to hash the contents of a tree of files to verify that all the files in one tree are the same
findDup.py
Code:
#!/usr/bin/python

from collections import defaultdict
import hashlib
import os
import sys

def chunk_reader(fobj, chunk_size=1024):
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

def get_hash(filename, first_chunk_only=False, hash=hashlib.sha1):
    hashobj = hash()
    file_object = open(filename, 'rb')

    if first_chunk_only:
        hashobj.update(file_object.read(1024))
    else:
        for chunk in chunk_reader(file_object):
            hashobj.update(chunk)
    hashed = hashobj.digest()

    file_object.close()
    return hashed

def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes_by_size = defaultdict(list)
    hashes_on_1k = defaultdict(list)
    hashes_full = {}

    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                try:
                    full_path = os.path.realpath(full_path)
                    file_size = os.path.getsize(full_path)
                    hashes_by_size[file_size].append(full_path)
                except (OSError,):
                    continue

    for size_in_bytes, files in hashes_by_size.items():
        if len(files) < 2:
            continue

        for filename in files:
            try:
                small_hash = get_hash(filename, first_chunk_only=True)
                hashes_on_1k[(small_hash, size_in_bytes)].append(filename)
            except (OSError,):
                continue

    for __, files_list in hashes_on_1k.items():
        if len(files_list) < 2:
            continue

        for filename in files_list:
            try: 
                full_hash = get_hash(filename, first_chunk_only=False)
                duplicate = hashes_full.get(full_hash)
                if duplicate:
                    print("Duplicate found: {} and {}".format(filename, duplicate))
                else:
                    hashes_full[full_hash] = filename
            except (OSError,):
                continue
                
if __name__ == "__main__":
    if sys.argv[1:]:
        check_for_duplicates(sys.argv[1:])
    else:
        print("Please pass the paths to check as parameters to the script")
And that would be run with path arguments
Code:
python ./findDup.py /path/to/dir1 /path/to/dir2
 
Old 07-05-2023, 11:29 AM   #8
EdGr
Member
 
Registered: Dec 2010
Location: California, USA
Distribution: I run my own OS
Posts: 998

Rep: Reputation: 470Reputation: 470Reputation: 470Reputation: 470Reputation: 470
I did that in C. Speed is required.
Ed
 
Old 07-05-2023, 11:47 AM   #9
teckk
LQ Guru
 
Registered: Oct 2004
Distribution: Arch
Posts: 5,138
Blog Entries: 6

Rep: Reputation: 1827Reputation: 1827Reputation: 1827Reputation: 1827Reputation: 1827Reputation: 1827Reputation: 1827Reputation: 1827Reputation: 1827Reputation: 1827Reputation: 1827
Post that would you. If it's not too big. I want to see what approach you took.
 
Old 07-05-2023, 11:55 AM   #10
EdGr
Member
 
Registered: Dec 2010
Location: California, USA
Distribution: I run my own OS
Posts: 998

Rep: Reputation: 470Reputation: 470Reputation: 470Reputation: 470Reputation: 470
The algorithm is similar to md5sum, sort, and uniq. The code is not free.
Ed
 
Old 07-05-2023, 02:42 PM   #11
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176Reputation: 176
Quote:
Originally Posted by NevemTeve View Post
You can compare the content of two files without calculating any hash. Hint: cmp(1)
how do i use cmp(1) to compare every file (in a directory or tree or whatever) to all the others? the reason i would use a hash is so i don't have to read a file again every time i need to compare it to another file.
 
Old 07-05-2023, 02:49 PM   #12
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,226

Rep: Reputation: 5320Reputation: 5320Reputation: 5320Reputation: 5320Reputation: 5320Reputation: 5320Reputation: 5320Reputation: 5320Reputation: 5320Reputation: 5320Reputation: 5320
With loops. Duh.

And you can have your code memoize the comparison results. That means that you store them so that you know not to calculate them twice.

That said, your idea of using hashes would work well and you should just go ahead and code it.

Last edited by dugan; 07-05-2023 at 02:52 PM.
 
Old 07-05-2023, 02:55 PM   #13
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176Reputation: 176
Quote:
Originally Posted by dugan View Post
Or “diff -b”

I’d iterate over the” tree, looks for each file in “the other tree” with the same filename, and do the comparison. The code for that is fairly obvious.

One good speed optimization is to not even look at the file contents unless the files have the same sizes. Reading sizes is a very fast operation.
the reason i would use a hash is to be able to "compare" the file to many other files. once i have a digest of its contents, comparing digests is nearly as trivial as comparing sizes. compare two digests completes as "unequal" if the first part (potentially 32 bits or 64 bits) is unequal. ISTM that at this point, comparing size is a waste of time though if i were using C i might find a way to integrate such comparisons together.

i am curious how “diff -b” adds to any of this.
 
Old 07-05-2023, 02:58 PM   #14
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,866
Blog Entries: 1

Rep: Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869Reputation: 1869
As a start, show the current state of your program.
 
Old 07-05-2023, 03:08 PM   #15
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176Reputation: 176
Quote:
Originally Posted by pan64 View Post
dupfinder?
all i find online is a Windows version. my project is for Linux. i do not find a Linux version in the online search results (first 25 results).

i would be curious if it considers a file hard linked under 2 or more names to be duplicates or not. what i have put together, so far, does not. it does this by recording inodes as file "names", although it also records all links and shows (hard) links of each inode in the results.
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
LXer: How to Use tree to Show a Directory Tree in the Linux Terminal LXer Syndicated Linux News 0 06-26-2022 05:03 PM
Look for duplicates in folder tree A and folder tree B, then delete the duplicates only from A. grumpyskeptic Linux - Software 7 10-27-2018 10:23 PM
what is the difference strict binary tree nad extended binary tree . tushar_pandey Programming 1 07-18-2012 11:30 AM
the bible = the tree of the knowledge of good and evil () Jesus = the tree of life Michael111 General 2 04-14-2004 04:28 PM
need a P-Tree (Patricia Tree) library manaskb Programming 1 11-02-2002 06:15 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 11:49 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration