Old 07-05-2023, 03:11 PM   #16
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,226

Rep: Reputation: 5320

Well, what’s the use case here? If it’s to reclaim disk space, then the answer would be no: hard links to the same file don’t take up extra space, so you don’t care about them.
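
To illustrate the hard-link point: paths that share the same (st_dev, st_ino) pair are hard links to the same data on disk, so a dedupe pass can detect and skip them up front. A minimal sketch (the helper name is made up for illustration):
Code:
import os
from collections import defaultdict

def group_hard_links(paths):
    """Group paths by (device, inode); any group with more than one
    path is a set of hard links to the same underlying file."""
    by_inode = defaultdict(list)
    for p in paths:
        st = os.stat(p)
        by_inode[(st.st_dev, st.st_ino)].append(p)
    return [g for g in by_inode.values() if len(g) > 1]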
 
Old 07-05-2023, 03:35 PM   #17
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by EdGr View Post
I did that in C. Speed is required.
Ed
did that include implementing a hashing algorithm? if so, i'd agree that speed is essential.

my intent in Python is to use the existing Python "hashlib" implementation. it hashed my home directory (9.54+ GB) in under 15 seconds using the SHA1 algorithm (i had the entire tree cached in RAM at that point, and this was a measure of wall-clock time).
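
A minimal sketch of that kind of measurement, assuming a plain os.walk over the tree (the actual script isn't shown in the thread):
Code:
import hashlib
import os
import time

def hash_tree(top, algo=hashlib.sha1, chunk_size=1 << 20):
    """Hash every regular file under top; return total bytes hashed."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip fifos, sockets, broken symlinks
            h = algo()
            try:
                with open(path, 'rb') as f:
                    for chunk in iter(lambda: f.read(chunk_size), b''):
                        h.update(chunk)
                        total += len(chunk)
            except OSError:
                continue  # unreadable or vanished; skip it
    return total

start = time.monotonic()
nbytes = hash_tree(os.path.expanduser('~'))
print("hashed %.2f GB in %.1f s wall clock"
      % (nbytes / 1e9, time.monotonic() - start))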

even for things i expect to implement in C, i often find Python a nice way to prototype.

speed is important for a frequently used tool. in this case, a hash implementation in C, assembly, microcode, or a hardware gate array can certainly help. i believe Python's hashlib wraps the reference C implementations of these algorithms, so the hashing itself already runs at C speed.
 
Old 07-05-2023, 03:44 PM   #18
teckk
LQ Guru
 
Registered: Oct 2004
Distribution: Arch
Posts: 5,138
Blog Entries: 6

Rep: Reputation: 1827
Post #7 is an example that does exactly what you are asking for. It spits out duplicate files in two directories based on their hash, and it will also find duplicate files in a single directory if you give it one argument.

It's fairly fast because it first checks the starting chunk of each file and only goes further on a match, so it doesn't have to read the whole file each time.

Make a dict keyed on file size in bytes, then a dict keyed on (hash of the first 1K, size in bytes), then a dict of the full-file hashes:
Code:
import hashlib
import os
from collections import defaultdict

def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes_by_size = defaultdict(list)  # size_in_bytes -> [paths]
    hashes_on_1k = defaultdict(list)    # (hash1k, size_in_bytes) -> [paths]
    hashes_full = {}                    # full-file hash -> first path seen
Get all files that have the same size; they are the collision candidates:
Code:
for path in paths:
    for dirpath, dirnames, filenames in os.walk(path):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
If the path is a symlink (a soft link), this dereferences it, changing the value to the actual target file:
Code:
full_path = os.path.realpath(full_path)
file_size = os.path.getsize(full_path)
hashes_by_size[file_size].append(full_path)
For all files with the same size, hash their first 1024 bytes only:
Code:
for size_in_bytes, files in hashes_by_size.items():
    if len(files) < 2:
        continue  # file size is unique, no need to spend CPU cycles on it
    for filename in files:
        small_hash = get_hash(filename, first_chunk_only=True)
        hashes_on_1k[(small_hash, size_in_bytes)].append(filename)
For all files that collide on the first-1024-byte hash, hash the full file; full-hash collisions are duplicates:
Code:
for __, files_list in hashes_on_1k.items():
    if len(files_list) < 2:
        continue    # Hash of first 1k file bytes is unique, no need to spend CPU cycles on it
If they match, say so
Code:
for filename in files_list:
    try:
        full_hash = get_hash(filename, first_chunk_only=False)
    except OSError:
        continue  # file unreadable or vanished; skip it
    duplicate = hashes_full.get(full_hash)
    if duplicate:
        print("Duplicate found: {} and {}".format(filename, duplicate))
    else:
        hashes_full[full_hash] = filename
etc.

Alter it to suit your needs.
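
The fragments call a get_hash() helper that isn't shown in this excerpt. A minimal version consistent with how it's used above (the name and signature come from the calls; the body is an assumption) would be:
Code:
import hashlib

def get_hash(filename, first_chunk_only=False, hash=hashlib.sha1):
    hashobj = hash()
    with open(filename, 'rb') as f:
        if first_chunk_only:
            hashobj.update(f.read(1024))   # hash just the first 1 KiB
        else:
            for chunk in iter(lambda: f.read(1024), b''):
                hashobj.update(chunk)      # hash the whole file
    return hashobj.digest()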
 
Old 07-05-2023, 03:52 PM   #19
teckk
LQ Guru
 
Registered: Oct 2004
Distribution: Arch
Posts: 5,138
Blog Entries: 6

Rep: Reputation: 1827
Here is another example that keeps a log file.
Code:
#!/usr/bin/python

#Find duplicate files in directory tree.

import sys
import os
import hashlib

def chunk_reader(fobj, chunk_size=1024):
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes = {}
    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                hashobj = hash()
                try:
                    with open(full_path, 'rb') as fobj:
                        for chunk in chunk_reader(fobj):
                            hashobj.update(chunk)
                    file_id = (hashobj.digest(), os.path.getsize(full_path))
                    duplicate = hashes.get(file_id, None)
                    with open('Duplog.txt', 'a') as f:
                        if duplicate:
                            f.write(full_path + ' <-AND-> ' + duplicate + '\n')
                            print("Duplicate found: %s <-AND-> %s" % (full_path, duplicate))
                        else:
                            hashes[file_id] = full_path
                except OSError:
                    continue

if sys.argv[1:]:
    check_for_duplicates(sys.argv[1:])
else:
    print("Please pass the paths to check as parameters to the script")
Either script could also be altered to spit out the files that do not match.
 
Old 07-05-2023, 04:00 PM   #20
EdGr
Member
 
Registered: Dec 2010
Location: California, USA
Distribution: I run my own OS
Posts: 998

Rep: Reputation: 470
Quote:
Originally Posted by Skaperen View Post
did that include implementation of a hashing algorithm? if so, i'd agree, speed is essential.
Yes, I wrote my own hash, which runs a lot faster than md5sum or sha1sum for files cached in RAM (the scans directory is ~9GB).

Code:
% time fileinfo -c scans 1> /dev/null

real	0m2.408s
user	0m1.547s
sys	0m0.861s

% time sha1sum `find scans -type f` 1> /dev/null

real	0m18.818s
user	0m17.947s
sys	0m0.872s

% time md5sum `find scans -type f` 1> /dev/null

real	0m15.041s
user	0m14.071s
sys	0m0.970s
Ed
 
Old 07-05-2023, 05:35 PM   #21
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by EdGr View Post
The algorithm is similar to md5sum, sort, and uniq. The code is not free.
Ed
what is the advantage of that algorithm?
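
For reference, the classic shell version of that md5sum-sort-uniq approach (a sketch of the general idea, not Ed's actual tool): `uniq -w32 -D` keeps only lines whose first 32 characters (the md5 digest) repeat.
Code:
# hash every file, sort by digest, and print only repeated digests
find scans -type f -print0 | xargs -0 md5sum | sort | uniq -w32 -D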
 
Old 07-05-2023, 05:59 PM   #22
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by teckk View Post
Post that would you. If it's not too big. I want to see what approach you took.
it's big, so i'll give this link: http://ipal.net/try/list_dup_files_not_linked.py

it's not finished and could end up with big changes.
 
Old 07-05-2023, 06:11 PM   #23
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by Skaperen View Post
it's big, so i'll give this link: http://ipal.net/try/list_dup_files_not_linked.py

it's not finished and could end up with big changes.
if you want to try to run this, you might need my file tree walker generator at http://ipal.net/python/ftrgen.py

Last edited by Skaperen; 07-05-2023 at 06:13 PM.
 
Old 07-05-2023, 07:50 PM   #24
EdGr
Member
 
Registered: Dec 2010
Location: California, USA
Distribution: I run my own OS
Posts: 998

Rep: Reputation: 470
Quote:
Originally Posted by Skaperen View Post
what is the advantage of that algorithm?
The files don't need to be compared. If the 128-bit checksums match, the files are assumed to be identical.

The probability of a false positive is far below the hardware soft error rate, and the time required to find a collision by brute force is far beyond the age of the universe.
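
As a rough sanity check on that claim (my arithmetic, not part of the original post): by the birthday bound, the chance of any accidental collision among n files with a 128-bit hash is roughly n(n-1)/2 divided by 2^128. Even a billion files gives about 10^-21:
Code:
# birthday-bound estimate of an accidental 128-bit hash collision
n = 10**9                        # a billion files
p = n * (n - 1) / 2 / 2**128     # expected number of colliding pairs
print("collision probability ~ %.3g" % p)   # ~ 1.5e-21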
Ed

Last edited by EdGr; 07-05-2023 at 07:55 PM.
 
Old 07-05-2023, 09:13 PM   #25
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,226

Rep: Reputation: 5320
Quote:
Originally Posted by Skaperen View Post
if you want to try to run this, you might need my file tree walker generator at http://ipal.net/python/ftrgen.py
I'm not sure why you wrote that when os.walk exists.
 
Old 07-06-2023, 01:12 PM   #26
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by EdGr View Post
The files don't need to be compared. If the 128-bit checksums match, the files are assumed to be identical.

The probability of a false positive is far below the hardware soft error rate, and the time required is far above the age of the universe.
Ed
sorry, i meant: what is the advantage of your particular algorithm over existing ones like md5 and sha1?
 
Old 07-06-2023, 01:18 PM   #27
Skaperen
Senior Member
 
Registered: May 2009
Location: center of singularity
Distribution: Xubuntu, Ubuntu, Slackware, Amazon Linux, OpenBSD, LFS (on Sparc_32 and i386)
Posts: 2,684

Original Poster
Blog Entries: 31

Rep: Reputation: 176
Quote:
Originally Posted by dugan View Post
I'm not sure why you wrote that when os.walk exists.
i just knew that if i posted sources, someone would slide off topic. i should have known it would be you. if you really want to know why, you should ask on a new thread and refer me to it. if you don't do that, i'll assume you don't really care. my bet is that you don't really care.
 
Old 07-06-2023, 02:36 PM   #28
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,226

Rep: Reputation: 5320
I have not gotten off topic. Get a grip.

And your knee-jerk refusal to answer was an incredibly obvious way to say “I didn’t know about it” while trying to save face.

Now, do you want to do this right (which is exactly what my question was on-topic and relevant to), or do you want to talk around it in ways that preserve your ego and avoid acknowledging that you did anything wrong?

Last edited by dugan; 07-06-2023 at 02:55 PM.
 
Old 07-06-2023, 02:59 PM   #29
EdGr
Member
 
Registered: Dec 2010
Location: California, USA
Distribution: I run my own OS
Posts: 998

Rep: Reputation: 470
Quote:
Originally Posted by Skaperen View Post
sorry, i meant what is the advantage of your particular algorithm invention over others like md5 and sha1?
I don't need a secure hash.

md5 and sha1 have broken security anyway.

BTW, this thread has caused me to re-examine hash algorithms.
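
Since the scripts earlier in the thread take the hash constructor as a parameter, trying a different hashlib algorithm is a one-line change. A sketch reusing check_for_duplicates() from post #19 with blake2b (in hashlib since Python 3.6):
Code:
import hashlib
import sys

# same duplicate scan, but with blake2b instead of sha1;
# for dedupe work the hash choice mostly affects speed, not correctness
check_for_duplicates(sys.argv[1:], hash=hashlib.blake2b)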
Ed
 
Old 07-07-2023, 01:07 AM   #30
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,863

Rep: Reputation: 7311
I'm totally lost. What is the goal now?
 