Old 07-08-2003, 07:58 AM   #1
topche
LQ Newbie
 
Registered: Sep 2002
Distribution: SuSE
Posts: 23

Duplicate values in file


I have this file:
2066e8ff7db17bf1ead4dcea8afc73fc album02/0011_G.sized.jpg
fabd9a5c8672148995ef66a9d80a31dd album02/0011_G.thumb.jpg
01e731735d7c2f0c0142a265b470eadf album02/0023_G.jpg
7c503161cbabcf9ec49ca3feead1bb6b album02/0023_G.sized.jpg
03014a1ca0e56b282801d9c27f618f58 album02/0023_G.thumb.jpg
...............................
...............................

How can I find the duplicate md5sums in this file?
 
Old 07-08-2003, 08:12 AM   #2
TheLinuxDuck
Member
 
Registered: Sep 2002
Location: Tulsa, OK
Distribution: Slack, baby!
Posts: 349

topche, you're going to have to do more explaining if you want help. Are you saying that there are already duplicate md5 sums in the file and you want to know how they got there? Or do you want to FIND duplicates? Or do you want to CREATE duplicates in the file? Or is it something else entirely?
 
Old 07-08-2003, 08:19 AM   #3
topche
LQ Newbie
 
Registered: Sep 2002
Distribution: SuSE
Posts: 23

Original Poster
I want to know which md5sums appear more than once in the file.
(Sorry for my bad English.)
 
Old 07-08-2003, 08:39 AM   #4
TheLinuxDuck
Member
 
Registered: Sep 2002
Location: Tulsa, OK
Distribution: Slack, baby!
Posts: 349

So, you want to find duplicates in the file.. what programming language are you working in?

(What is your native tongue? I could probably figure out what you're after if you're having a tough time explaining it in English... and don't worry about the English thing.. just take your time, if need be.)
 
Old 07-08-2003, 08:42 AM   #5
topche
LQ Newbie
 
Registered: Sep 2002
Distribution: SuSE
Posts: 23

Original Poster
Bash, PHP, and very little Perl.
I can do this with Bash, but it will be very slow.
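Something like this is what I mean (just a sketch; assume the list is in the file passed as the first argument). It re-reads the whole file once for every line:
Code:
#!/bin/bash
# naive approach: re-scan the whole file for every line (O(n^2)),
# which is why it crawls on big files
while read -r sum file; do
  if [ "$(grep -c "^$sum " "$1")" -gt 1 ]; then
    echo "duplicate: $sum $file"
  fi
done < "$1"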
 
Old 07-08-2003, 09:19 AM   #6
TheLinuxDuck
Member
 
Registered: Sep 2002
Location: Tulsa, OK
Distribution: Slack, baby!
Posts: 349

With Perl: load the file, put the md5 sums into a hash as keys, and if a key already exists, you know you have a dupe.

Here's some code:
Code:
#!/usr/bin/perl
use strict;
use warnings;

#
# assume that filenames of files to parse are passed in as CL args
#
while(my $filename = shift @ARGV) {
  #
  # test to see if file can open
  #
  if(open IN, $filename) {
    #
    # create temp hash for sum checking, and dupe count
    #
    my($sums) = {};
    my($dups) = 0;
    #
    # loop through lines of file
    #
    while(my $line =  <IN>) {
      chomp $line;
      next if($line =~ /^\s*$/);  # skip blanks
      #
      #  make sure line starts with md5 sum
      #
      if($line =~ /^([0-9a-fA-F]{32})/) {
        my($hash) = $1;
        if(defined($sums->{$hash})) {
          print "hash '$hash' exists already\n";
          ++$dups;
        }
        else {
          $sums->{$hash} = 1;
        }
      }
      else {
        print "line '$line' in invalid format\n";
      }
    }
    close IN;
    print "$dups duplicates found in '$filename'\n";
  }
  else {
    print "Couldn't open '$filename': $!\n";
  }
}
That's a barebones check, which doesn't do anything with the info aside from telling you about it. (=
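If you save it as, say, finddups.pl (pick whatever name you like), pass your file(s) on the command line. With made-up data where the first sum from your sample appeared twice, you'd see something like:
Code:
$ perl finddups.pl hashes
hash '2066e8ff7db17bf1ead4dcea8afc73fc' exists already
1 duplicates found in 'hashes'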

Hope that helps!
 
Old 07-08-2003, 11:03 AM   #7
Hko
Senior Member
 
Registered: Aug 2002
Location: Groningen, The Netherlands
Distribution: ubuntu
Posts: 2,530

The bash script below will print each line whose md5sum is duplicated once, prefixed with the number of occurrences of that md5sum.

Pass the name of the file with the MD5 sums as the first argument to the script.
Code:
#!/bin/bash

echo "Duplicate MD5 sums in $1:"
sort -t' ' -k1,1 "$1" | uniq -c -d -t' ' -W1
And note: this will not be slow.
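With made-up data, if the sum on the first line of your sample occurred twice, the output would look something like:
Code:
Duplicate MD5 sums in hashes:
      2 2066e8ff7db17bf1ead4dcea8afc73fc album02/0011_G.sized.jpg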

Last edited by Hko; 07-08-2003 at 11:10 AM.
 
Old 07-08-2003, 11:28 AM   #8
TheLinuxDuck
Member
 
Registered: Sep 2002
Location: Tulsa, OK
Distribution: Slack, baby!
Posts: 349

Nice Hko! (=

Btw, what version of uniq are you running? Mine is textutils 2.0, and I don't see the -t' ' or -W1 options available with it..

And with a slight modification:
Code:
cat hashes | grep --invert-match "^$" | sort -t' ' -k1,1 | uniq -c -d
it will ignore empty lines (there may be a better way than using grep with an inverted match, but that's the only way I could think of at the moment.. )
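One alternative that drops both the grep and the extra cat: awk 'NF' prints only the lines that contain at least one field, so blank lines get skipped:
Code:
awk 'NF' hashes | sort -t' ' -k1,1 | uniq -c -d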

Last edited by TheLinuxDuck; 07-08-2003 at 11:38 AM.
 
Old 07-08-2003, 02:52 PM   #9
Hko
Senior Member
 
Registered: Aug 2002
Location: Groningen, The Netherlands
Distribution: ubuntu
Posts: 2,530

I have the GNU uniq 5.0 from the Debian sarge (testing) package coreutils 5.0-4.
Code:
$ uniq --help
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Discard all but one of successive identical lines from INPUT (or
standard input), writing to OUTPUT (or standard output).

Mandatory arguments to long options are mandatory for short options too.
  -c, --count           prefix lines by the number of occurrences
  -d, --repeated        only print duplicate lines
  -D, --all-repeated[=delimit-method] print all duplicate lines
                        delimit-method={none(default),prepend,separate}
                        Delimiting is done with blank lines.
  -f, --skip-fields=N   avoid comparing the first N fields
  -i, --ignore-case     ignore differences in case when comparing
  -s, --skip-chars=N    avoid comparing the first N characters
  -t, --separator=SEP   use SEParator to delimit fields
  -u, --unique          only print unique lines
  -w, --check-chars=N   compare no more than N characters in lines
  -W, --check-fields=N  compare no more than N fields in lines
      --help     display this help and exit
      --version  output version information and exit

A field is a run of whitespace, then non-whitespace characters, unless
a SEParator is given.  Fields are skipped before chars.
(Now, reading this last sentence, I suppose the -t' ' was not necessary.)

The -W1 option for uniq is really needed: because of this option, only the md5 hash is compared to determine uniqueness. Without -W1 the whole line is compared, so two identical files with different names will not be seen as duplicates. This can be solved by using cut/sed/awk to strip off the filenames from the end of the lines, but then the filenames won't be output either.
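For example, a quick sketch of that cut variant (the duplicate counts survive, the filenames don't):
Code:
cut -d' ' -f1 hashes | sort | uniq -c -d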

I used the same options on sort only in the hope that this would perform (a very little bit) better. In this case sort doesn't really need those options.

I didn't realize there were versions of uniq without field-selecting options. Does your uniq allow selecting a range of characters from the lines to compare? (see the -w option in the uniq --help output above)
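If it does, here is a sketch that should give the same result with plain -w, since an MD5 sum is always exactly the first 32 characters of the line:
Code:
sort hashes | uniq -c -d -w32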

Last edited by Hko; 07-08-2003 at 03:08 PM.
 
Old 07-08-2003, 03:00 PM   #10
TheLinuxDuck
Member
 
Registered: Sep 2002
Location: Tulsa, OK
Distribution: Slack, baby!
Posts: 349

coreutils.. well, that explains why textutils hasn't been updated in a while!! (= Guess I need to start paying closer attention. (=
 
Old 07-09-2003, 04:44 AM   #11
topche
LQ Newbie
 
Registered: Sep 2002
Distribution: SuSE
Posts: 23

Original Poster
Thank you very, very much, guys. I used this:
cat hashes | grep --invert-match "^$" | sort -t' ' -k1,1 | uniq -c -d

and this:

#!/usr/bin/perl
use strict;
use warnings;

my %list;    # md5 sum => first filename seen with that sum
while (<>) {
    my ($md5, $file) = /(\w+)\s+(.*)/;
    next unless defined $md5;    # skip lines that don't match
    if (exists $list{$md5}) {
        print "dupe: ", $list{$md5}, " and ", $file, "\n";
        next;
    }
    $list{$md5} = $file;
}
 
  

