LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 07-17-2006, 04:54 AM   #1
morrolan
Member
 
Registered: Sep 2003
Location: Manchester UK
Posts: 264

Rep: Reputation: 30
bash: identify duplicate MS Outlook .msg files?


Hi all,
I'm looking for a way to de-duplicate MS Outlook .msg files, contained within one main directory but with sub-directories going down several levels.

The problem is - they ALL have to be compared to each other. I was considering creating a .csv file with each filename and md5sum on a line, and then importing to OOo or Excel and working out how to identify duplicates there.

My problem is, I'm looking at about 120,000 - 180,000 emails involved in a high-profile litigation case.

I did some testing with md5sums of emails sent to various people across various servers (identical body etc.), but it appears the headers get modified in transit, which rules md5 out as a viable option (my theory at least).

If anyone could provide any help, it would be greatly appreciated.

Morrolan
 
Old 07-17-2006, 05:28 AM   #2
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
When you say duplicates, do you mean that the files are identical or that the body is the same?
Would two or more duplicate .msg files have the identical names?

There is a "cmp" program that can compare two files byte by byte. If duplicates are byte-identical files, consider making a table of filenames and file sizes; then you only need to compare files of identical size.

One way of doing this would be:
find ./path/to/msgs -iname "*.msg" -fprintf messagelist.tsv "%s\t%P\n"
This produces a tab separated value file of two columns: the size in bytes and the filename.
find ./path/to/msgs -iname "*.msg" -printf "%s\t%P\n" | sort -k1 -n >messagelist.tsv

The second example produces the same list, numerically sorted by size, although I don't know how the sort program would handle such a large stream. You could use a comma to separate the fields instead if you want; if you do, you need to tell sort about it with its field separator option (-t,).

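A minimal sketch of the size-first idea above, assuming GNU find: only files of equal byte size can be identical, so bucket by size and confirm the candidates with a byte-level cmp. The directory and sample files here are invented for illustration.

```shell
#!/bin/sh
# Bucket files by size, then confirm same-size candidates with cmp.
set -e
dir=$(mktemp -d)
printf 'hello world\n' > "$dir/a.msg"
printf 'hello world\n' > "$dir/b.msg"   # byte-identical to a.msg
printf 'different\n'   > "$dir/c.msg"

# Size<TAB>name table, numerically sorted so equal sizes become adjacent.
find "$dir" -iname "*.msg" -printf "%s\t%P\n" | sort -n > "$dir/messagelist.tsv"

# Pair up adjacent same-size rows, then let cmp deliver the verdict.
dupes=$(
    awk -F'\t' '$1 == prev_size { print prev_name "\t" $2 }
                { prev_size = $1; prev_name = $2 }' "$dir/messagelist.tsv" |
    while IFS="$(printf '\t')" read -r f1 f2; do
        if cmp -s "$dir/$f1" "$dir/$f2"; then
            echo "duplicate: $f1 == $f2"
        fi
    done
)
echo "$dupes"
rm -rf "$dir"
```

Sorting first means each file is only ever compared against its size neighbours, not against all 180,000 others.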
 
Old 07-17-2006, 05:41 AM   #3
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,415
Blog Entries: 55

Rep: Reputation: 3600
Each sent email gets a unique header, so the first thing I would do is look for duplicate "^Message-Id" headers.
Make a copy of the tree (don't work on the originals) and use a converter like msgconv or libOE to turn the .msg files into mbox or plaintext format before you let procmail+formail loose on them.
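A minimal sketch of the Message-Id idea: once the .msg files have been converted to plaintext (e.g. with a converter like msgconv, as suggested above), a repeated "Message-Id:" header marks a duplicate mail. The sample messages and ids below are invented for illustration.

```shell
#!/bin/sh
# Extract the Message-Id from each converted mail, then report any id
# that occurs in more than one file.
set -e
dir=$(mktemp -d)
printf 'Message-Id: <111@example.com>\nSubject: a\n\nbody\n' > "$dir/1.txt"
printf 'Message-Id: <111@example.com>\nSubject: a\n\nbody\n' > "$dir/2.txt"
printf 'Message-Id: <222@example.com>\nSubject: b\n\nbody\n' > "$dir/3.txt"

# One "id<TAB>file" line per message, then count ids seen more than once.
dupes=$(
    awk 'tolower($1) == "message-id:" { print $2 "\t" FILENAME; nextfile }' "$dir"/*.txt |
    sort |
    awk -F'\t' '{ n[$1]++; f[$1] = f[$1] " " $2 }
                END { for (id in n) if (n[id] > 1) print "duplicate id " id ":" f[id] }'
)
echo "$dupes"
rm -rf "$dir"
```

This only catches true resends of the same message; copies that were re-generated with fresh ids would still need a body comparison.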
 
Old 07-19-2006, 04:42 AM   #4
morrolan
Member
 
Registered: Sep 2003
Location: Manchester UK
Posts: 264

Original Poster
Rep: Reputation: 30
Quick Update:

Thanks a lot, that helped, although I did have to rework it ever so slightly to work properly.

Here is what I ended up with:

Code:
find ./path/to/msgs -iname "*.msg" -fprintf messagelist.csv "%s,%P\r"
That gave me the size in bytes (%s), a comma to separate the fields (useful for import to Excel or OOo), the file's path with the starting directory removed (%P), and a carriage return (\r) after every line.

For my test area of 9,800 emails, it took approximately 1 second to process on a 2.6 GHz Celeron with 512 MB RAM.
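The duplicate-size check can also be done in the shell instead of in Excel/OOo. The CSV rows below are invented samples; the real input would be the messagelist.csv produced above.

```shell
#!/bin/sh
# Flag every CSV row whose size column has been seen before: these are
# the duplicate *candidates* that still need a byte-level check.
set -e
csv=$(mktemp)
cat > "$csv" <<'EOF'
1043,inbox/a.msg
2200,inbox/b.msg
1043,sent/c.msg
EOF

# awk prints a line whenever its size was already counted once.
candidates=$(awk -F',' 'seen[$1]++' "$csv")
echo "$candidates"
rm -f "$csv"
```

Anything awk prints here then goes to cmp or md5sum for confirmation; unique sizes never need to be read at all.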
 
Old 07-19-2006, 04:55 AM   #5
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239
I've got a Perl script for finding duplicates.
It works by using cksum.
I use it for finding duplicate image and mp3 files.
It works OK. I'll dig it out if you like.

But as you say, it won't differentiate between same body and different headers.
 
Old 07-19-2006, 05:03 AM   #6
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239
I reckon I would parse all the mails, split header and body, and cksum the message bodies.
Then you could easily find the duplicates.

It's almost too easy
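A minimal sketch of that split-and-cksum idea: delete everything up to the first blank line (the header block) and checksum only the body. The sample messages are invented; real input would be the converted .msg files.

```shell
#!/bin/sh
# Checksum message bodies only, so differing headers no longer hide
# duplicates. sed '1,/^$/d' strips lines 1 through the first blank line.
set -e
dir=$(mktemp -d)
printf 'Received: server-a\nSubject: hi\n\nsame body\n'  > "$dir/1.txt"
printf 'Received: server-b\nSubject: hi\n\nsame body\n'  > "$dir/2.txt"
printf 'Received: server-c\nSubject: hi\n\nother body\n' > "$dir/3.txt"

# Emit "checksum file" per message, then group by checksum.
body_dupes=$(
    for f in "$dir"/*.txt; do
        echo "$(sed '1,/^$/d' "$f" | cksum | awk '{ print $1 }') $f"
    done |
    sort |
    awk '{ n[$1]++; f[$1] = f[$1] " " $2 }
         END { for (c in n) if (n[c] > 1) print "same body:" f[c] }'
)
echo "$body_dupes"
rm -rf "$dir"
```

Same grouping trick as before, just keyed on the body checksum instead of the file size.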
 
Old 07-20-2006, 11:24 AM   #7
morrolan
Member
 
Registered: Sep 2003
Location: Manchester UK
Posts: 264

Original Poster
Rep: Reputation: 30
bigearsbilly, I really would appreciate it if you dug out that Perl script, as the only thing I can use is bash, and badly at that!

The one thing I have to be aware of is that these are legal documents and all code used to parse them will need to be released to the courts, so if you don't want your code released, please do not feel bad for not digging it out for me.
 
Old 07-20-2006, 11:26 AM   #8
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239
no problem.
I'll have a look tonight and send it on.
 
Old 07-20-2006, 06:05 PM   #9
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Mint, Armbian, NetBSD, Puppy, Raspbian
Posts: 3,515

Rep: Reputation: 239
here's a starting point...

Code:
#!/usr/bin/perl -w

# script for finding and listing duplicate, regular files
#
# a duplicate is defined as two or more files having identical
# results when using the 'cksum' program

use Uniq;    # CPAN module providing dups()
$" = "\n";   # print interpolated arrays one element per line

my @sums;
my @files;

# cksum prints "checksum size filename" for each regular file
open LIST, "find . -type f -exec cksum {} \\; |" or die "find: $!";

while (<LIST>) {
    chomp;
    s/ /-/;                       # join checksum and size: "cksum-size"
    s/ /:/;                       # separate that key from the filename
    push @files, $_;              # "cksum-size:filename"
    push @sums, (split /:/)[0];   # just the "cksum-size" key
}
close LIST;

@files = sort @files;

# dups() expects sorted input and returns each key that occurs
# more than once, counted once
my @dups = dups sort @sums;
print "# found ", scalar @dups, " duplicates (counting once, not for each multiple)\n";
foreach my $dup (@dups) {
    print "# ==================\n";
    # anchor and quote the key so similar checksums don't collide
    my @L = grep m/^\Q$dup\E:/, @files;
    print "@L\n";
}
 
Old 10-26-2006, 09:33 AM   #10
matthewhardwick
Member
 
Registered: Oct 2003
Location: CA
Posts: 165

Rep: Reputation: 30
I am looking for a way to find duplicate MP3s. I run a radio station, and we have several thousand duplicated MP3s, but the names are all different. Is there any way someone can point me in the right direction?
 
Old 10-26-2006, 10:46 AM   #11
Guttorm
Senior Member
 
Registered: Dec 2003
Location: Trondheim, Norway
Distribution: Debian and Ubuntu
Posts: 1,453

Rep: Reputation: 447
Hi

Just posted a little bash script I once wrote - it can be used to find duplicate files; the filetype doesn't matter.

http://www.linuxquestions.org/questi...48#post2478148
 
  

