bash: identify duplicate MS Outlook .msg files?

morrolan · 07-17-2006, 04:54 AM

Hi all,
I'm looking for a way to de-duplicate MS Outlook .msg files, contained within one main directory but with sub-directories going down several levels.

The problem is - they ALL have to be compared to each other. I was considering creating a .csv file with each filename and md5sum on a line, and then importing to OOo or Excel and working out how to identify duplicates there.

My problem is, I'm looking at about 120,000 - 180,000 emails involved in a high-profile litigation case.

I did some testing of md5sums of emails sent to various people across various servers (identical body etc) but it appears that the headers are modified and this appears to prevent md5 from being a viable option (my theory at least).

If anyone could provide any help, it would be greatly appreciated.

Morrolan

jschiwal · 07-17-2006, 05:28 AM

When you say duplicates, do you mean that the files are identical or that the body is the same?
Would two or more duplicate .msg files have the identical names?

There is a "cmp" program that can compare two files. If the duplicates are byte-for-byte identical, consider making a table of files and file sizes. Then you will only need to compare files of identical sizes.

One way of doing this would be: find ./path/to/msgs -iname "*.msg" -fprintf messagelist.tsv "%s\t%P\n"
This would produce a tab-separated value file of two columns: the size in bytes and the filename.
find ./path/to/msgs -iname "*.msg" -printf "%s\t%P\n" | sort -k1 -n >messagelist.tsv

The second example produces a numerically sorted list. However, I don't know how commands like sort and awk would handle such a large stream. You could use a comma to separate fields if you want. If you do, you need to use the -t option (-t,) to tell the sort command to split fields on the comma.
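
As a rough sketch of the size-first idea above (not something posted in the thread): the hypothetical helper list_same_size lists only the .msg files that share their size with at least one other file, and confirm_pair wraps cmp for the byte-for-byte check. It assumes bash, GNU find (for -printf), and file names without tabs or newlines.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative, not from the thread.

# list_same_size DIR
# Print "size<TAB>path" for every .msg file under DIR whose size is
# shared with at least one other .msg file. Assumes GNU find.
list_same_size() {
    local dir=$1
    find "$dir" -type f -iname '*.msg' -printf '%s\t%p\n' |
        sort -t $'\t' -k1,1n |
        awk -F '\t' '
            # Same size as the previous line: print the whole group.
            NR > 1 && $1 == prev_size {
                if (!printed) print prev_line
                print
                printed = 1
            }
            # A new size starts a new group.
            NR == 1 || $1 != prev_size { printed = 0 }
            { prev_size = $1; prev_line = $0 }
        '
}

# confirm_pair FILE1 FILE2
# Succeed only if the two files are byte-for-byte identical.
confirm_pair() {
    cmp -s -- "$1" "$2"
}
```

Something like `list_same_size ./path/to/msgs > candidates.tsv` would give the short list; only files that appear there need to be fed to cmp. Files with a unique size cannot be identical to anything else, so they drop out before any content is read.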


unSpawn · 07-17-2006, 05:41 AM

Each sent email gets a unique header, so the first thing I would do is look for duplicate "^Message-Id" headers.
Make a copy of the tree (don't work on originals) and use a converter like msgconv or libOE to turn .msg to mbox or plaintext format before you let procmail+formail loose on it.
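
A minimal sketch of the Message-Id check, under stated assumptions: the .msg files have already been converted to one plain-text message per file, and those files end in .eml (a hypothetical naming choice, not from the thread). It uses plain awk rather than formail, and it does not handle a Message-Id whose value is folded onto a continuation line.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; assumes converted plain-text messages named *.eml.

# message_id FILE
# Print the Message-Id header value of one plain-text message.
message_id() {
    awk '
        /^\r?$/ { exit }                  # blank line ends the header block
        tolower($0) ~ /^message-id:/ {
            sub(/^[^:]*:[ \t]*/, "")      # drop the header name
            sub(/[ \t\r]+$/, "")          # drop trailing whitespace and CR
            print
            exit
        }
    ' "$1"
}

# duplicate_ids DIR
# Print "message-id<TAB>path" for every ID found in more than one file.
duplicate_ids() {
    local dir=$1 f id
    find "$dir" -type f -name '*.eml' -print0 |
        while IFS= read -r -d '' f; do
            id=$(message_id "$f")
            [ -n "$id" ] && printf '%s\t%s\n' "$id" "$f"
        done |
        sort |
        awk -F '\t' '
            { count[$1]++; lines[NR] = $0; ids[NR] = $1 }
            END {
                for (i = 1; i <= NR; i++)
                    if (count[ids[i]] > 1) print lines[i]
            }
        '
}
```

Running `duplicate_ids ./converted` on the copied tree would list the groups; messages without a Message-Id header are skipped and would need one of the other checks.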

morrolan · 07-19-2006, 04:42 AM

Quick Update:

Thanks a lot, that helped, although I did have to rework it ever so slightly to work properly.

Here is what I ended up with:

Code:

find ./path/to/msgs -iname "*.msg" -fprintf messagelist.csv "%s,%P\r\n"

That gave me the size in bytes (%s), a comma to separate fields (useful for import to Excel or OOo), %P for the file's path with the starting directory removed, and \r\n for a carriage return and line feed after every line.

For my test area of 9,800 emails, it took approximately 1 second to process on a 2.6 GHz Celeron with 512 MB of RAM.

bigearsbilly · 07-19-2006, 04:55 AM

I got a perl script for finding duplicates.
It works by using cksum.
I use it for finding duplicate image and mp3 files.
It works OK. I'll dig it out if you like.

But as you say, it won't catch messages that have the same body but different headers.

bigearsbilly · 07-19-2006, 05:03 AM

I reckon I would parse all the mails, split header and body, and cksum the message bodies.
Then you could easily find the duplicates.

It's almost too easy
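
A rough sketch of the split-and-checksum idea, with assumptions stated up front: it works on messages already converted to plain text (one per file, named *.eml here purely for illustration), treats everything after the first blank line as the body, strips carriage returns so CRLF and LF copies compare equal, and uses md5sum instead of cksum. The function names are hypothetical, and it relies on GNU uniq for -w and -D.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; assumes converted plain-text messages named *.eml.

# body_sum FILE
# Print the MD5 of the message body (everything after the first blank line).
body_sum() {
    awk 'in_body { print; next } /^\r?$/ { in_body = 1 }' "$1" |
        tr -d '\r' |                      # assumption: line endings are not significant
        md5sum | cut -d ' ' -f 1
}

# duplicate_bodies DIR
# Print "checksum<TAB>path" for every body that occurs in more than one file.
duplicate_bodies() {
    local dir=$1 f
    find "$dir" -type f -name '*.eml' -print0 |
        while IFS= read -r -d '' f; do
            printf '%s\t%s\n' "$(body_sum "$f")" "$f"
        done |
        sort |
        uniq -w 32 -D                     # GNU uniq: compare only the 32 hex digits
}
```

Matching checksums only mark candidates. For documents going to court it would be worth confirming each group by comparing the extracted bodies directly.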

morrolan · 07-20-2006, 11:24 AM

bigearsbilly, I really would appreciate if you dug out that perl script, as the only thing I can use is bash, and badly at that!

The one thing I have to be aware of is that these are legal documents and all code used to parse them will need to be released to the courts, so if you don't want your code released, please do not feel bad for not digging it out for me.

bigearsbilly · 07-20-2006, 11:26 AM

no problem.
I'll have a look tonight and send it on.

bigearsbilly · 07-20-2006, 06:05 PM

here's a starting point...

Code:

#!/usr/bin/perl -w

# script for finding and listing duplicate, regular files
# 
# a duplicate is defined as two or more files having identical 
# results when using the 'cksum' program

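use Uniq ;      # CPAN module that supplies dups(), used below on a sorted list
$" = "\n" ;     # print interpolated arrays one element per line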

my @sums ;
my @files;


open LIST , "find . -type f -exec cksum {\} \\;| " or die "cannot run find: $!" ;

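while (<LIST>) {
    chomp;
    # cksum prints "CRC SIZE FILENAME"; rewrite it as "CRC-SIZE:FILENAME"
    s/ /-/ ;
    s/ /:/ ;
    push @files, "$_";
    push @sums, (split /:/)[0];   # keep only the "CRC-SIZE" key
}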
@sums = sort @sums;
@files = sort @files;
# print "@sums\n" ;
# print "@files\n" ;


@sums =  dups sort @sums;
# # print "Dupes\n======\n@sums\n====\n" ;
print "# found ",  scalar @sums , " duplicates (counting once, not for each multiple )\n" ;
foreach $dup (@sums) {
    print "# ==================\n";
    # anchor the match so a key cannot match inside another key or a file name
    my @L = grep  m/^\Q$dup\E:/, @files  ;
    print "@L\n";
}
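
For anyone limited to bash, here is a minimal sketch of the same cksum-based grouping. It is an approximation rather than a port of the Perl script: the function name find_dups is made up, and it assumes file names without newlines. cksum prints "CRC SIZE FILENAME", so two files are grouped when both the CRC and the size match.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the name is illustrative, not from the thread.

# find_dups [DIR]
# Group regular files under DIR (default: .) whose CRC and size both match.
find_dups() {
    local dir=${1:-.}
    find "$dir" -type f -exec cksum {} + |
        sort -k1,1n -k2,2n |
        awk '
            { key = $1 "-" $2 }           # same "CRC-SIZE" key as the Perl script
            key == prev {
                if (!shown) { print "# =================="; print prev_line }
                print
                shown = 1
            }
            key != prev { shown = 0 }
            { prev = key; prev_line = $0 }
        '
}
```

A CRC is only 32 bits, so with well over 100,000 files an accidental match is possible. Treat each group as a candidate and confirm it with cmp before calling the files identical.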

matthewhardwick · 10-26-2006, 09:33 AM

I am looking for a way to find duplicate MP3s. I run a radio station, and we have several thousand duplicated MP3s, but the names are all different. Is there any way someone can point me in the right direction?

Guttorm · 10-26-2006, 10:46 AM

Hi

Just posted a little bash script I once wrote. It can be used to find duplicate files, and the file type doesn't matter.

http://www.linuxquestions.org/questi...48#post2478148