Hi all,
I'm looking for a way to de-duplicate MS Outlook .msg files, contained within one main directory but with sub-directories going down several levels.
The problem is - they ALL have to be compared to each other. I was considering creating a .csv file with each filename and md5sum on a line, and then importing to OOo or Excel and working out how to identify duplicates there.
My problem is, I'm looking at about 120,000 - 180,000 emails involved in a high-profile litigation case.
I did some testing of md5sums of emails sent to various people across various servers (identical body etc) but it appears that the headers are modified and this appears to prevent md5 from being a viable option (my theory at least).
If anyone could provide any help, it would be greatly appreciated.
When you say duplicates, do you mean that the files are identical or that the body is the same?
Would two or more duplicate .msg files have identical names?
There is a "cmp" program that can compare two files. If two duplicates are identical, consider making a table of file and file sizes. Then you will only need to compare files of identical sizes.
One way of doing this would be: find ./path/to/msgs -iname "*.msg" -fprintf messagelist.tsv "%s\t%P\n"
This would produce a tab-separated value file of two columns: the size in bytes and the filename. The trailing \n is needed because -fprintf does not add a newline by itself.
find ./path/to/msgs -iname "*.msg" -printf "%s\t%P\n" | sort -k1 -n >messagelist.tsv
The second example produces a numerically sorted list. However, I don't know how commands like sort and awk would handle such a large stream. You could use a comma to separate the fields if you want. If you do, you need to give sort the -t option (for example -t,) so that it splits fields on the comma instead of on whitespace.
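The size-then-cmp idea above can be sketched in bash roughly as follows. This is only a sketch under assumptions: it relies on GNU find (for -printf), the function names list_same_size and find_identical are made up for illustration, and it assumes no filename contains a newline. It only catches files that are byte-for-byte identical.

```shell
#!/bin/bash
# list_same_size DIR (hypothetical helper)
# Print "size<TAB>path" for every .msg file under DIR whose size is
# shared with at least one other .msg file, sorted by size.
list_same_size() {
    find "$1" -type f -iname '*.msg' -printf '%s\t%p\n' |
        sort -n |
        awk -F'\t' '
            { size[NR] = $1; line[NR] = $0; count[$1]++ }
            END {
                for (i = 1; i <= NR; i++)
                    if (count[size[i]] > 1) print line[i]
            }'
}

# find_identical DIR (hypothetical helper)
# Within each group of same-size files, compare each file with the
# earlier files of that size using cmp, and print
# "first_copy<TAB>duplicate" for every byte-identical match.
find_identical() {
    local prev_size="" size path other
    local -a group=()
    while IFS=$'\t' read -r size path; do
        if [ "$size" != "$prev_size" ]; then
            group=()            # new size: start a new comparison group
            prev_size=$size
        fi
        for other in "${group[@]}"; do
            if cmp -s "$other" "$path"; then
                printf '%s\t%s\n' "$other" "$path"
                break           # report each duplicate once
            fi
        done
        group+=("$path")
    done < <(list_same_size "$1")
}
```

After sourcing the functions, something like find_identical ./path/to/msgs >identical.tsv would write the pairs to a file. Because cmp is only run inside a size group, most files should never be read at all.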
Each sent email gets a unique header, so the first thing I would do is look for duplicate "^Message-Id" headers.
Make a copy of the tree (don't work on the originals) and use a converter like msgconv or libOE to turn the .msg files into mbox or plaintext format before you let procmail+formail loose on it.
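As a rough illustration of the Message-Id idea, here is a bash sketch that does not use procmail or formail at all. It assumes the .msg files have already been converted to plain text, one message per file, with a .txt extension (that extension is my assumption, as are the function names message_id and duplicate_message_ids). It also assumes the Message-Id value sits on the same line as the header name and that no filename contains a newline.

```shell
#!/bin/bash
# message_id FILE (hypothetical helper)
# Print the value of the first Message-Id header in FILE.
# Only the header block is scanned: the first blank line stops the search.
message_id() {
    awk '
        /^\r?$/ { exit }                    # blank line ends the headers
        tolower($0) ~ /^message-id:/ {
            sub(/^[^:]*:[ \t]*/, "")        # drop the header name
            sub(/\r$/, "")                  # drop a trailing CR, if any
            print
            exit
        }' "$1"
}

# duplicate_message_ids DIR (hypothetical helper)
# Print "id<TAB>path" for every .txt file under DIR whose Message-Id
# is shared with at least one other file.
duplicate_message_ids() {
    find "$1" -type f -name '*.txt' -print |
        while IFS= read -r f; do
            id=$(message_id "$f")
            if [ -n "$id" ]; then
                printf '%s\t%s\n' "$id" "$f"
            fi
        done |
        sort |
        awk -F'\t' '
            { id[NR] = $1; line[NR] = $0; n[$1]++ }
            END {
                for (i = 1; i <= NR; i++)
                    if (n[id[i]] > 1) print line[i]
            }'
}
```

Messages without a Message-Id header are skipped silently here, so they would still need a separate check.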
That gave me the size in bytes (%s), a comma to separate fields (useful for import to Excel or OOo), %P for the filename with the starting directory removed, and \r for a carriage return after every line.
For my test area of 9,800 emails, it took approximately 1 second to process on a 2.6 GHz Celeron with 512 MB RAM.
I've got a Perl script for finding duplicates.
It works by using cksum.
I use it for finding duplicate image and mp3 files.
It works OK. I'll dig it out if you like.
But as you say, it won't catch messages that have the same body but different headers.
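For anyone who only has bash, the checksum approach can be sketched like this. To be clear, this is not the Perl script mentioned above, just a guess at the same general idea, and it uses md5sum (from the first post) where that script uses cksum. The function name find_checksum_dupes is made up. It assumes GNU coreutils and filenames without newlines or backslashes, since md5sum escapes those in its output.

```shell
#!/bin/bash
# find_checksum_dupes DIR [PATTERN] (hypothetical helper)
# Print the "checksum  path" lines from md5sum for every file under DIR
# (optionally limited to names matching PATTERN, case-insensitively)
# whose checksum is shared with at least one other file.
find_checksum_dupes() {
    find "$1" -type f -iname "${2:-*}" -exec md5sum {} + |
        sort |
        awk '
            { sum[NR] = $1; line[NR] = $0; n[$1]++ }
            END {
                for (i = 1; i <= NR; i++)
                    if (n[sum[i]] > 1) print line[i]
            }'
}
```

For example, find_checksum_dupes ./path/to/msgs '*.msg' would list the .msg files that share a checksum, grouped together because the output is sorted by checksum. Files whose names differ but whose contents are identical are still caught, since only the contents are hashed.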
bigearsbilly, I really would appreciate it if you dug out that Perl script, as the only thing I can use is bash, and badly at that!
The one thing I have to be aware of is that these are legal documents and all code used to parse them will need to be released to the courts, so if you don't want your code released, please do not feel bad for not digging it out for me.
I am looking for a way to find duplicate MP3s. I run a radio station, and we have several thousand duplicated MP3s, but the names are all different. Is there any way someone can point me in the right direction?