How to create md5sum for a directory?
I just copied over a 100 GB directory with lots of other directories and files within that one directory. How can I make sure everything was copied correctly and that I'm not missing any bits?
|
md5sum cannot compute a checksum for a directory as such. You have to recursively checksum every single file in it. Two alternatives:
1) use find in conjunction with md5sum:
Code:
find directory -type f -print0 | xargs -0 md5sum >> file.md5
and then verify the files against that list:
Code:
md5sum -c /path/to/file.md5
2) use md5deep, if it is installed:
Code:
md5deep -rl directory > file.md5 |
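One thing the find/md5sum recipe leaves implicit is that the paths stored in file.md5 have to resolve from wherever md5sum -c is later run. A minimal sketch of one way to handle that, assuming GNU find, sort, xargs and md5sum; the helper names make_manifest and verify_copy are made up for illustration:

```shell
# Sketch only: make_manifest and verify_copy are hypothetical helper names.

# make_manifest DIR OUT
# Write an MD5 line for every regular file under DIR into OUT.
# Paths are recorded relative to DIR, so the list can be checked
# against a copy that lives somewhere else.
make_manifest() {
    local dir=$1 out=$2
    ( cd "$dir" && find . -type f -print0 | sort -z | xargs -0 -r md5sum ) > "$out"
}

# verify_copy MANIFEST DIR
# Check the files under DIR against MANIFEST. Only files that fail are
# printed (--quiet), and the return status is non-zero if any file is
# missing or differs.
verify_copy() {
    local manifest=$1 dir=$2
    ( cd "$dir" && md5sum -c --quiet ) < "$manifest"
}

# Example (placeholder paths):
#   make_manifest /path/to/original /tmp/file.md5
#   verify_copy /tmp/file.md5 /path/to/copy
```

Note that this only proves every file in the original exists unchanged in the copy. Extra files in the copy are not reported.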
I typically use rsync to do the copies.
This program is designed to "synchronize files and directories." By default it decides whether a file needs to be copied by comparing size and modification time; with -c (--checksum) it compares file checksums instead. If you use this tool, it will probably accomplish your objective for you, with no further programming tricks required. |
Is there any way to make md5sum print out only the failed hashes and write them to a blank file? Because I have thousands of files and I can't look through thousands of lines to see which ones failed.
|
Code:
md5sum -c file.md5 | grep FAILED$ > failed_hashes |
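As a variation on the grep approach, GNU md5sum has a --quiet option that suppresses the OK lines by itself. The sketch below assumes GNU coreutils and uses a hypothetical helper name, list_failures. It prints just the paths of files that failed or could not be read:

```shell
# Sketch only: list_failures is a hypothetical helper name.

# list_failures MANIFEST
# Print the path of every file that fails verification, one per line.
# LC_ALL=C keeps the "FAILED" marker untranslated; the summary warning
# on stderr is discarded.
list_failures() {
    LC_ALL=C md5sum -c --quiet "$1" 2>/dev/null | sed -n 's/: FAILED.*$//p'
}

# Example (run from the directory the manifest paths are relative to):
#   list_failures file.md5 > failed_hashes
```

Both mismatches ("FAILED") and unreadable or missing files ("FAILED open or read") end up in the list.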
Why not just use diff?
Code:
diff -r dir1 dir2 |
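Worth knowing when going the diff route: plain diff only looks at the files directly inside the two directories, so -r is needed to descend into subdirectories, and -q keeps the output to one line per difference. A small sketch, with compare_dirs as a made-up wrapper name:

```shell
# Sketch only: compare_dirs is a hypothetical wrapper name.

# compare_dirs A B
# Recursively compare two directory trees byte by byte.
# Prints one line per differing or missing file and returns 0 only
# when the trees are identical. LC_ALL=C keeps the messages in English.
compare_dirs() {
    LC_ALL=C diff -rq "$1" "$2"
}

# Example:
#   compare_dirs dir1 dir2 > differences.txt
```

This needs both trees reachable from the same machine and reads every byte of both, so on 100 GB it is not faster than checksumming.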
With so many files, md5sum will take forever, and will be difficult to work with. I would also recommend rsync for this. Normally, yes I would use md5sum, but for 100 GB ... mmm I dunno.
|
Code:
rsync -avz -e ssh /path/to/local/dir user@host:/path/to/remote/dir |
Well, if you've already done the md5sum, then copy it over to the other directory and run 'md5sum -c' on it, and either output to a file or grep for FAILED.
|
Maybe worth noting this:
The command 'md5sum <file>' does not work for large files. I think 32GB is the limit (need to confirm the limit). 'cat <file> | md5sum' always works.
-----------------
If you are only interested in comparing two directories and not so much in learning-by-doing, you can just copy and use the following perl script. I wrote it out of daily need, and it's good enough. It is made a little elaborate in order to work for all filename/dirname characters in ascii range [32..255]. Just a newline in a filename can make simpler approaches fail.
Usage is: perl -w dircompare.pl <orig-dir-path> <new-dir-path>
The paths can be on different file systems.
Code:
#! /usr/bin/perl -w
use strict;
use warnings;
use Cwd;

my $cwd = cwd;
print "Current directory: $cwd\n";

my $hold = {};
my ($odir, $cdir) = (shift, shift);

# Build a "path => md5" table for each of the two directories.
foreach my $dir ($odir, $cdir) {
    print "$dir\n";
    # 'or' rather than '||', so that die fires when chdir itself fails
    chdir $dir or die "Error: Could not chdir to $dir\n";
    my @list = `find . -type f -exec md5sum {} \\;`;
    my $h = {};
    foreach (@list) {
        # skip anything that is not a "<hash> <path>" line
        next unless /^([^ ]+) (.*)$/;
        $h->{$2} = $1;
    }
    $hold->{$dir} = $h;
    chdir $cwd;
}

my @okeys = keys %{$hold->{$odir}};
print "Note: Original Directory has ", $#okeys + 1, " files\n";
my @ckeys = keys %{$hold->{$cdir}};
print " Compared Directory has ", $#ckeys + 1, " files\n\n\n";

# @kdne: paths missing from the copy, @md5mm: md5 mismatches,
# @exk: paths that exist only in the copy
my (@kdne, @md5mm, @exk);
for (@okeys) {
    if (!exists ${$hold->{$cdir}}{$_}) {
        push @kdne, $_;
        next;
    }
    if (${$hold->{$cdir}}{$_} ne ${$hold->{$odir}}{$_}) {
        push @md5mm, $_;
    }
}
for (@ckeys) {
    if (!exists ${$hold->{$odir}}{$_}) {
        push @exk, $_;
    }
}

#LOGGING:
print "Error Type A: missing file or read denied in $cdir ...\n";
if (!@kdne) { print " ... no errors\n"; }
else { print "ErrorTypeA $_\n" for @kdne; }
print "\n\n";

print "Error Type B: md5 mismatch between $cdir and $odir ...\n";
if (!@md5mm) { print " ... no errors\n"; }
else { print "ErrorTypeB $_\n" for @md5mm; }
print "\n\n";

print "Error Type C: Extra/modified paths in $cdir ...\n";
if (!@exk) { print " ... no errors\n"; }
else { print "ErrorTypeC $_\n" for @exk; }
print "\n\n END OF REPORT\n\n";
|
So sorry for kicking, but this is the first hit when searching for "md5sum directory" on Google and I hope I can help the next poor fool who finds this thread, searching with the wrong keywords. ;)
The rsync suggestion is perfect and can be executed like this:
Code:
rsync -lrthvcn --delete /home/source/dir/ /home/destination/dir/
sending incremental file list
sent 174.92K bytes received 118 bytes 143.77 bytes/sec
If there are no filenames between those two output lines, the contents are identical. The trailing slash on the source matters: without it, rsync would compare against /home/destination/dir/dir instead.
Just so you know what you're running, -lrthvcn stands for:
-l, --links copy symlinks as symlinks
-r, --recursive recurse into directories
-t, --times preserve modification times
-h, --human-readable output numbers in a human-readable format
-v, --verbose increase verbosity
-c, --checksum skip based on checksum, not mod-time & size
-n, --dry-run perform a trial run with no changes made
--delete delete extraneous files from destination dirs (don't worry, as it's a dry run. Do take note of it when actually syncing, and ask yourself if you want this.) |
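For scripting the same check, something along these lines might do. It is a sketch that assumes an rsync new enough to have --itemize-changes, and rsync_diff is a made-up name. It leaves out -v and -h so that identical trees produce no output at all:

```shell
# Sketch only: rsync_diff is a hypothetical helper name.

# rsync_diff SRC DST
# Dry-run (-n) checksum comparison (-c) of two directory trees.
# Prints one itemized line per file that differs, is missing from DST,
# or exists only in DST ("*deleting"). Nothing is copied or deleted.
# "${1%/}/" forces exactly one trailing slash, so the contents of SRC
# are compared with the contents of DST.
rsync_diff() {
    rsync -rlcn --delete --itemize-changes "${1%/}/" "${2%/}/"
}

# Example:
#   rsync_diff /home/source/dir /home/destination/dir
```

Empty output then means rsync found nothing it would change.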
Like W3ird_N3rd, I was looking for the answer to "md5sum directory" on Google. While colucix' response provides the answer, the following basic extensions to his solution might be helpful for some:
To only get successful md5 sums into the checksum file (errors are written to the console):
Code:
find directory -type f -print0 | xargs -0 md5sum 1>> file.md5
To hide the per-file results when verifying, so that only warnings and errors reach the console:
Code:
md5sum -c /path/to/file.md5 1> /dev/null |
Wouldn't "tar -c . | md5sum" do the trick?
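One caveat with hashing a tar stream is that the archive also contains timestamps, ownership and permissions, so two copies with identical file contents can still give different sums. If a single digest per directory is what you are after, a sketch like the following may be closer. It assumes GNU find, sort, xargs and md5sum, and dir_md5 is a made-up name:

```shell
# Sketch only: dir_md5 is a hypothetical helper name.

# dir_md5 DIR
# Print one MD5 digest for the whole tree: the per-file checksums,
# sorted by relative path, are themselves hashed. Only file contents
# and relative paths influence the result.
dir_md5() {
    ( cd "$1" && find . -type f -print0 | LC_ALL=C sort -z \
        | xargs -0 -r md5sum | md5sum | cut -d' ' -f1 )
}

# Example:
#   [ "$(dir_md5 /path/to/original)" = "$(dir_md5 /path/to/copy)" ] && echo identical
```

The sort step is there because find does not guarantee the same file order on two different file systems.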
|