Old 01-16-2015, 08:07 AM   #1
rouvas
Member
 
Registered: Aug 2006
Location: Greece
Distribution: Slackware.12.2
Posts: 104
Blog Entries: 3

Rep: Reputation: 20
md5sum troubles on big files (80GB .ova files)


Hi all,

I have three machines, A, B, and C, that I have been using regularly for the last 5 years or so.

A is Slackware.12.2 (32bit),
B is Slackware.13.37 (32bit)
C is CentOS.6.6 (64bit)

I regularly transfer files between them and perform various tasks on them, such as editing, database operations, compilations, web application development (mainly Java work), etc.

On machine A a VirtualBox guest (also Slackware, albeit 10.0) is running, and periodically (maybe every couple of months) I take a full export in the form of an 80GB .ova file, which I copy to machines B and C as a backup.

It never occurred to me to actually verify the integrity of the file until recently. On machine A (where the actual export is done) I have the last two .ova files (each about 80GB) and I computed an md5sum on them. I transferred the files to machines B and C and recomputed the md5sums.

To my complete surprise, md5sums on *each* machine are *different*.

Now, I've been using these machines for a long time, with a lot of traffic between them. Machines A and B are on my local LAN, while machine C is at a remote location. I suppose that if the network between them were bad I would have noticed (for example, the WAR generated on A would be corrupted on B or C at least once per day, since during development files fly between these machines). Similarly, if it were a RAM issue, I suppose I would have noticed.

I am still using these machines with no observable malfunctions, apart from the md5sum issue.

Does anyone have any clue about what might be off here?
 
Old 01-16-2015, 09:19 AM   #2
linosaurusroot
Member
 
Registered: Oct 2012
Distribution: OpenSuSE,RHEL,Fedora,OpenBSD
Posts: 982
Blog Entries: 2

Rep: Reputation: 244
After copying the file you could copy it again with rsync - which might correct differences introduced in transmission.
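
For example (host and path are illustrative), a second pass with -c compares full checksums and re-sends anything that differs:

Code:
# first copy
rsync -av big.ova user@hostB:/backup/
# second pass: -c forces checksum comparison and repairs any differences
rsync -avc big.ova user@hostB:/backup/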

A ZFS/BTRFS system might show discrepancies coming from the disk system.

Memory testing the machines is probably a good idea too.
 
Old 01-16-2015, 11:01 AM   #3
mlslk31
Member
 
Registered: Mar 2013
Location: Florida, USA
Distribution: Slackware, FreeBSD
Posts: 210

Rep: Reputation: 76
Could be anything: file system differences, kernel config differences, old coreutils, network errors, memory errors, overheating, bad cables, switch going bad, bitness issues, and so on.

Is there a case where you can transfer the image from Machine B or C back to Machine A and have the md5sum pass?

You might use dd, head, or tail to make a smaller image to transfer (9 GB? 5 GB? 3 GB?) and see where the md5sums start to differ after transfer. Look at the output of 'ifconfig' and 'dmesg' to see if errors increase or the link speed changes, simply because you're going to be transferring these files, anyway.
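
A rough sketch of that test (file names and sizes illustrative):

Code:
# carve a 3GB test piece off the front of the image
dd if=big.ova of=test.img bs=1M count=3072
md5sum test.img
scp test.img user@hostB:/tmp/
ssh user@hostB md5sum /tmp/test.img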

If you're in doubt about the network switch (or hub), you could try a different switch or a card-to-card connection.

If you're a fan of coreutils, you could upgrade it to be the same version on all PCs and deal with some configuration issues here and there. Or at the very least, just build coreutils without installing it and use the md5sum program that was built.
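
Something along these lines (version number illustrative):

Code:
# build in a scratch directory; no 'make install' needed
tar xf coreutils-8.23.tar.xz
cd coreutils-8.23
./configure && make
# use the freshly built binary directly
./src/md5sum /path/to/big.ova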

What are your file systems for the source and destinations?

On one of my setups, I get a different md5sum result on JFS with different kernels. One kernel has CONFIG_LBDAF=y--support for files over 2 TB, I think--and the other does not have it set. Then again, the partition on which that JFS resides may be a little bit sketchy. This is easily checked with 'zcat /proc/config.gz | grep LBDAF'. The tarballs I'm checking were not large--under 4 GB--but they'd untar successfully on one kernel and abort on the other kernel.
 
Old 01-16-2015, 12:03 PM   #4
rg3
Member
 
Registered: Jul 2007
Distribution: Fedora
Posts: 527

Rep: Reputation: Disabled
This happened to me a couple of times.

In the first case it was a hardware problem due to bad memory, as suggested above. Running memtest86+ for a night revealed the error in one of the passes.

In the second case it was simply the file being transferred over FTP and the server/client corrupting the file due to CR/LF mangling. A classic.
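
If FTP has to be used at all, forcing binary (image) mode before the transfer avoids the CR/LF translation:

Code:
# inside an interactive ftp session
binary
get big.ova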
 
Old 01-16-2015, 12:25 PM   #5
veerain
Senior Member
 
Registered: Mar 2005
Location: Earth bound to Helios
Distribution: Custom
Posts: 2,524

Rep: Reputation: 319
To test memory, run memtest86 or memtest86+.

You should use software like rsync for transfers, since a plain network transfer is not reliable: a TCP connection can have uncaught errors. For example, torrents carry hashes of their small segments for error checking. After a transfer you should verify with at least sha1 or sha256.
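
A minimal verification workflow (file name illustrative):

Code:
# on the sender
sha256sum big.ova > big.ova.sha256
# on the receiver, after copying both files across
sha256sum -c big.ova.sha256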
 
Old 01-17-2015, 09:52 AM   #6
rouvas
Member
 
Registered: Aug 2006
Location: Greece
Distribution: Slackware.12.2
Posts: 104
Blog Entries: 3

Original Poster
Rep: Reputation: 20
I selected one of the .ova files, 'split -d -b 1G' it into 80 files of 1GB each, and md5sum'd them.
I then used scp to copy both the big 80GB .ova file and the 80 smaller files from machine A to machine B and verified the md5sums.
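
In concrete terms, the workflow was along these lines (file names illustrative):

Code:
split -d -b 1G big.ova piece.
md5sum big.ova piece.* > sums.md5
scp big.ova piece.* user@hostB:/backup/
# then, on machine B, against a copy of sums.md5
md5sum -c sums.md5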

Somewhat predictably, the big 80GB .ova failed along with 20 out of 80 smaller 1GB files.

So, what am I supposed to do? Do I repeat the process until I am lucky and no errors are reported?
How do you people transfer big files around?
 
Old 01-17-2015, 09:57 AM   #7
Didier Spaier
LQ Addict
 
Registered: Nov 2008
Location: Paris, France
Distribution: Slint64-14.2 on Lenovo Thinkpad W520
Posts: 7,632

Rep: Reputation: 2571
Quote:
Originally Posted by rouvas View Post
How do you people transfer big files around?
With rsync.
 
Old 01-17-2015, 10:20 AM   #8
JackHair
Member
 
Registered: Aug 2009
Location: Netherlands
Distribution: Slackware64-current
Posts: 167

Rep: Reputation: 38
Quote:
Originally Posted by mlslk31 View Post
Could be anything: file system differences, kernel config differences, old coreutils, network errors, memory errors, overheating, bad cables, switch going bad, bitness issues, and so on.
There should not be a difference in md5 with any kernel or filesystem whatever you use. The whole point of md5 is that it's always the same; if it isn't, the file is broken. If it could differ between kernels there would be no point in md5sums.
 
Old 01-17-2015, 11:45 AM   #9
rg3
Member
 
Registered: Jul 2007
Distribution: Fedora
Posts: 527

Rep: Reputation: Disabled
Quote:
Originally Posted by rouvas View Post
I selected one of the .ova files, 'split -d -b 1G' it into 80 files of 1GB each, and md5sum'd them.
I then used scp to copy both the big 80GB .ova file and the 80 smaller files from machine A to machine B and verified the md5sums.

Somewhat predictably, the big 80GB .ova failed along with 20 out of 80 smaller 1GB files.

So, what am I supposed to do? Do I repeat the process until I am lucky and no errors are reported?
How do you people transfer big files around?
If you used scp to transfer files, they should have arrived correctly. The split you performed guarantees there are no problems due to file size limits. Those could happen if the destination filesystem were FAT32, for example, or if the software you used did not have large file support, which would be weird, because I think the scp that Slackware ships doesn't have that problem in either the 32-bit or the 64-bit version.

The most probable cause, IMHO, is a bad network cable, bad memory or a bad hard drive in one of the computers. Change network cables, run several passes of memtest86+ from the Slackware installation media to check for bad memory (usually overnight, as it takes some time), or check the source and destination drives with the badblocks command.
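
For instance (device name illustrative):

Code:
# non-destructive, read-only scan of the whole drive
badblocks -sv /dev/sdX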

Last edited by rg3; 01-17-2015 at 11:47 AM.
 
Old 01-17-2015, 01:00 PM   #10
mlslk31
Member
 
Registered: Mar 2013
Location: Florida, USA
Distribution: Slackware, FreeBSD
Posts: 210

Rep: Reputation: 76
Quote:
Originally Posted by JackHair View Post
There should not be a difference in md5 with any kernel or filesystem whatever you use. The whole point of md5 is that it's always the same; if it isn't, the file is broken. If it could differ between kernels there would be no point in md5sums.
True. Were this a 50-MB tarball, I'd absolutely agree. But it seems like every release of coreutils (and tar) has some improvement to either its large-file support or its sparse-file handling. This 80-GB beast might be a bit of both. Where these improvements change things, I wouldn't know without re-reading every Changelog and doing a before/after scenario. It's faster to just build without installing--under a half hour if you have the source--and try it.
 
Old 01-17-2015, 01:15 PM   #11
mlslk31
Member
 
Registered: Mar 2013
Location: Florida, USA
Distribution: Slackware, FreeBSD
Posts: 210

Rep: Reputation: 76
Quote:
Originally Posted by rouvas View Post
I selected one of the .ova files, 'split -d -b 1G' it into 80 files of 1GB each, and md5sum'd them.
I then used scp to copy both the big 80GB .ova file and the 80 smaller files from machine A to machine B and verified the md5sums.

Somewhat predictably, the big 80GB .ova failed along with 20 out of 80 smaller 1GB files.

So, what am I supposed to do? Do I repeat the process until I am lucky and no errors are reported?
How do you people transfer big files around?
Depends. I'm almost leaning towards having you shut the PC off for 10 minutes, turn it back on and see if you have better luck. Or shut the PC off for a minute, boot into a memtest86+ kernel, let it do a full run, shut the PC off for another minute, and boot into your regular system.

Of those 20 of 80 smaller 1GB files, can you copy them from machine B back to machine A and have the md5sum come out OK? Run 'ifconfig' before and after the transfer to see if the error rate has gone up.
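
For example (interface name illustrative):

Code:
# snapshot the error counters, transfer, then compare
ifconfig eth0 | grep -E 'errors|dropped' > before.txt
# ... run the transfer ...
ifconfig eth0 | grep -E 'errors|dropped' > after.txt
diff before.txt after.txt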

As for how I transfer large files around, it depends. Mostly I use rsync, sometimes scp, sometimes nc. Here at work, I use smbclient to transfer files from the server to a Windows client box, or use regular Windows networking (via Samba) from the Windows clients. If I have an NFS share exported at home, I'll use cp. ftp is in play at times as well. Depends on how much I care about how well the boxes are integrated.

What matters here is that the data gets there and that nothing overheats in the process. Some of the older cards at home, with worn parts and somewhat neglected drivers, don't do so well...all to store the data on worn drives. It's trial and error until everything works, then leave it all alone.
 
Old 01-17-2015, 01:23 PM   #12
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,359
Blog Entries: 55

Rep: Reputation: 3545
Quote:
Originally Posted by rouvas View Post
(..) I take a full export in the form of an 80GB .ova file, which I copy to machines B and C as a backup.
I wouldn't copy the .ova (unless you have some compelling reason for doing so) but rsync from within as it's not efficient to transfer items that haven't changed.


Quote:
Originally Posted by rouvas View Post
I am still using these machines with no observable malfunctions, apart from the md5sum issue. Does anyone have any clue about what might be off here?
If this is a Live guest that's getting copied then data within that image (/proc contents, shell history, login records, logs) will change on the fly.


Quote:
Originally Posted by linosaurusroot View Post
A ZFS/BTRFS system might show discrepancies coming from the disk system.
@OP: if you can use ZFS look into snapshots. Highly efficient.
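
A sketch of the idea (pool and dataset names illustrative):

Code:
# snapshot, then send only what changed since the previous snapshot
zfs snapshot tank/vm@2015-01-22
zfs send -i tank/vm@2015-01-15 tank/vm@2015-01-22 | ssh hostC zfs recv backup/vm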


*Wrt 'split': note there's '{md5,sha1}deep', which will allow you to perform piecewise (ranging from byte up to petabyte size) hashing of files. Knowing the offset inside the image helps determine whether this is about changing Live data or Something Completely Different...
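
For instance, hashing in 1GB pieces (file name illustrative):

Code:
# prints one hash per 1GB piece, together with the byte offsets of each piece
md5deep -p 1g big.ova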
 
2 members found this post helpful.
Old 01-22-2015, 04:14 AM   #13
rouvas
Member
 
Registered: Aug 2006
Location: Greece
Distribution: Slackware.12.2
Posts: 104
Blog Entries: 3

Original Poster
Rep: Reputation: 20
So the saga continues, but let me give some more info.

Machine A, where the big 80GB .ova file is generated, is a "server-grade" HP ProLiant with 2GB of ECC RAM and an Intel Xeon; it runs Slackware.10.0.32bit with an ext3 filesystem.

Machine B, where all the trouble is, is my main working machine, just a white-box I've assembled with 8GB of RAM and an AMD Phenom II X2 555. It resides on my local LAN along with "A" and runs Slackware.13.37.32bit with an ext4 filesystem.

Machine C is at a remote location (another country) and is a "server-grade" SuperMicro with 16GB of RAM and an Intel Core i7-3770; it runs CentOS.6.6.64bit with an ext4 filesystem.

I took one of the 80GB .ova files, did a 'split -d -b 1G' into 80 files of 1GB each, and md5sum'd both the big file and the smaller ones.

Initially plain FTP was used to transfer the files from "A" to "B" and from "A" to "C".
Since this produced errors, I tried 'scp'.
Transferring from "A" to "C" went fine, with the md5sums verified.
Alas, transferring from "A" to "B" failed once again.

Since rsync was suggested as the preferable method (BTW, why? AFAIK it does not offer any kind of error correction), I tried that as well for transferring files from "A" to "B"; it failed.

I went on and on transferring files from "A" to "B", trying plain FTP, HTTPS (through wget), rsync, and scp; all failed.
I tried to copy the 80GB file separately from the 80 1GB files: fail.
I tried to copy only the small files whose md5sums failed in the previous run, to no avail. The funny thing is, every time I checked the 80 small files (after transferring only the ones that the previous run reported as corrupted), a different set of files was being reported as corrupted!

Nevertheless, I banged away at "A" and "B", transferring files with all the above methods for 3 days, and never did I end up with a non-corrupted set.

Now, since "C" has received a non-corrupted copy from "A", I had established that whatever "A" was holding it could be verified at another machine ("C").

I gave up on "B" and assembled another machine ("D") from various spare parts that were lying around in the lab (I even found a 160GB IDE drive!), installed Slackware.14.1.32bit with ext4 and transferred the 80GB .ova file along with small ones from "A" to "D" using scp, and it worked! No errors!

I then transferred the files from "D" to "B" and bang! "B" fails again!

Before everyone jumps on me and suggests that there is something wrong with the hardware at machine "B", let me point out that I am working almost all day at "B". I have MySQL and PostgreSQL databases that I am using, I fire up JBoss, Wildfly and Tomcat instances along with the usual Apache, I have sendmail running, I do compiles, I fire up a couple of VMs, I transfer files around with other machines in the lab and out of it... not all of them at the same time of course, but my point is that if there were something wrong with the hardware **I would have noticed**.

The network goes through two HP ProCurve switches supporting about 20 machines with active users on them; if that were the problem **I would have noticed**.

I don't know what else to try. It seems that whatever I do, I just cannot get that 80GB .ova file onto machine "B"!

Could it be that the md5sum on Slackware.13.37 is the culprit?

Next, I'll write a small program that generates the same file, run it on each of the machines, compute the md5sums independently, and post the results.
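
The sketch I have in mind is something like this (seed and size are placeholders; it assumes an OpenSSL recent enough to offer AES-CTR in 'enc'), so the same seed produces bit-identical data on every machine:

Code:
# 8GB of reproducible pseudo-random data
openssl enc -aes-256-ctr -pass pass:myseed -nosalt < /dev/zero 2>/dev/null \
    | head -c 8G > testfile.bin
md5sum testfile.bin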

Meanwhile, thank you all for your suggestions.
 
Old 01-22-2015, 06:06 AM   #14
Labinnah
Member
 
Registered: May 2014
Location: Łódź, Poland
Distribution: Slackware-current
Posts: 35

Rep: Reputation: 7
For me, it looks like a hard drive error. Check SMART for errors. Even if there is nothing there, it can still be a hard drive problem.

You can create a big file filled with zeros and check whether it always reads back as zeros. If not, check whether the nonzero values are always in the same place.
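
A simple way to do that (size illustrative; the cache flush makes sure the data really comes back off the disk):

Code:
# write a 10GB file of zeros
dd if=/dev/zero of=zeros.bin bs=1M count=10240
# as root: flush the page cache so reads hit the disk
sync && echo 3 > /proc/sys/vm/drop_caches
# repeated reads must produce identical sums
md5sum zeros.bin
md5sum zeros.bin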
 
Old 01-22-2015, 06:53 AM   #15
55020
Senior Member
 
Registered: Sep 2009
Location: Yorks. W.R. 167397
Distribution: Slackware
Posts: 1,098
Blog Entries: 4

Rep: Reputation: 1456
Quote:
Originally Posted by rouvas View Post
Before everyone jumps on me and suggests that there is something wrong with the hardware at machine "B", let me point out that I am working almost all day at "B". I have MySQL and PostgreSQL databases that I am using, I fire up JBoss, Wildfly and Tomcat instances along with the usual Apache, I have sendmail running, I do compiles, I fire up a couple of VMs, I transfer files around with other machines in the lab and out of it... not all of them at the same time of course, but my point is that if there were something wrong with the hardware **I would have noticed**.
Not necessarily, no, unless it's something that checksums its work obsessively -- i.e. git. Nothing else comes close. I lost a weekend once in exactly this scenario because of bad memory. Git was giving bizarre SHA-1 errors. Everything else *seemed* to work fine. So I ran memtest86+, and there was a one-bit error in 64k of one DDR2 stick. I worked round it with a kernel boot parameter, but the stick quickly got worse and had to be replaced.

Your experiment has *proven* that the problem is (1) nondeterministic, and (2) confined to machine "B". What do we know that fits that profile? Bad memory on machine "B".

Disregard any observations you have made about ftp'd files; ftp doesn't do well on enormous files because it assumes that the network is reliable and simply splurges the plaintext. TCP/IP checksums its packets, but over 80GB you can have plenty of false negatives. scp does not have this problem because a network error will cause a failure to decrypt, which will result in retransmission. rsync is all-round awesome: it can checksum and correct in chunks, like you have just done manually.

Also, consider that the memory error rate of any activity you perform on machine "B" will be proportional to the volume of data it processes. Your file transfers and checksums have performed error detection on many hundreds of GB. JBoss, Wildfly and Tomcat will certainly *not* be processing hundreds of GB, and almost certainly won't notice if occasional bits have flipped.

I doubt it's the disk, because that would have caused noticeable pauses, but you can rule it out by checking dmesg and running 'smartctl -a <device>' and 'smartctl -t long <device>'. See 'man smartctl' for details.

So:

Read my lips. It's your memory on machine "B". Run memtest86+ on machine "B". Have you run memtest86 on machine "B" yet? Run it again. Run it for longer. Run both memtest86 and memtest86+. And then come back here and say "thank you".

And then meditate upon the thought that all your JBoss, Wildfly, Tomcat, compiles, and transfers have potentially been corrupted.
 
  

