[SOLVED] If OSI layers correct errors, how can an http download still fail md5sum?

Ulysses_ · 01-26-2014, 12:26 PM

I just downloaded a 3 GB ISO image over HTTP and tested it with md5sum. It failed. I downloaded the file again, and the MD5 sum checks out now.

Errors are supposed to be handled by some of the layers of the OSI model and not just the physical layer, in order to provide a reliable service over an unreliable physical medium.

Why might such a thorough scheme have failed to correct the error in the download above?

Ser Olmy · 01-26-2014, 04:09 PM

The error most likely occurred at one of the endpoints.

As you say, both TCP data and IP/TCP headers are protected by checksums. While these are not foolproof, the chance of a datagram with a random error generating the same checksum as the uncorrupted version of that same datagram is very slim indeed.
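To make "not foolproof" concrete, here is a minimal Python sketch of the 16-bit ones' complement sum described in RFC 1071, which is the arithmetic behind the TCP and IP checksums. It is only an illustration: a real TCP checksum also covers a pseudo-header, and the sample byte strings below are made up for the demo.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF


# Hypothetical payloads, chosen only to show which errors are caught.
original = bytes([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC])
flipped = bytes([0x12, 0x35, 0x56, 0x78, 0x9A, 0xBC])       # one bit changed
swapped = bytes([0x56, 0x78, 0x12, 0x34, 0x9A, 0xBC])       # two 16-bit words reordered
compensating = bytes([0x12, 0x35, 0x56, 0x77, 0x9A, 0xBC])  # +1 in one word, -1 in another

for name, payload in [("flipped", flipped), ("swapped", swapped),
                      ("compensating", compensating)]:
    detected = internet_checksum(payload) != internet_checksum(original)
    print(f"{name}: {'detected' if detected else 'NOT detected'}")
```

A single flipped bit always changes the sum, but reordered words or two errors that cancel out do not, which is why the checksum is good at catching random noise yet cannot guarantee integrity the way a comparison against a published MD5 can.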

However, bit errors occurring before the web server served the data, or after the data was received by the client, could still go unnoticed. This would include:

errors generated by the hard drive(s) on the server (not terribly likely as the data are checksummed, but it has been known to happen)
errors on the SATA bus or SATA chipset (the latter is not all that uncommon, while the former is usually caught by parity checking)
faulty RAM on the server, especially if non-ECC RAM is used
faulty server NIC (especially if the NIC supports TCP checksum offloading)
NIC/PCIe bus/RAM/SATA chipset/hard drive errors on the client
and last but not least: bugs in the web browser, especially with regard to cache handling

Considering the vast number of components involved in such a seemingly simple transaction, it's really remarkable that we don't see errors like that more often.

Ulysses_ · 01-27-2014, 02:19 AM

Thanks. If server hardware produces errors, then wouldn't that be noticed and fixed? Or is a mean time between errors of a few hours acceptable and common?

Ser Olmy · 01-27-2014, 02:30 AM

If the errors occur before the data is segmented by the TCP/IP stack, and no other mechanism detects them, they will be neither noticed nor corrected. The most common scenario is bad non-ECC RAM; unless programs segfault or kernels crash, you'll never know it happened unless you verify the data (as you did).
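For reference, the same end-to-end check that md5sum performs can be sketched in a few lines of Python. The function names here are my own, and the published hash is assumed to be a hex string copied from the download page.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks
    so a multi-gigabyte image does not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare the file's digest with the hash published by the site
    (hypothetical helper; ignores surrounding whitespace and case)."""
    return file_md5(path) == published_hex.strip().lower()
```

Because the hash is computed over the bytes as they sit on your disk, it covers every component between the server's copy and yours, not just the network path.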

Buggy or faulty chipsets can also introduce errors. Parity checking on the HyperTransport bus or the PCIe bus is of no use if the data corruption occurs before the data is put on the bus.

No, a mean time of a few hours between errors is not acceptable; it indicates faulty hardware or seriously buggy software. On servers with ECC memory I see perhaps one correctable error per year.

Ulysses_ · 01-27-2014, 07:31 AM

Now I'm getting worried about my PC's RAM and other hardware too. How do I know all existing files are OK? Or whether new ones I create myself are corrupted?

Ser Olmy · 01-27-2014, 07:44 AM

Run memtest86+ overnight. If there are no errors, you can be reasonably sure your RAM is OK.
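As for the files themselves, one possible approach (a rough sketch, not a substitute for the memory test) is to record a hash of each file once and re-check later. The function names and the choice of SHA-256 are my own assumptions. Note that a changed hash only tells you the bytes differ, so a file you edited on purpose will show up too.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash one file in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its current hash."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def changed_files(root, manifest):
    """List files from an earlier manifest whose contents now differ
    or that are missing."""
    current = build_manifest(root)
    return sorted(p for p, h in manifest.items() if current.get(p) != h)
```

Build the manifest while you trust the data, store it somewhere else (for example as JSON on another disk), and run changed_files against it from time to time.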