
itnaa 12-19-2006 12:46 PM

Data Integrity Checks
Apologies upfront if this is a misplaced post...

I will be shuttling roughly 30-50 GB of data from one server to another, possibly over the network (LAN). It will be a periodic task, and I am investigating ways to verify the integrity of each transfer (since I will be erasing the data from the source afterwards).

A couple of questions:

1. Is md5sum the best means to verify the integrity of data transfers? Are there better/newer approaches?

2. Is there a function in a Perl library that will let me check data integrity?

I hope these aren't stupid questions. Any responses/suggestions would be much appreciated. Thanks.

Itnaa Sarakaam

Electro 12-20-2006 01:23 PM

Look into dar. It is like tar but it is designed for backups. You can also use cpio using its error correction option.

anomie 12-20-2006 02:16 PM


Is md5sum the best means to verify the integrity of data transfers?
md5 would be appropriate for this task. (Alternatives include sha1 and sha256, etc.)
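The approach anomie describes can be sketched with the standard coreutils tools. This is a minimal illustration, assuming GNU `md5sum`; the paths and file names here are placeholders, not anything from the thread:

```shell
# Minimal sketch of an md5-based transfer check; all names here
# ($src, /tmp/sums.md5) are illustrative placeholders.
src=$(mktemp -d)
echo "example payload" > "$src/file1.txt"

# Source side: record one checksum line per file (relative paths).
(cd "$src" && find . -type f -exec md5sum {} +) > /tmp/sums.md5

# Destination side, after the copy: re-check every file.
# Exits nonzero (and names the file) if anything changed in transit.
(cd "$src" && md5sum -c /tmp/sums.md5)
```

The same pattern works unchanged with `sha1sum` or `sha256sum` in place of `md5sum`.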

itnaa 12-21-2006 08:59 AM

Thanks Folks. Appreciate your responses. Will follow up on your leads & suggestions.

itnaa 12-21-2006 04:09 PM

... a bit of searching around suggests the key phrase is "message digest". It does appear that they would be safe for this purpose, i.e. they are used to verify "uniqueness" of the data, as opposed to being focused on security. My impression is that SHA-based algorithms are "better" than the MD5 ones...

If someone could point out the "safest" approach (safest in the sense of uniqueness, rather than cryptographic security), I would much appreciate it!

A couple of links for the interested:

The former answers my question about Perl CPAN library functions. Both of the above also cover many other algorithms, but do Linux implementations incorporate algorithms beyond MD5 and SHA?

An additional concern: how long would it typically take to generate such a sum for a data block roughly DVD-sized, or for one around 30 GB? Are there practical tips on how to approach generating sums for very large datasets? For example, is it wiser to generate one sum for the whole block, or to chop it up into smaller sub-blocks and compute a series of sums?
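One argument for the sub-block approach: if a single 30 GB sum mismatches, the whole transfer must be redone, while per-chunk sums narrow the damage to one piece. A rough sketch, assuming GNU `split` and `sha256sum`; the file names and the 1 MiB piece size are arbitrary choices for illustration:

```shell
# Sketch: hash a large file in fixed-size pieces so a bad transfer
# can be narrowed to one piece. Names and sizes are placeholders.
big=$(mktemp)
dd if=/dev/zero of="$big" bs=1M count=4 2>/dev/null   # stand-in "large" file

work=$(mktemp -d)
split -b 1M "$big" "$work/chunk."        # chunk.aa, chunk.ab, ...
( cd "$work" && sha256sum chunk.* ) > "$work/chunks.sha256"

# On the destination, re-split the received copy the same way, then:
( cd "$work" && sha256sum -c chunks.sha256 )
```

In practice, hashing speed is usually bounded by disk read throughput, so many small sums and one big sum take about the same total time; the win from chunking is in localizing and re-sending only the corrupt piece.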

Any thoughts/tips would be much appreciated. Thanks.

itnaa 12-21-2006 04:19 PM

Thanks for the dar reference. I checked it out and it uses CRC for confirming integrity. I am a bit confused about their statement:

"... hanks to CRC (cyclic redundancy checks), dar is able to detect data corruption in the archive. Only the file where data corruption occurred will not be possible to restore, but dar will restore the others even when compression or encryption (or both) is used...."

Question: after it archives the data, wouldn't it check to see whether the process was truly successful, and perhaps redo it if not?

I would hate to be told, during restoration, that a file was corrupt. I'm naive & a newbie in these matters, perhaps I've misunderstood things?

Electro 12-21-2006 08:26 PM

When the backup is done, you have to test it to make sure it is correct, but do not test it on the same system. Test the backup on another, similar computer. If it works there, then you can consider it good.
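That "test the backup" step can be sketched with plain checksums: restore the archive to a scratch area and compare per-file sums against the originals. Directory names below are placeholders, and `cp` stands in for a real restore from whatever archiver is used:

```shell
# Sketch of testing a backup by comparing per-file checksums.
# All names are placeholders; `cp` stands in for a real restore.
orig=$(mktemp -d); restored=$(mktemp -d)
echo "payload" > "$orig/a.txt"
cp "$orig/a.txt" "$restored/a.txt"

# Sorted, path-relative checksum lists from both trees:
(cd "$orig"     && find . -type f -exec sha256sum {} + | sort) > /tmp/before.sum
(cd "$restored" && find . -type f -exec sha256sum {} + | sort) > /tmp/after.sum

# Empty diff (exit 0) means every restored file matches its original.
diff /tmp/before.sum /tmp/after.sum && echo "backup verified"
```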

Data will not always be 100% perfect. In the real world you rarely get better than 99%, so expect some corruption.

To keep the chances of data corruption down, put both computers on a line conditioner, use excellent power supplies, and use ECC/parity memory.

What dar can do is what they have said: if CRC is used (with or without compression) and one or more areas in the dar archive are corrupted, it can still resume restoring the other files in the archive. If you also have incremental and differential backups, the affected files can hopefully be recovered from those if the full backup cannot restore them.

Like the Nike slogan. Just do it.

itnaa 12-22-2006 02:28 PM

Thanks, Electro. I do appreciate your suggestions & thoughts. As you note... time to take the plunge... :)
