MDADM and bit rot

road hazard · 03-16-2017, 01:02 PM

Mods, being a n00b, if this fits better in that section, feel free to move it.

I've seen more than one discussion pertaining to this topic:

http://unix.stackexchange.com/questi...ion-with-mdadm

I've read claims that MDADM's scrubs don't really fix errors, and that MDADM doesn't verify parity on reads (only on writes). Any truth to all this? Maybe it's a bug that has since been addressed, because I can't find any recent discussions, only posts from 2008 to about 2013.

thordn · 03-16-2017, 04:23 PM

With RAID5 there is no way to know which block is bad unless the disk reports an error. With RAID6 it is possible in principle to work out which block is wrong, but I cannot say whether the current MDADM does so. I normally take an md5 or sha checksum of every file on an array so I can later see whether there has been any corruption (and in which file).

When I started using RAID5 you could quite often get silent corruption of the data, due to bandwidth problems on the motherboard or because the system was gradually becoming unreliable, etc. So keeping an external checksum is recommended; then you at least know whether the system is in good condition.
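
For reference, a scrub on an md array can be started by hand through sysfs, and the kernel reports how much data and parity disagreed in mismatch_cnt. A minimal sketch, assuming an array named md0 (a placeholder, substitute your own) and root privileges; the helper names and the MD_SYSFS_ROOT variable are made up for this post so the functions can be pointed at a test directory:

```shell
# Sketch only: the helper names and MD_SYSFS_ROOT are illustrative, not part of mdadm.
# The sysfs files sync_action and mismatch_cnt come from the kernel md driver.

# Start a read-only consistency check ("scrub") on the given array, e.g. md0.
md_start_check() (
    dev=$1
    root=${MD_SYSFS_ROOT:-/sys/block}
    echo check > "$root/$dev/md/sync_action"
)

# Print the current sync state of the array (idle, check, repair, ...).
md_sync_state() (
    dev=$1
    root=${MD_SYSFS_ROOT:-/sys/block}
    cat "$root/$dev/md/sync_action"
)

# Print the mismatch counter recorded by the most recent check.
md_mismatch_count() (
    dev=$1
    root=${MD_SYSFS_ROOT:-/sys/block}
    cat "$root/$dev/md/mismatch_cnt"
)
```

You would run md_start_check md0, wait until md_sync_state md0 prints idle again, then read md_mismatch_count md0. A non-zero count only says that data and parity disagreed somewhere; it does not name the file, which is why I keep file checksums as well.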

road hazard · 03-16-2017, 05:59 PM

I currently use RAID 6 with my MDADM setup. What is this checksum voodoo you speak of?

thordn · 03-16-2017, 07:02 PM

Typically I do something like:

cd <root of structure i want to check>

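find . -type f ! -name md5sum.sum ! -name md5check.txt -exec md5sum {} + >md5sum.sum

(The two ! -name tests keep the output files out of their own list, and + hands md5sum many files per call instead of one.) This may take many hours depending on the size of the tree; for me the md5sum.sum file can be some 500 MB.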

Then to check you do:

md5sum -c md5sum.sum >md5check.txt

grep FAIL md5check.txt | more

If you then get a FAILED on a file you know has not been modified, or you get different failures on a second run, you know your setup has problems.
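
The two-run comparison can be scripted. A rough sketch, assuming GNU md5sum and a checksum file with relative paths like the one generated above; the function names are made up for this post, and it has to be run from the same directory the checksums were taken in:

```shell
# list_failures MANIFEST: print the names of files that no longer verify.
# --quiet drops the "OK" lines; LC_ALL=C keeps the "FAILED" text unlocalized.
list_failures() (
    LC_ALL=C md5sum -c --quiet "$1" 2>/dev/null | sed -n 's/: FAILED.*$//p' | sort
)

# compare_runs MANIFEST: verify twice and say whether the failures are stable.
compare_runs() (
    first=$(list_failures "$1")
    second=$(list_failures "$1")
    if [ -z "$first" ] && [ -z "$second" ]; then
        echo "all files match"
    elif [ "$first" = "$second" ]; then
        echo "same failures on both runs:"
        printf '%s\n' "$first"
    else
        echo "failures differ between runs, the setup itself looks unreliable:"
        printf 'run 1:\n%s\nrun 2:\n%s\n' "$first" "$second"
    fi
)
```

Calling compare_runs md5sum.sum prints one of the three verdicts. Keep in mind that on a small tree the second pass may be served from the page cache rather than the disks, so identical results on both runs do not prove the hardware is fine.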

road hazard · 03-16-2017, 08:10 PM

Thanks for the info! Doesn't seem too complicated but if that link I originally posted is true, I sure do wish mdadm could be updated to do some repairing during a scrub.

Unless you already know how, I think I'll look into automating that so it only emails me if there are any failures.
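
A rough sketch of what that automation could look like, assuming GNU md5sum and a working mail command (mailx or similar) on the box. The function name, the MAILER override, and any path or address passed in are placeholders, not anything mdadm or md5sum provides:

```shell
# verify_and_mail ROOT RECIPIENT: re-check ROOT/md5sum.sum and mail only on failure.
# MAILER is a hypothetical override so the mail step can be swapped out or tested.
verify_and_mail() (
    root=$1
    rcpt=$2
    cd "$root" || exit 2
    # --quiet prints only the files that fail; md5sum exits non-zero if any did.
    failures=$(LC_ALL=C md5sum -c --quiet md5sum.sum 2>&1)
    status=$?
    if [ "$status" -ne 0 ]; then
        printf '%s\n' "$failures" | ${MAILER:-mail} -s "checksum failures in $root" "$rcpt"
    fi
    exit "$status"
)
```

Run from a weekly cron job as something like verify_and_mail /mnt/array me@example.com (both arguments made up here), it would stay silent unless at least one file fails to verify.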