LinuxQuestions.org - Software RAID 5 crash and wrongful failed disk flagged

Ok, I've just had a nasty hard-drive crash, and as Murphy would have it, it happened at the end of the semester, when I have a ***load of work to hand in Tuesday.

Thankfully, I had recently upgraded to Slackware 11 and still had a copy of my previous setup lying on a soon-to-be-erased partition on one of my disks, so it is not as bad as it could have been.

Now to the issue at hand:

I have 4 identical WD 160 GB hard drives. Two of them are new; the other two are 2 years old. One of the 18000-hour-old drives died this afternoon, with no pre-fail indications beforehand. Now smartctl -a /dev/hdg reports that it's pretty much dead.

However, when it died, software RAID (RAID 5 on 4 disks) kicked the wrong drive, /dev/hde, out of the arrays and kept (!!?!) the bad one!
When I came home, everything was frozen solid: no kernel panic message, nothing. When I tried to restart things, /dev/md0, which is my root RAID 5 device, wouldn't initialise properly. So I got the rescue CD out and, after quite a bit of fiddling, managed to extract the /var/log/syslog file. I was shocked to see that the wrong disk had been kept up.

Since activity went on instead of stopping right there, every file that was touched in any way during the few hours it took for the system to finally die is corrupted. Forcing the "good" disk back into the arrays does allow quite a lot of stuff to be salvaged; unfortunately, the MySQL databases are wrecked beyond repair.

So, does anyone have suggestions for preventing this kind of horror story from repeating itself with software RAID? It's pretty evident from the SMART diagnostics that /dev/hde was good all along. I still don't get why the bad disk wasn't brushed aside, which would have stopped all the arrays before they lost their sync.
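
To make the "which disk did the kernel actually kick" check less dependent on digging through syslog after the fact, here is a minimal sketch that lists the members flagged (F) in /proc/mdstat, so they can be compared by hand against what smartctl says about each drive. The sample text, the partition names (hde1, hdg1, and so on) and the function name failed_members are made up for illustration; they are not taken from my actual setup, and this only reports state, it doesn't prevent the wrong disk from being kicked.

```python
import re

# Made-up sample in the usual /proc/mdstat layout; device names are illustrative.
SAMPLE_MDSTAT = """\
Personalities : [raid5]
md0 : active raid5 hdg1[2] hde1[1](F) hdc1[3] hda1[0]
      468872704 blocks level 5, 64k chunk, algorithm 2 [4/3] [U_UU]

unused devices: <none>
"""


def failed_members(mdstat_text):
    """Return {array_name: [members flagged (F)]} for arrays with failed members."""
    result = {}
    for line in mdstat_text.splitlines():
        # Array header lines look like "md0 : active raid5 hdg1[2] hde1[1](F) ..."
        header = re.match(r"^(md\d+)\s*:\s*(.*)$", line)
        if not header:
            continue
        name, rest = header.groups()
        # A member the kernel has marked faulty carries an (F) right after its slot number.
        failed = re.findall(r"(\S+?)\[\d+\]\(F\)", rest)
        if failed:
            result[name] = failed
    return result


if __name__ == "__main__":
    try:
        with open("/proc/mdstat") as f:
            text = f.read()
    except OSError:
        # Fall back to the sample when not running on a Linux box with md.
        text = SAMPLE_MDSTAT
    for array, members in failed_members(text).items():
        print(f"{array}: kernel marked {', '.join(members)} as failed")
```

Run from cron, something like this would at least show early on that the member marked failed is not the one SMART is complaining about.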