help determining RAID fail event
I recently received an email stating "A Fail event has been detected on md device /dev/md0". It looks like the array went back to resyncing everything from the good drive, but smartctl doesn't indicate any issues with the supposedly bad drive, so I'm wondering whether this needs to be investigated further or was a false alarm. If it was a false alarm, how can I clear the flag in mdadm that marks the drive as faulty?
Here's some relevant information from when this event was triggered:
Code:
A Fail event has been detected on md device /dev/md0.
Code:
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.208] (local build)
From what I read, you don't have smartd configured to run self-tests on your devices. You aren't the only one; mine was misconfigured not that long ago (<1 year).
That's not good; device self-tests are how you get early notice that something is about to break, or just broke. Take a look at the comments in /etc/smartd.conf to see what's possible. I've got (pretty much) the following in my /etc/smartd.conf: Code:
DEVICESCAN -a -o on -S on -n standby,q -s (S/../.././02|L/../../6/03)
Code:
# smartctl -a /dev/sda
EDIT: You can trigger such tests by hand as well; it's just best to have the daemon do that work for you.
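For reference, a sketch of triggering those tests by hand (the device name /dev/sda is an assumption; substitute your own disks):

```shell
# Start a short self-test (runs inside the drive's firmware, ~2 minutes):
smartctl -t short /dev/sda

# Or an extended (long) test, which can take hours on a large drive:
smartctl -t long /dev/sda

# Once the test has had time to finish, review the self-test log:
smartctl -l selftest /dev/sda
```

The tests run in the drive itself, so the commands return immediately; check the log afterwards.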
My experience at work with hardware RAID controllers is that the failure tag won't disappear until the array is fully rebuilt.
That does not address what triggered the original alert, though. The date stamp on the email should help you find where to start looking in the system logs. I agree with Richard Cranium about configuring automated smartd self-tests; I do this with servers at work. I schedule daily short tests and weekly long tests. Another weekly cron job grabs the smartctl output and sends an email. I do likewise with some basic weekly RAID emails. Disclaimer: I am not a RAID guru and don't play one on TV.
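For what it's worth, a hedged sketch of the kind of weekly report job I mean (the device list, the mail(1) command, and a working local MTA are all assumptions; adjust to your system):

```shell
#!/bin/sh
# /etc/cron.weekly/smart-report -- mail the full smartctl output
# for each disk to root, once a week. Assumes mail(1) and a local MTA.
for disk in /dev/sda /dev/sdb /dev/sdc; do
    smartctl -a "$disk"
done | mail -s "Weekly SMART report for $(hostname)" root
```

A similar script wrapping mdadm --detail /dev/md0 covers the RAID side.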
@Richard Cranium, thank you for mentioning smartd.conf as I did not have that configured properly at all to monitor my devices!
@upnort, good idea on matching the email's timestamp against the system logs. I've attached a snippet of them just in case. I've been experimenting with rtcwake to put the system into the "freeze" state until 23:59:59, at which point the system comes out of sleep and, after about ten minutes, begins an rsnapshot backup. I'm wondering if sdc didn't wake up quickly enough for the RAID, so it decided to resync. That said, according to mdadm --detail the array has been rebuilt; however, the drive is still marked faulty... Code:
Apr  4 00:00:11 defiant kernel: [ 5853.713271] Freezing user space processes ... (elapsed 0.001 seconds) done.
Code:
{~}# mdadm --detail /dev/md0
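On the original question of clearing the faulty flag: a sketch, assuming /dev/sdc1 is the member mdadm marked faulty (substitute your actual partition), and only after you're satisfied the drive itself is healthy:

```shell
# Drop the member that mdadm has marked faulty...
mdadm /dev/md0 --remove /dev/sdc1

# ...then add it back; mdadm will resync it into the array:
mdadm /dev/md0 --add /dev/sdc1

# Watch the rebuild progress:
cat /proc/mdstat
```

The faulty flag goes away with the removal; the re-added member shows as "spare rebuilding" until the resync completes.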
Code:
# First ATA/SATA or SCSI/SAS disk. Monitor all attributes, enable
Code:
# HERE IS A LIST OF DIRECTIVES FOR THIS CONFIGURATION FILE.
For every SMART-capable device in the system:
Code:
DEVICESCAN -a -o on -S on -n standby,q -s (S/../.././02|L/../../6/03)
PLUS...
EDIT: I forgot to mention that smartd will also log to syslog if something bad happens in a test. If you already have a log-scraping tool that looks for things to alarm on, you can use that to send a warning email. smartd can also be configured to email warnings itself.
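As a sketch of the email side: the -m and -M directives are documented in smartd.conf(5); the recipient address here is an assumption.

```
# /etc/smartd.conf -- mail warnings to root; "-M test" sends one test
# message at daemon startup so you can confirm delivery actually works.
DEVICESCAN -a -o on -S on -n standby,q -m root -M test -s (S/../.././02|L/../../6/03)
```

Restart smartd after editing and check your mailbox for the test message.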
Keep in mind there is a simple
Code:
DEVICESCAN
line already in the stock config file; comment it out. Otherwise, when you put your new DEVICESCAN line at the bottom of the file, the config-file parser doesn't bother to look at it (smartd stops at the first DEVICESCAN it finds). Zero guesses on how I know that tidbit.
One more thing: I've put this into my /etc/rc.d/rc.local (which is more RAID-related than smartd-related)...
Code:
# Increase timeouts for all non-ERC drives.
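For context, here's a hedged sketch of what such an rc.local snippet commonly looks like, following the usual Linux RAID wiki advice on timeout mismatch. The disk list is an assumption, and using smartctl's exit status to detect SCT ERC support is a simplification; adjust to your array members:

```shell
# For each array member: try to set a 7-second error-recovery timeout
# (SCT ERC, in deciseconds) in the drive firmware. If the drive doesn't
# support SCT ERC, raise the kernel's SCSI command timeout instead, so
# the kernel outlasts the drive's internal retries.
for disk in sda sdb sdc; do    # assumed member disks -- adjust
    if smartctl -l scterc,70,70 /dev/$disk >/dev/null 2>&1; then
        echo "SCT ERC set to 7.0s on /dev/$disk"
    else
        echo 180 > /sys/block/$disk/device/timeout
        echo "/dev/$disk lacks SCT ERC; kernel timeout raised to 180s"
    fi
done
```

The point is to avoid a healthy-but-slow drive being kicked out of the array just because the kernel gave up before the drive did.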
To be honest, I'm fairly certain that someone else on the forum mentioned the timeout mismatch; I don't remember who did so or when. While it's possible I ran across it while reading the RAID wiki, the mere fact that this is the first time I've posted anything about it tells me that someone else beat me to the punch. Hopefully they'll show up and tell us when I saw said mention.