Software RAID Down / Inactive - Need help troubleshooting / Recovering
Linux - Server: This forum is for the discussion of Linux software used in a server-related context.
Hello All,
Last night I found my software RAID 5 down. It is on a server running Debian Squeeze 6.0.9 with 4x 3TB drives. I am new to software RAID and could really use some help troubleshooting this to get it back up. I am a newer Linux user; though I have used multiple distros over the years for basic servers, I am still an amateur, and this server is a file store for my home Windows network.
Below is the info I gathered so far.
Any help would be truly appreciated.
Code:
# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 6.0.9 (squeeze)
Release: 6.0.9
Codename: squeeze
Code:
# mdadm --query /dev/md0
/dev/md0: is an md device which is not active
Based on Update Time, sdd dropped out Aug 3 and sdb dropped out Aug 6.
Once two drives fail, the RAID fails. Your best bet is to force the assembly with sdb, sdc, and sde, with sdd missing. Last, add the stale sdd drive and resync to it. Hopefully you won't have lost anything too important.
Enterprise RAID drives have time-limited error recovery. Consumer drives assume they are non-RAID so will try for a long time before giving up. This results in timeouts on the system and failing the drive out of the RAID. Check the system log to see what caused the drive to fail. If it was a timeout, you can try increasing the default timeout from 60 seconds to about 2 minutes to give recovery a chance. For example:
echo 120 >/sys/block/sdd/device/timeout
That might help reduce the problem in the future.
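Note that the sysfs value resets on reboot. A sketch for checking the current setting and reapplying it at boot via a udev rule (the rule filename and device match are assumptions; adjust to your drives):
Code:

```shell
# Check the current SCSI command timeout for the drive
cat /sys/block/sdd/device/timeout

# To reapply 120s at every boot, a udev rule along these lines can be used
# (hypothetical filename /etc/udev/rules.d/60-disk-timeout.rules):
#   ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", \
#     RUN+="/bin/sh -c 'echo 120 > /sys/block/%k/device/timeout'"
```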
Please post the drive models and firmware versions from /proc/scsi/scsi
Quote:
Originally Posted by smallpond
Based on Update Time, sdd dropped out Aug 3 and sdb dropped out Aug 6.
Once two drives fail, the RAID fails. Your best bet is to force the assembly with sdb, sdc, and sde, with sdd missing. Last, add the stale sdd drive and resync to it. Hopefully you won't have lost anything too important.
Thanks so much for the reply! Sorry again, I'm a newb. How do you recommend I do that? What commands should I run?
Quote:
Originally Posted by smallpond
Enterprise RAID drives have time-limited error recovery. Consumer drives assume they are non-RAID so will try for a long time before giving up. This results in timeouts on the system and failing the drive out of the RAID. Check the system log to see what caused the drive to fail. If it was a timeout, you can try increasing the default timeout from 60 seconds to about 2 minutes to give recovery a chance. For example:
echo 120 >/sys/block/sdd/device/timeout
That might help reduce the problem in the future.
I found this in the SMART log. I tried to condense the errors to the most meaningful over the past few days:
Code:
Sunday, August 03, 2014 9:09:26 AM Vault smartd[2162]: Device: /dev/disk/by-id/scsi-SATA_ST320410A_6FG07QVG [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 57 to 58
Tuesday, August 05, 2014 12:39:26 AM Vault smartd[2162]: Device: /dev/disk/by-id/scsi-SATA_ST320410A_6FG07QVG [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 59 to 58
Sunday, August 03, 2014 8:09:31 AM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005030fbb8 [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 116 to 119
Tuesday, August 05, 2014 8:39:27 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005030fbb8 [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 120 to 102
Sunday, August 03, 2014 5:39:27 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbf5e72 [SAT] failed to read SMART Attribute Data
Sunday, August 03, 2014 5:39:27 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbf5e72 [SAT] not capable of SMART self-check
Sunday, August 03, 2014 5:39:28 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbf5e72 [SAT] Read SMART Self Test Log Failed
Sunday, August 03, 2014 5:39:28 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbf5e72 [SAT] Read Summary SMART Error Log failed
Sunday, August 03, 2014 8:09:30 AM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbf5e72 [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 108 to 117
Sunday, August 03, 2014 8:09:27 AM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbff7de [SAT] Failed SMART usage Attribute: 184 End-to-End_Error.
Tuesday, August 05, 2014 7:39:26 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbff7de [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 110 to 104
Wednesday, August 06, 2014 3:09:29 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbff7de [SAT] 40 Currently unreadable (pending) sectors
Wednesday, August 06, 2014 3:09:32 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbff7de [SAT] 40 Offline uncorrectable sectors
Wednesday, August 06, 2014 3:09:32 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbff7de [SAT] ATA error count increased from 91 to 97
Wednesday, August 06, 2014 3:09:32 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbff7de [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 104 to 101
Wednesday, August 06, 2014 2:09:29 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbff7de [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 118 to 107
Wednesday, August 06, 2014 3:09:32 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbff7de [SAT] SMART Usage Attribute: 187 Reported_Uncorrect changed from 9 to 3
Wednesday, August 06, 2014 3:09:32 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbff7de [SAT] SMART Usage Attribute: 188 Command_Timeout changed from 100 to 95
Sunday, August 03, 2014 8:09:29 AM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005daaadc0 [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 111 to 117
Wednesday, August 06, 2014 4:09:27 AM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005daaadc0 [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 120 to 104
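For a fuller picture than the smartd log excerpts above, each drive's current attribute table and error logs can be dumped with smartctl (assuming smartmontools is installed; run it against each member device in turn):
Code:

```shell
# Current SMART attribute values for one drive
smartctl -A /dev/sdd

# The ATA error log (the source of the "ATA error count" lines above)
smartctl -l error /dev/sdd

# Results of any past SMART self-tests
smartctl -l selftest /dev/sdd
```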
Quote:
Originally Posted by smallpond
Please post the drive models and firmware versions from /proc/scsi/scsi
I hope this is the command you meant, as I didn't get any output. Sorry, I'm sure it's just a typo on my end, or a slight command modification is needed to get the output, but I am not sure:
Assemble - read the man page on mdadm. I hate to give you a command that you will blindly run that could potentially lose your data. The assemble command has a --force option to use a disk that is stale, so you should be able to reassemble the RAID on 3 drives and start it resyncing to the 4th. Ask questions if there's something you don't understand.
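For reference, a hedged sketch of what that sequence might look like (device names are the ones from this thread; confirm the actual member devices or partitions with mdadm --examine before running anything):
Code:

```shell
# Inspect each member's metadata first; compare Update Time and Events count
mdadm --examine /dev/sd[b-e]

# Stop the inactive array, then force-assemble from the three freshest
# members, leaving the stale sdd out (use partition names such as sdb1
# if the members are partitions rather than whole disks):
mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sdb /dev/sdc /dev/sde

# Only after the array is up and the data checks out, re-add sdd to resync:
mdadm --add /dev/md0 /dev/sdd
```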
System log is named either /var/log/messages or /var/log/syslog depending on the whim of the distro creators. It may be very large, but you can look through it for 'sd' to find disk-related errors around the right times.
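For example, to dig the disk-related kernel messages for the four members out of the log (assuming Debian's /var/log/syslog; rotated copies are compressed):
Code:

```shell
# Disk-related messages in the current log
grep -E 'sd[b-e]' /var/log/syslog | less

# Same search across the rotated, compressed logs
zgrep -E 'sd[b-e]' /var/log/syslog.*.gz | less
```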
Your system was built without /proc/scsi support - that's ok. In that case:
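The exact command isn't shown above, but the same information (model and firmware) can usually be pulled per drive with smartctl, assuming smartmontools is installed:
Code:

```shell
# Model and firmware version for each RAID member, without /proc/scsi
for d in /dev/sd[b-e]; do
    echo "== $d =="
    smartctl -i "$d" | grep -E 'Device Model|Firmware Version'
done
```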
I am still going to read about that command and --force. I have also ordered some replacement drives and downloaded a few gigabytes of syslogs to check, but for now here is this info:
Quote:
Originally Posted by smallpond
Your system was built without /proc/scsi support - that's ok. In that case:
I have the clone running; 13 hours in, it had about 70 hours left.
My questions are:
Is there any way to speed it up while it is running?
Is this the recommended approach, or should I cancel the clone, force the RAID to assemble with the bad drive, and just have it rebuild to the new drive?
I personally would have gone with bs=1G or some other large number, which would reduce the overhead of tiny reads. Barring that, there is not much you can do if the copy is maxing out the ports/devices.
I would check dmesg while the dd is working, just to make sure no fixable errors are being issued, such as the device going down and then being brought back up (device/bus reset; a sure sign of bad cables, among other things), but nothing so unfixable (by the OS) that it completely gives up on the device and stops the dd.
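A couple of dd details worth noting while cloning a failing drive (a sketch; sdd as source and sdf as target are assumptions, substitute your actual devices):
Code:

```shell
# conv=noerror,sync keeps dd going past read errors instead of aborting,
# zero-padding the failed block. Note that with a very large bs, one bad
# sector costs you the whole block, so there is a trade-off between speed
# and how much data is lost around bad spots:
dd if=/dev/sdd of=/dev/sdf bs=64M conv=noerror,sync

# An already-running dd prints its progress when sent SIGUSR1:
kill -USR1 "$(pidof dd)"
```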