Software RAID Down / Inactive - Need help troubleshooting / Recovering
Linux - Server: This forum is for the discussion of Linux software used in a server-related context.
Hello All,
Last night I found my software RAID 5 down. It is on a server running Debian Squeeze 6.0.9 with 4x 3TB drives. I am new to software RAID and could really use some help troubleshooting this to get it back up. I am a newer Linux user; though I have used multiple distros over the years for basic servers, I am still an amateur, and this server is a file store for my home Windows network.
Below is the info I gathered so far.
Any help would be truly appreciated.
Code:
# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux 6.0.9 (squeeze)
Release: 6.0.9
Codename: squeeze
Code:
# mdadm --query /dev/md0
/dev/md0: is an md device which is not active
Based on Update Time, sdd dropped out Aug 3 and sdb dropped out Aug 6.
Once two drives fail, the RAID fails. Your best bet is to force the assembly with sdb, sdc, and sde, with sdd missing. Last, add the stale sdd drive and resync to it. Hopefully you won't have lost anything too important.
Enterprise RAID drives have time-limited error recovery. Consumer drives assume they are non-RAID so will try for a long time before giving up. This results in timeouts on the system and failing the drive out of the RAID. Check the system log to see what caused the drive to fail. If it was a timeout, you can try increasing the default timeout from 60 seconds to about 2 minutes to give recovery a chance. For example:
echo 120 >/sys/block/sdd/device/timeout
That might help reduce the problem in the future.
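Note that the sysfs value resets on reboot. A sketch for checking the current setting and reapplying it at boot via a udev rule (the rule filename and device match are assumptions; adjust to your drives):
Code:

```shell
# Check the current SCSI command timeout for the drive
cat /sys/block/sdd/device/timeout

# To reapply 120s at every boot, a udev rule along these lines can be used
# (hypothetical filename /etc/udev/rules.d/60-disk-timeout.rules):
#   ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", \
#     RUN+="/bin/sh -c 'echo 120 > /sys/block/%k/device/timeout'"
```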
Please post the drive models and firmware versions from /proc/scsi/scsi
Quote:
Originally Posted by smallpond
Based on Update Time, sdd dropped out Aug 3 and sdb dropped out Aug 6.
Once two drives fail, the RAID fails. Your best bet is to force the assembly with sdb, sdc, and sde, with sdd missing. Last, add the stale sdd drive and resync to it. Hopefully you won't have lost anything too important.
Thanks so much for the reply! Sorry again, I'm a newb. How do you recommend I do that? What commands should I run?
Quote:
Originally Posted by smallpond
Enterprise RAID drives have time-limited error recovery. Consumer drives assume they are non-RAID so will try for a long time before giving up. This results in timeouts on the system and failing the drive out of the RAID. Check the system log to see what caused the drive to fail. If it was a timeout, you can try increasing the default timeout from 60 seconds to about 2 minutes to give recovery a chance. For example:
echo 120 >/sys/block/sdd/device/timeout
That might help reduce the problem in the future.
I found this in the SMART log. I tried to condense the errors to the most meaningful over the past few days:
Code:
Sunday, August 03, 2014 9:09:26 AM Vault smartd[2162]: Device: /dev/disk/by-id/scsi-SATA_ST320410A_6FG07QVG [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 57 to 58
Tuesday, August 05, 2014 12:39:26 AM Vault smartd[2162]: Device: /dev/disk/by-id/scsi-SATA_ST320410A_6FG07QVG [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 59 to 58
Sunday, August 03, 2014 8:09:31 AM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005030fbb8 [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 116 to 119
Tuesday, August 05, 2014 8:39:27 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005030fbb8 [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 120 to 102
Sunday, August 03, 2014 5:39:27 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbf5e72 [SAT] failed to read SMART Attribute Data
Sunday, August 03, 2014 5:39:27 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbf5e72 [SAT] not capable of SMART self-check
Sunday, August 03, 2014 5:39:28 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbf5e72 [SAT] Read SMART Self Test Log Failed
Sunday, August 03, 2014 5:39:28 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbf5e72 [SAT] Read Summary SMART Error Log failed
Sunday, August 03, 2014 8:09:30 AM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbf5e72 [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 108 to 117
Sunday, August 03, 2014 8:09:27 AM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbff7de [SAT] Failed SMART usage Attribute: 184 End-to-End_Error.
Tuesday, August 05, 2014 7:39:26 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbff7de [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 110 to 104
Wednesday, August 06, 2014 3:09:29 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbff7de [SAT] 40 Currently unreadable (pending) sectors
Wednesday, August 06, 2014 3:09:32 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbff7de [SAT] 40 Offline uncorrectable sectors
Wednesday, August 06, 2014 3:09:32 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbff7de [SAT] ATA error count increased from 91 to 97
Wednesday, August 06, 2014 3:09:32 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbff7de [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 104 to 101
Wednesday, August 06, 2014 2:09:29 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbff7de [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 118 to 107
Wednesday, August 06, 2014 3:09:32 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbff7de [SAT] SMART Usage Attribute: 187 Reported_Uncorrect changed from 9 to 3
Wednesday, August 06, 2014 3:09:32 PM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005cbff7de [SAT] SMART Usage Attribute: 188 Command_Timeout changed from 100 to 95
Sunday, August 03, 2014 8:09:29 AM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005daaadc0 [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 111 to 117
Wednesday, August 06, 2014 4:09:27 AM Vault smartd[2162]: Device: /dev/disk/by-id/wwn-0x5000c5005daaadc0 [SAT] SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 120 to 104
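For a fuller picture than the smartd log excerpts above, each drive's current attribute table and error logs can be dumped with smartctl (assuming smartmontools is installed; run it against each member device in turn):
Code:

```shell
# Current SMART attribute values for one drive
smartctl -A /dev/sdd

# The ATA error log (the source of the "ATA error count" lines above)
smartctl -l error /dev/sdd

# Results of any past SMART self-tests
smartctl -l selftest /dev/sdd
```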
Quote:
Originally Posted by smallpond
Please post the drive models and firmware versions from /proc/scsi/scsi
I hope this is the command you meant, as I didn't get any output. Sorry, I'm sure it's just a typo on my end, or a slight command modification is needed to get the output, but I am not sure:
Assemble - read the man page on mdadm. I hate to give you a command that you will blindly run that could potentially lose your data. The assemble command has a --force option to use a disk that is stale, so you should be able to reassemble the RAID on 3 drives and start it resyncing to the 4th. Ask questions if there's something you don't understand.
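For reference, a hedged sketch of what that sequence might look like (device names are the ones from this thread; confirm the actual member devices or partitions with mdadm --examine before running anything):
Code:

```shell
# Inspect each member's metadata first; compare Update Time and Events count
mdadm --examine /dev/sd[b-e]

# Stop the inactive array, then force-assemble from the three freshest
# members, leaving the stale sdd out (use partition names such as sdb1
# if the members are partitions rather than whole disks):
mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sdb /dev/sdc /dev/sde

# Only after the array is up and the data checks out, re-add sdd to resync:
mdadm --add /dev/md0 /dev/sdd
```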
System log is named either /var/log/messages or /var/log/syslog depending on the whim of the distro creators. It may be very large, but you can look through it for 'sd' to find disk-related errors around the right times.
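For example, to dig the disk-related kernel messages for the four members out of the log (assuming Debian's /var/log/syslog; rotated copies are compressed):
Code:

```shell
# Disk-related messages in the current log
grep -E 'sd[b-e]' /var/log/syslog | less

# Same search across the rotated, compressed logs
zgrep -E 'sd[b-e]' /var/log/syslog.*.gz | less
```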
Your system was built without /proc/scsi support - that's ok. In that case:
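The exact command isn't shown above, but the same information (model and firmware) can usually be pulled per drive with smartctl, assuming smartmontools is installed:
Code:

```shell
# Model and firmware version for each RAID member, without /proc/scsi
for d in /dev/sd[b-e]; do
    echo "== $d =="
    smartctl -i "$d" | grep -E 'Device Model|Firmware Version'
done
```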
I am still going to read about that command and --force. I have also ordered some replacement drives and downloaded a few gigabytes of syslogs to check, but for now here is this info:
Quote:
Originally Posted by smallpond
Your system was built without /proc/scsi support - that's ok. In that case:
I have the clone running; 13 hours in, it had about 70 hours left.
My questions are:
Is there any way to speed it up while it is running?
Is this the recommended approach, or should I cancel the clone, force the RAID to assemble with the bad drive, and just have it rebuild to the new drive?
I personally would have gone with bs=1G or some other large number, which would reduce the overhead of tiny reads. Barring that, there is not much you can do if the copy is maxing out the ports/devices.
I would check dmesg while the dd is working, just to make sure no fixable errors are being issued, such as the device going down and then being brought back up (device/bus reset; a sure sign of bad cables, among other things), but nothing so unfixable (by the OS) that it completely gives up on the device and stops the dd.
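A couple of dd details worth noting while cloning a failing drive (a sketch; sdd as source and sdf as target are assumptions, substitute your actual devices):
Code:

```shell
# conv=noerror,sync keeps dd going past read errors instead of aborting,
# zero-padding the failed block. Note that with a very large bs, one bad
# sector costs you the whole block, so there is a trade-off between speed
# and how much data is lost around bad spots:
dd if=/dev/sdd of=/dev/sdf bs=64M conv=noerror,sync

# An already-running dd prints its progress when sent SIGUSR1:
kill -USR1 "$(pidof dd)"
```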