RAID degraded, partition missing from md0

reano · 11-15-2013, 11:04 AM

Seems I spoke too soon. About 20% into the recovery process sdb1 failed again, and this time sdb2 in md1 also failed. Seems the whole sdb drive is busted.

Code:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[2](F) sda1[0]
      1464710976 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[0] sdb2[1](F)
      24006528 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sdb3[1] sda3[0]
      1441268544 blocks super 1.2 [2/2] [UU]

md3 : active raid1 sdc1[0] sdd1[1]
      2930133824 blocks super 1.2 [2/2] [UU]

md4 : active raid1 sdf2[1] sde2[0]
      2929939264 blocks super 1.2 [2/2] [UU]

unused devices: <none>

I'll have to replace the drive. Now the tricky part is, how do I know which physical hard drive is sdb? Is there a way to tell?

Meh.

Ser Olmy · 11-15-2013, 11:17 AM

Quote:

Originally Posted by reano

I'll have to replace the drive. Now the tricky part is, how do I know which physical hard drive is sdb? Is there a way to tell?

Now you know why RAID array drives should be clearly labeled...

Assuming these are SATA drives, sdb is (most likely) the drive connected to the in-use SATA port with the second-lowest number.

Since it's no longer part of the array, it will be the only inactive drive. If the drives have on-board activity LEDs (few do these days), you should be able to tell by just looking.

You could try spinning the drive down with hdparm -Y. You should be able to hear it power down.
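
Another option, as a rough sketch: the symlinks under /dev/disk/by-id usually carry the model and serial number, and the serial is printed on the drive's label. The helper name ids_for_disk and the optional directory argument are made up for illustration; this assumes udev populates /dev/disk/by-id on your system.

```shell
# List the /dev/disk/by-id names whose symlink points at a given whole disk.
# $1: kernel name such as "sdb"
# $2: directory to scan (hypothetical override; defaults to /dev/disk/by-id)
ids_for_disk() (
    disk=$1
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue
        target=$(readlink "$link")          # e.g. ../../sdb
        if [ "${target##*/}" = "$disk" ]; then
            printf '%s\n' "${link##*/}"     # partitions (sdb1, ...) are skipped
        fi
    done
)

# Example: ids_for_disk sdb
```

Match the serial it prints against the sticker on each drive before pulling anything.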

reano · 11-15-2013, 11:27 AM

Quote:

Originally Posted by Ser Olmy

Now you know why RAID array drives should be clearly labeled...

Yup, lesson learned indeed.

I'll try the hdparm on Monday. Is there a way to power it back up? I might need to toggle it a few times to find the right one, as there are 6 drives in that box :S

Also, before I power down the drive and replace it, I'll need to remove sdb1, sdb2 and sdb3 from md0, md1 and md2. Do I just do that normally, as in:

Code:

mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1

mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm --manage /dev/md1 --remove /dev/sdb2


mdadm --manage /dev/md2 --fail /dev/sdb3
mdadm --manage /dev/md2 --remove /dev/sdb3

Or is there another way to go about it?

Ser Olmy · 11-15-2013, 11:33 AM

No, that's how you do it; first "--fail", then "--remove".
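
If you'd rather not type six commands, the same sequence can be sketched as a small loop. The function name drop_members, the md0:sdb1 argument format and the RUN variable are made up for illustration; by default it only prints the commands so you can check them first.

```shell
# For each "array:member" argument, emit the --fail then --remove command.
# Prints the commands by default; set RUN=1 (hypothetical switch) to execute them.
drop_members() {
    for pair in "$@"; do
        md=${pair%%:*}      # part before the colon, e.g. md0
        part=${pair#*:}     # part after the colon, e.g. sdb1
        for action in --fail --remove; do
            if [ "${RUN:-0}" = 1 ]; then
                mdadm --manage "/dev/$md" "$action" "/dev/$part"
            else
                echo "mdadm --manage /dev/$md $action /dev/$part"
            fi
        done
    done
}

# Dry run: drop_members md0:sdb1 md1:sdb2 md2:sdb3
```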

(And any kind of disk access should wake a sleeping drive, like running fdisk or parted, or dd'ing a few blocks to /dev/null.)

reano · 11-15-2013, 11:38 AM

Now something very concerning started happening.
I wanted to install a package using apt-get. I got the following error:

Code:

root@lia:~# apt-get install gdisk
-bash: /usr/bin/apt-get: Input/output error

So then I did:

Code:

root@lia:~# smartctl -a /dev/sdb

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               /0:0:1:0
Product:
User Capacity:        600 332 565 813 390 450 bytes [600 PB]
Logical block size:   774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
Bus error

But, I also get the following on sda:

Code:

root@lia:~# smartctl -a /dev/sda

smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               /0:0:0:0
Product:
User Capacity:        600 332 565 813 390 450 bytes [600 PB]
Logical block size:   774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
Bus error

What the heck....? Is sda failing now as well?

cat /proc/mdstat still shows:

Code:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[2](F) sda1[0]
      1464710976 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sda2[0] sdb2[1](F)
      24006528 blocks super 1.2 [2/1] [U_]

md2 : active raid1 sdb3[1] sda3[0]
      1441268544 blocks super 1.2 [2/2] [UU]

md3 : active raid1 sdc1[0] sdd1[1]
      2930133824 blocks super 1.2 [2/2] [UU]

md4 : active raid1 sdf2[1] sde2[0]
      2929939264 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Indicating that only sdb failed, with 2 out of the 3 partitions down so far.
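
To pull just the failed members out of that output, something like this rough sketch should do. The function name failed_members is made up; it only assumes the /proc/mdstat layout shown above, where failed members carry an (F) flag.

```shell
# Read /proc/mdstat-style text on stdin and print "array member"
# for every member flagged (F).
failed_members() {
    awk '$1 ~ /^md/ && $2 == ":" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(F\)/) {
                dev = $i
                sub(/\[.*/, "", dev)    # strip the "[2](F)" suffix
                print $1, dev
            }
    }'
}

# Example: failed_members < /proc/mdstat
```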

Ser Olmy · 11-15-2013, 11:41 AM

The faulty drive may be blocking the controller. An emergency reboot may be in order here.

You also need to check the S.M.A.R.T. status of all remaining drives asap.

(For instance, are you sure the rebuild failure was caused by a write error on /dev/sdb, and not a read error on /dev/sda?)
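
For the S.M.A.R.T. check, a minimal sketch that pulls the pending-sector count out of the attribute table. The function name pending_sectors is made up, and it assumes the usual ATA attribute layout from smartctl -A, with the raw value in the tenth column.

```shell
# Read `smartctl -A` output on stdin and print the raw value of
# the Current_Pending_Sector attribute (prints nothing if it is absent).
pending_sectors() {
    awk '$2 == "Current_Pending_Sector" { print $10 }'
}

# Example loop over the remaining drives in this box:
# for d in sdc sdd sde sdf; do
#     printf '%s: %s\n' "$d" "$(smartctl -A "/dev/$d" | pending_sectors)"
# done
```

Anything other than 0 deserves a closer look with smartctl -a on that drive.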

reano · 11-15-2013, 11:48 AM

Quote:

Originally Posted by Ser Olmy

The faulty drive may be blocking the controller. An emergency reboot may be in order here.

You also need to check the S.M.A.R.T. status of all remaining drives asap.

(For instance, are you sure the rebuild failure was caused by a write error on /dev/sdb, and not a read error on /dev/sda?)

Normal reboot console command? Or is there another way to do an emergency reboot?

The other drives:

sdc has 0 pending sectors.
sdd has 24 pending sectors, and shows "Error 244 occurred at disk power-on lifetime: 8689 hours (362 days + 1 hours)"
sde has 0 pending sectors, but also shows "Error 51 occurred at disk power-on lifetime: 8009 hours (333 days + 17 hours)"
sdf has 0 pending sectors.

This spells crisis to me :/ Of the 6 drives, 3 seem to be busted, one on each array - and I have no idea what's going on with sda.

Ser Olmy · 11-15-2013, 11:55 AM

The reboot command (or the 3-fingered salute) should be used, if possible. Only when that fails should one resort to alternate strategies involving the SysRq key or the power button.
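
If you still have a root shell but the reboot command itself hangs, the SysRq route can also be driven through /proc/sysrq-trigger. This is only a sketch of a last resort: the function name sysrq_reboot, the path argument and the SYSRQ_DELAY variable are made up for illustration, and the keys are s (sync), u (remount read-only) and b (reboot immediately).

```shell
# Last-resort reboot: sync, remount read-only, then reboot via SysRq.
# $1: trigger file (hypothetical override; defaults to /proc/sysrq-trigger)
# SYSRQ_DELAY: seconds to wait between keys (hypothetical, defaults to 2)
sysrq_reboot() {
    trigger=${1:-/proc/sysrq-trigger}
    for key in s u b; do
        echo "$key" > "$trigger"
        sleep "${SYSRQ_DELAY:-2}"
    done
}

# Only call this as root, and only when a normal reboot has failed:
# sysrq_reboot
```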

Have you been checking these arrays regularly? I run

Code:

echo check > /sys/devices/virtual/block/<md device>/md/sync_action

at least once a week. Also, one should always monitor the S.M.A.R.T. status of all drives with smartd.
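
To cover every array in one go, a loop along these lines could go into a weekly cron job. It's a sketch, not a polished script: the function name scrub_all_arrays and the optional directory argument are made up, and it needs to run as root.

```shell
# Write "check" to the sync_action file of every md array found
# under a sysfs block directory.
# $1: directory to scan (hypothetical override; defaults to /sys/devices/virtual/block)
scrub_all_arrays() {
    for f in "${1:-/sys/devices/virtual/block}"/md*/md/sync_action; do
        [ -w "$f" ] || continue     # skips the unexpanded pattern when no arrays exist
        echo check > "$f"
    done
}

# Example: scrub_all_arrays
```

Progress shows up in /proc/mdstat while the check runs.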

reano · 11-15-2013, 11:56 AM

Quote:

Originally Posted by Ser Olmy

The reboot command (or the 3-fingered salute) should be used, if possible. Only when that fails should one resort to alternate strategies involving the SysRq key or the power button.

Have you been checking these arrays regularly? I run

Code:

echo check > /sys/devices/virtual/block/<md device>/md/sync_action

at least once a week. Also, one should always monitor the S.M.A.R.T. status of all drives with smartd.

Do I need to remove any drives before rebooting? The server is offsite, and I'm accessing it remotely at the moment.
EDIT: Just lost remote connection. Server is still up as it's still routing traffic, but I can't access it via SSH anymore.

Ser Olmy · 11-15-2013, 12:02 PM

It would have been really great if someone could unplug the drive causing these bus errors, but the problem is we don't know with 100 % certainty that /dev/sdb is the culprit (although it's more than likely). Also, the drives aren't labeled.

Does this server have built-in remote access functionality, or do you have to rely on the OS?

Edit: I guess you need the OS, parts of which are probably spewing "oops" messages at the console right now.

reano · 11-15-2013, 12:04 PM

Quote:

Originally Posted by Ser Olmy

It would have been really great if someone could unplug the drive causing these bus errors, but the problem is we don't know with 100 % certainty that /dev/sdb is the culprit (although it's more than likely). Also, the drives aren't labeled.

Does this server have built-in remote access functionality, or do you have to rely on the OS?

Edit: I guess you need the OS, parts of which are probably spewing "oops" messages at the console right now.

See my edit. I'll have to drive in and shutdown -h, then locate sdb, disconnect it, and start her back up. Anything else I need to know before going in? (if the server doesn't come back up I won't have internet access from the premises... talk about a double-crisis)

Ser Olmy · 11-15-2013, 12:07 PM

Make sure to bring a live CD (like, say, System Rescue CD) in case the system fails to boot. You could even set up an emergency NAT router with a CD/DVD like that.

reano · 11-15-2013, 12:09 PM

Quote:

Originally Posted by Ser Olmy

Make sure to bring a live CD (like, say, System Rescue CD) in case the system fails to boot. You could even set up an emergency NAT router with a CD/DVD like that.

Will do. If possible, I'll still try to remove sdb1,2,3 from md0,1,2 before shutting down and removing the drive. Right?

Ser Olmy · 11-15-2013, 12:10 PM

Sounds like a plan.

reano · 11-15-2013, 12:14 PM

Also, any idea why we're seeing errors on 3 drives instead of 1 (refer to post #37)? Normally I'd suspect a RAID controller, but this is software RAID.