RAID degraded, partition missing from md0
Hey guys,
We're having a very weird issue at work. Our Ubuntu server has 6 drives, set up with RAID1 as follows:
/dev/md0, consisting of: /dev/sda1 /dev/sdb1
/dev/md1, consisting of: /dev/sda2 /dev/sdb2
/dev/md2, consisting of: /dev/sda3 /dev/sdb3
/dev/md3, consisting of: /dev/sdc1 /dev/sdd1
/dev/md4, consisting of: /dev/sde1 /dev/sdf1
As you can see, md0, md1 and md2 all use the same 2 drives (split into 3 partitions). I should also note that this is done via Ubuntu software RAID, not hardware RAID.
Today, the /dev/md0 RAID1 array shows as degraded - it is missing the /dev/sdb1 device. But since /dev/sdb1 is only a partition (and /dev/sdb2 and /dev/sdb3 are working fine), it's obviously not the whole drive that's gone AWOL; it seems the partition itself is missing. How is that even possible? And what could we do to fix it?
My output of cat /proc/mdstat:
Code:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
Any help would be greatly appreciated! |
Hi,
it's not so unusual to have problems with just one partition on a disk. You can try to rebuild with the existing sdb, or you can replace sdb and then rebuild. See for example http://www.howtoforge.com/replacing_..._a_raid1_array for the latter option. However, before doing anything, make sure you are familiar with: https://raid.wiki.kernel.org/index.php/Linux_Raid

Evo2. |
Hi,
Code:
mdadm --assemble --scan

Evo2. |
Actually, let me clarify - if I do a:
Code:
mdadm --assemble --scan
Code:
mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1 |
I think it's better to stop the md device first. What is the output of mdadm --detail /dev/md0?
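Something along these lines (just a sketch - only stop the array if nothing on it is mounted; the assemble command is the one you already posted):
Code:
mdadm --detail /dev/md0
mdadm --stop /dev/md0
mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1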
Thanks |
I can't stop the device :(
Also, the / root filesystem is mounted on md0. The output you requested is: Code:
/dev/md0: |
I think if it's showing "removed", the following command should recover it:
Code:
mdadm /dev/md0 -a /dev/sdb1
Thanks |
I am unable to see any error message above. Ideally, for replacing a device, I follow:
Code:
mdadm /dev/md0 -f /dev/sdb1
mdadm /dev/md0 -r /dev/sdb1
mdadm /dev/md0 -a /dev/sdb1
Thanks |
Quote:
Code:
mdadm: add new device failed for /dev/sdb1 as 2: Invalid argument |
Got the following results:
Code:
root@lia:~# mdadm /dev/md0 -f /dev/sdb1 |
Hate to bump a thread, but I still need help with this. Any advice, anyone? :)
|
Hi,
mdadm doesn't seem to see /dev/sdb1 at all. I suggest you investigate its status with other tools, e.g. fdisk.

Evo2. |
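For example, a quick read-only look at the disk and its partition table (parted works just as well as fdisk here):
Code:
fdisk -l /dev/sdb
parted /dev/sdb print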
Does the command below show any output?
Quote:
|
Quote:
Code:
brw-rw---- 1 root disk 8, 17 Nov 8 08:33 sdb1 |
Check the /dev directory and see if the /dev/sdb1 device actually exists. If it doesn't, you'll need to recreate it with fdisk, parted or whatever tool you prefer to use to manage partitions.
If the device is missing but the partition seems to be there, try running partprobe then check the /dev directory again. |
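To illustrate the above, something along these lines (all harmless; partprobe just asks the kernel to re-read the partition table):
Code:
ls -l /dev/sdb*
fdisk -l /dev/sdb
partprobe /dev/sdb
ls -l /dev/sdb*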
The next step is to figure out why mdadm returns an error message when you try to reference /dev/sdb1. See what
Code:
mdadm --examine /dev/sdb1
has to say. According to /proc/mdstat (in your first post), /dev/md0 only has one member, /dev/sda1. As long as the /dev/sdb1 partition is valid and identical in size to /dev/sda1 (which fdisk -l /dev/sdb or parted /dev/sdb print should be able to confirm or deny), you should be able to re-add /dev/sdb1 with the following command:
Code:
mdadm --manage /dev/md0 --add /dev/sdb1
It would also be worth checking the drive's S.M.A.R.T. status while you're at it:
Code:
smartctl -a /dev/sdb |
mdadm --examine /dev/sdb1 gives the following:
Code:
mdadm: No md superblock detected on /dev/sdb1.
Code:
Model: ATA ST3000VX000-9YW1 (scsi)
Code:
Model: ATA ST3000VX000-9YW1 (scsi)
Code:
mdadm: add new device failed for /dev/sdb1 as 2: Invalid argument
Code:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
|
The /dev/sdb device has 15 "pending" sectors, meaning it's waiting for a write command to reallocate those sectors. While 15 is not an alarmingly large number, the fact that they're all "pending" rather than "reallocated" suggests the defects may have appeared at approximately the same time, which could be an indication of drive failure. You should run badblocks -ns on /dev/sdb1 before proceeding, and check the S.M.A.R.T. status for /dev/sdb again when it's done.
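For reference, roughly what that would look like - badblocks -n is the non-destructive read-write test and -s just prints progress, but it will still hammer the disk for quite a while:
Code:
badblocks -ns /dev/sdb1
smartctl -a /dev/sdb | grep -Ei 'pending|reallocated'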
The "invalid argument" error is usually caused by a non-removed device. The "--add" command is only valid if the array is online and can be expanded, or if a device has been removed. However, the output from mdadm --detail /dev/md0 in post #8 does indeed show the second device as "removed". Strange. Could you port the output from: Code:
ls /sys/block/md0/md/ |
I can't run the badblocks at the moment, as it uses all the server resources and totally kills the network users logged onto it :/
Which log file specifically do you want me to check when I try to add the device back to md0? Output of ls /sys/block/md0/md/ is:
Code:
array_size layout reshape_position sync_max
|
Do a tail -f /var/log/messages in one terminal window while you attempt to add /dev/sdb1 to md0 in another.
The files in /sys/block/md0/md confirm that there's no reference from md0 to anything other than /dev/sda1. It should be possible to add another device/partition. |
I don't have a /var/log/messages, but I did do a tail on the syslog, and it showed the following while trying to add the partition back to md0:
Code:
Nov 15 08:38:25 lia kernel: [674827.954967] ata1: EH complete |
It seems the md driver ran into one of the bad sectors on the drive. If you can't run badblocks, try using dd to overwrite the partition with zeros:
Code:
dd if=/dev/zero of=/dev/sdb1 bs=8192 oflag=direct
The "oflag=direct" parameter bypasses the cache, and has the effect of slowing the process down significantly. With any luck, the other users won't notice anything. The real reason it's there, however, is to prevent cache management from doing read-ahead, as that would cause it to attempt to read the bad sectors, which in turn would cause dd to abort. |
Quote:
Other than that, there's nothing in particular you need to consider before attempting to add the partition to the RAID array again. |
Quote:
I just hope the recovery process completes without any issues. I'll let you know! One thing that strikes me as a bit weird, though: in all the arrays, the disks have IDs 0 and 1. But on md0, sda1 is ID 0, and the re-added sdb1 is ID 2, not ID 1. Does that make a difference? Output of cat /proc/mdstat:
Code:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] |
Seems I spoke too soon. About 20% into the recovery process sdb1 failed again, and this time sdb2 in md1 also failed. Seems the whole sdb drive is busted.
Code:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
Meh. |
Quote:
Assuming these are SATA drives, sdb is (most likely) the drive connected to the SATA port with the second lowest number that's in use. Since it's no longer part of the array, it will be the only inactive drive. If the drives have on-board activity LEDs (few do these days), you should be able to tell by just looking. You could try spinning the drive down with hdparm -Y. You should be able to hear it power down. |
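For example, assuming sdb is the one you want to single out - spin it down, listen, then wake it with a small read (just a sketch):
Code:
hdparm -Y /dev/sdb
# any read access should wake it again, e.g.:
dd if=/dev/sdb of=/dev/null bs=512 count=1 iflag=direct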
Quote:
I'll try the hdparm on Monday. Is there a way to power it back up? I might need to toggle it a few times to find the right one - there are 6 drives in that box :S
Also, before I power down the drive and replace it, I'll need to remove sdb1, sdb2 and sdb3 from md0, md1 and md2. Do I just do that normally, as in:
Code:
mdadm --manage /dev/md0 --fail /dev/sdb1 |
No, that's how you do it; first "--fail", then "--remove".
(And any kind of disk access should wake a sleeping drive, like running fdisk or parted, or dd'ing a few blocks to /dev/null.) |
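For the record, doing that for all three of sdb's partitions would look something like this (a sketch - double-check which partition belongs to which array against /proc/mdstat first):
Code:
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1
mdadm --manage /dev/md1 --fail /dev/sdb2
mdadm --manage /dev/md1 --remove /dev/sdb2
mdadm --manage /dev/md2 --fail /dev/sdb3
mdadm --manage /dev/md2 --remove /dev/sdb3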
Now something very concerning started happening.
I wanted to install a package using apt-get. I got the following error: Code:
root@lia:~# apt-get install gdisk
Code:
root@lia:~# smartctl -a /dev/sdb
Code:
root@lia:~# smartctl -a /dev/sda
cat /proc/mdstat still shows:
Code:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] |
The faulty drive may be blocking the controller. An emergency reboot may be in order here.
You also need to check the S.M.A.R.T. status of all remaining drives asap. (For instance, are you sure the rebuild failure was caused by a write error on /dev/sdb, and not a read error on /dev/sda?) |
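To expand on that, a quick way to eyeball all of the drives at once (read-only; adjust the device letters if any names have shifted):
Code:
for d in a b c d e f; do
  echo "=== /dev/sd$d ==="
  smartctl -H -A /dev/sd$d | grep -Ei 'overall|pending|reallocated'
done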
Quote:
The other drives:
sdc has 0 pending sectors.
sdd has 24 pending sectors, and shows "Error 244 occurred at disk power-on lifetime: 8689 hours (362 days + 1 hours)"
sde has 0 pending sectors, but also shows "Error 51 occurred at disk power-on lifetime: 8009 hours (333 days + 17 hours)"
sdf has 0 pending sectors.
This spells crisis to me :/ Of the 6 drives, 3 seem to be busted, one on each array - and I have no idea what's going on with sda. |
The reboot command (or the 3-fingered salute) should be used, if possible. Only when that fails should one resort to alternate strategies involving the SysRq key or the power button.
Have you been checking these arrays regularly? I run Code:
echo check > /sys/devices/virtual/block/<md device>/md/sync_action |
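For example, to kick one off on md0 and see whether it finds anything (mismatch_cnt should normally stay at 0 on a healthy mirror):
Code:
echo check > /sys/devices/virtual/block/md0/md/sync_action
cat /proc/mdstat
cat /sys/devices/virtual/block/md0/md/mismatch_cnt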
Quote:
EDIT: Just lost remote connection. Server is still up as it's still routing traffic, but I can't access it via SSH anymore. |
It would have been really great if someone could unplug the drive causing these bus errors, but the problem is we don't know with 100 % certainty that /dev/sdb is the culprit (although it's more than likely). Also, the drives aren't labeled.
Does this server have built-in remote access functionality, or do you have to rely on the OS? Edit: I guess you need the OS, parts of which are probably spewing "oops" messages at the console right now. |
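One way around the missing labels is to match serial numbers: pull the serial for /dev/sdb off the drive itself (read-only) and compare it to the sticker on each physical disk:
Code:
smartctl -i /dev/sdb | grep -i 'serial number'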
Make sure to bring a live CD (like, say, System Rescue CD) in case the system fails to boot. You could even set up an emergency NAT router with a CD/DVD like that.
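(The emergency NAT router is only a couple of commands once the live system is up - a rough sketch, with interface names assumed:)
Code:
# eth0 = WAN, eth1 = LAN - adjust to whatever the live system calls them
echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
iptables -A FORWARD -i eth1 -o eth0 -j ACCEPT
iptables -A FORWARD -i eth0 -o eth1 -m state --state ESTABLISHED,RELATED -j ACCEPT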
|
Sounds like a plan.
|
Also, any idea why we're seeing errors on 3 drives instead of 1 (refer to post #37)? Normally I'd suspect a RAID controller, but this is software raid.
|
Must be the drives. There's no way other hardware or software can make a drive report "pending sectors" via S.M.A.R.T. Media error is the only possibility.
|
Ok, I'm on the premises. I turned off the server (it was hanging with a lot of error messages, like you predicted). I removed sdb (I looked for the serial number on the drive casing, to match the serial number as reported by smartctl on sdb).
Booted up, and it's running now. But here's the really strange thing: Code:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
Code:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
EDIT: Here are the md device details:
Code:
/dev/md0:
Code:
/dev/md1:
Code:
/dev/md2:
Code:
/dev/md3:
Code:
/dev/md4: |
...continued from previous post...
sda:
Code:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Code:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Code:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build) |
...continued from previous post...
sdd:
Code:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build)
Code:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-29-generic] (local build) |
...continued from previous post...
As you can see from all the stats in the above 3 posts, the sdb device doesn't have the original sdb serial number. Seems sdf renamed itself to sdb. Bizarre... |
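(For what it's worth: the sdX names are handed out in detection order at boot, so they can shift when a drive disappears. The persistent links under /dev/disk/by-id are one way to keep track of which physical disk is which:)
Code:
ls -l /dev/disk/by-id/ | grep -v part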