mdadm error replacing a failed disk
Hi Helpful Ubuntu Folks,
I have a four-disk mdadm RAID5 on which one of the disks failed. The disk died completely and was not visible to the OS. I powered down, took the disk out, restarted, and left the array unstarted. I then bought a new disk and I'm now trying to add it to the array and get the array back on its feet. Here is what the "detail" command tells me:

Code:
excession# mdadm -D /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Sun May 29 18:32:22 2011
     Raid Level : raid5
  Used Dev Size : 1465137152 (1397.26 GiB 1500.30 GB)
   Raid Devices : 4
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Fri Oct 21 22:26:06 2011
          State : active, degraded, Not Started
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : excession:0  (local to host excession)
           UUID : 557a3ef8:14251ba9:d4aa50ac:bd9b7d5c
         Events : 136609

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       16        1      active sync   /dev/sdb
       2       0        0        2      removed
       4       8       48        3      active sync   /dev/sdd

My new disk has been detected by the OS and assigned /dev/sdc, the same designation as the old failed disk. When I try to add the new disk I get a really unhelpful error message:

Code:
excession# mdadm --manage /dev/md127 --add /dev/sdc
mdadm: add new device failed for /dev/sdc as 5: Invalid argument

and when I try to run the array, I get an equally unhelpful message:

Code:
excession# mdadm --run /dev/md127
mdadm: failed to run array /dev/md127: Input/output error

Can someone please help me understand what mdadm is doing and how to get my array back? I'm really reluctant to issue commands against the thing (such as attempting a new create) without understanding why it is not working and what the two errors indicate.

Thanks,
Saline
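PS: if it helps with diagnosis, I can post what mdadm --examine reports for each surviving member -- as I understand it, that reads the on-disk superblocks directly, so comparing the Events and Array State lines should show whether the members agree with each other (device names as per the -D output above):

Code:
excession# for d in /dev/sda /dev/sdb /dev/sdd; do
>   echo "== $d"
>   mdadm --examine $d | grep -E 'Events|Array State'
> done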
Normally you'd mark the dead drive as failed and remove it from the array (mdadm --fail, then mdadm --remove) before powering down and pulling it; skipping that step leaves the array metadata out of step with reality, which is roughly where you are now. Have a look at what the kernel currently knows about the array:
Code:
ls /sys/block/md127/md/
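The dev-* entries in there show which members the kernel has actually bound to the array. On a healthy array here it looks roughly like this (the exact set of files varies with kernel version):

Code:
# ls /sys/block/md127/md/
array_state  component_size  dev-sda  dev-sdd  level             raid_disks
chunk_size   degraded        dev-sdb  layout   metadata_version  sync_action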
Thanks smallpond, that explains how it got into this mess. I'm not sure that I could have removed the drive before powering down, though. When the drive failed, it failed totally: the OS (or, I guess, udev) was not even creating an entry under /dev for it. Thinking back on it, the drive may actually have failed during a reboot.
I have made some progress, I think. I have managed to get device #2 marked as failed and then removed, and my new drive added as a spare:

Code:
excession# mdadm -D /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Sun May 29 18:32:22 2011
     Raid Level : raid5
  Used Dev Size : 1465137152 (1397.26 GiB 1500.30 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Fri Oct 21 22:26:06 2011
          State : active, degraded, Not Started
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

           Name : excession:0  (local to host excession)
           UUID : 557a3ef8:14251ba9:d4aa50ac:bd9b7d5c
         Events : 136609

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       16        1      active sync   /dev/sdb
       2       0        0        2      removed
       4       8       48        3      active sync   /dev/sdd

       5       8       32        -      spare   /dev/sdc

However, when I try to run the array I get:

Code:
excession# mdadm --run /dev/md127
mdadm: failed to run array /dev/md127: Input/output error

and in /var/log/syslog I see:

Code:
Oct 24 21:53:55 excession kernel: [177044.777539] md/raid:md127: not clean -- starting background reconstruction
Oct 24 21:53:55 excession kernel: [177044.777561] md/raid:md127: device sdd operational as raid disk 3
Oct 24 21:53:55 excession kernel: [177044.777566] md/raid:md127: device sdb operational as raid disk 1
Oct 24 21:53:55 excession kernel: [177044.777570] md/raid:md127: device sda operational as raid disk 0
Oct 24 21:53:55 excession kernel: [177044.778758] md/raid:md127: allocated 4282kB
Oct 24 21:53:55 excession kernel: [177044.778827] md/raid:md127: cannot start dirty degraded array.
Oct 24 21:53:55 excession kernel: [177044.778843] RAID conf printout:
Oct 24 21:53:55 excession kernel: [177044.778847]  --- level:5 rd:4 wd:3
Oct 24 21:53:55 excession kernel: [177044.778851]  disk 0, o:1, dev:sda
Oct 24 21:53:55 excession kernel: [177044.778853]  disk 1, o:1, dev:sdb
Oct 24 21:53:55 excession kernel: [177044.778856]  disk 3, o:1, dev:sdd
Oct 24 21:53:55 excession kernel: [177044.779322] md/raid:md127: failed to run raid set.
Oct 24 21:53:55 excession kernel: [177044.779328] md: pers->run() failed ...

I don't understand why run is not working. As I understand it, mdadm --run should start the array even if one disk is missing; there are still 3 of the original 4 disks present, so it should be able to run the array in a degraded state. I would expect it to start up and then begin rebuilding the data onto the spare. The log message implies that it is "starting background reconstruction", but it is not clear to me whether this is actually happening. /proc/mdstat doesn't show anything in progress:

Code:
excession# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md127 : inactive sdd[4] sdb[1] sda[0] sdc[5](S)
      6837302240 blocks super 1.2

unused devices: <none>

Any more ideas on how I get this thing started?

Thanks,
Saline
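One thing worth trying before anything more drastic: stop the array and let mdadm put it back together with --force, which tells it to ignore the dirty flag on the surviving members. This is only a sketch, with the device names taken from your mdstat output, so double-check it against your setup first:

Code:
excession# mdadm --stop /dev/md127
excession# mdadm --assemble --force /dev/md127 /dev/sda /dev/sdb /dev/sdd
excession# mdadm --manage /dev/md127 --add /dev/sdc

There is also a kernel parameter for exactly this case (md-mod.start_dirty_degraded=1 on the boot command line), but a forced assemble is usually the less invasive route.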
I've always built md arrays out of partitions rather than whole disks, so I went looking for pros and cons and found this:
http://www.nber.org/sys-admin/linux-nas-raid.html -- see the section "Can you use whole drive partitions?" So it's better to use sda1, sdb1, sdc1, etc. You can check the number of blocks with:

Code:
cat /proc/partitions
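For example, partitioning a new drive for md use might look like this -- the sizes are only illustrative, and the point of stopping 100MiB short of the end is that a future "same size" replacement that is a few megabytes smaller will still fit:

Code:
# parted -s /dev/sdc mklabel gpt
# parted -s /dev/sdc mkpart primary 1MiB -100MiB
# parted -s /dev/sdc set 1 raid on
# mdadm --manage /dev/md127 --add /dev/sdc1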
That's not the problem. The old disk was 1.5TB, the new one is 2TB.
Any more tips on how to interpret the unhelpful log messages? :-/

---
Saline
Looks like the issue is "cannot start dirty degraded array". Looks more like a moral problem, doesn't it? :)

Anyway, I think the array is not in a consistent state. A reboot without a proper shutdown can cause this. You can try:

Code:
cat /sys/block/md127/md/array_state
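and if it reports something other than clean, you can try writing it clean by hand -- the kernel only accepts this write in certain states, so it may refuse:

Code:
echo clean > /sys/block/md127/md/array_state

The neighbouring degraded and sync_action files are worth a look too (assuming your kernel exposes them for this array); the first counts missing members and the second shows whether any rebuild is actually running:

Code:
cat /sys/block/md127/md/degraded
cat /sys/block/md127/md/sync_action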
My morals are fine. My morale, on the other hand, is definitely degraded. :-/
This is the second time I have had a disk fail while using mdadm, and the second time the system has completely let me down. What's the point of running a RAID array if every time a single disk fails it kills the whole array? I really cannot see what I did wrong, or what I could do differently if the situation were to happen again.

The data on the disks is not irreplaceable -- everything that was on there is backed up in some form or another, or is stuff I'm not going to be too heartbroken about losing. It is, though, a massive pain in the arse to scrap the whole array and start again. Especially given that my confidence in Linux software RAID is destroyed and I would need to figure out another way of using the disks -- LVM, perhaps. At least with LVM, a single disk failure only loses the data on that disk, not the whole array.

Is mdadm really that unreliable? The link you posted before implies that when a disk fails, if there is a single bad sector on _any_ of the remaining disks then the array is toast. Really? Is that what has bitten me -- twice?

I would really love to be able to recover some data from this thing, even if some of it is trashed. I tried what you suggested:

Code:
excession# cat /sys/block/md127/md/array_state
inactive
excession# echo clean > /sys/block/md127/md/array_state
echo: write error: invalid argument

I'm getting _really_ tired of that "invalid argument" message. Vim's error message when trying to write the file is:

Code:
"/sys/devices/virtual/block/md127/md/array_state" E667: Fsync failed

Any more ideas?
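If you do end up going the re-create route, record everything mdadm currently knows about the layout first, so that a later --create can use the same device order and chunk size. Something like this (the output paths are just examples):

Code:
excession# mdadm --detail /dev/md127 > /root/md127-detail.txt
excession# mdadm --examine /dev/sda /dev/sdb /dev/sdd > /root/md127-examine.txt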
Success!
Another bout of random stabbing at the keyboard finally led to an unexpected success! I just gave up on trying to get /dev/md127 into a working state and tried creating a new array from the same disks, using the --assume-clean flag so as not to kill the existing data. This is the flag about which man mdadm says, "Use this only if you really know what you are doing." Well, I really didn't know what I was doing, but I was out of options.

First I stopped /dev/md127 and released the component disks:

Code:
excession# mdadm -S /dev/md127

Then I created a new array over the same disks, keeping the original order and putting "missing" in place of the dead slot:

Code:
excession# mdadm --create /dev/md0 --assume-clean --level=5 --verbose --raid-devices=4 /dev/sda /dev/sdb missing /dev/sdd

The new array started, degraded but with my data intact. Then I added the new disk:

Code:
excession# mdadm --add /dev/md0 /dev/sdc

and watched the rebuild begin:

Code:
excession# cat /proc/mdstat

Still, I am happy. The array is finally alive again, my data is recovered and, best of all, I don't have to argue with any more invalids!

Thanks for all your help, smallpond.
---
Saline
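PS, for anyone finding this later: the --create trick above only worked because the new array used the same device order and chunk size as the original, so record those before attempting it. And before trusting the result, it's worth checking the filesystem read-only and saving the new array definition so it assembles at boot. Roughly this, assuming an ext filesystem sitting directly on the array and the default Ubuntu config path:

Code:
excession# fsck.ext4 -n /dev/md0
excession# mdadm --detail --scan >> /etc/mdadm/mdadm.conf
excession# update-initramfs -u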
Alternative
Had a similar situation:
Code:
# mdadm -D /dev/md3

Code:
mdadm -S /dev/md3