Partitions missing on startup after mdadm snafu
I have a new OpenSuSE 10.3 (18.104.22.168-0.1-default) server that I configured a RAID-1 root partition on.
The second hard drive of the pair turned out to be faulty (SMART errors), so I replaced it, and added the new drive to the array. Unfortunately, instead of telling mdadm to add sdb2, I accidentally gave it sdb.
So, I failed the drive, repartitioned it with fdisk, and added sdb2 into the array. The array rebuilt with no errors and I thought all was well... until the next reboot.
The system booted successfully, but my RAID volume started in degraded mode because the second partition (sdb2) did not exist in /dev. I can get the partitions to appear in /dev by running partprobe, which confuses me: why would they not already be there?
Another person posted the exact same problem in the newbie forum, but the response didn't make much sense to me, and the OP did not indicate if/how his problem was fixed. I would post the link to that thread, but the forum will not let me until I've posted at least once. I'll provide it in a reply.
Could you please help me? Lots of details are below. If you need any others, please ask.
The kernel is seeing the partitions. Excerpt from dmesg:
Link for the other post:
Okay, I'm in much the same boat as you.
The difference is I'm doing raid 6 on 8 drives.
I built the system with 6, everything seemed good, I added two more and I expanded on to those. All seemed good. I don't *think* I stuffed up the last two and used the disks rather than the partitions.
Ah, to recap: two of the partitions fail to show up in /dev.
Running fdisk on the drives shows that the partitions are there, though, marked as Linux RAID and looking just like the other six visible ones.
/proc/partitions does not list the two partitions.
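For anyone wanting to script that check, here is a rough sketch that reports which expected partitions are absent from /proc/partitions. The function name missing_partitions is made up for illustration, and the device names in the usage note are only examples.

```shell
#!/bin/sh
# Hypothetical helper: print each expected partition that is NOT listed
# in a table in /proc/partitions format.
# $1: path to the table (normally /proc/partitions)
# remaining arguments: partition names to look for, e.g. sdg1 sdh1
missing_partitions() {
    table=$1
    shift
    for name in "$@"; do
        # The device name is the fourth column of /proc/partitions.
        if ! awk -v n="$name" '$4 == n { found = 1 } END { exit !found }' "$table"; then
            echo "$name"
        fi
    done
}
```

Running missing_partitions /proc/partitions sdg1 sdh1 right after boot would print whichever of the two the kernel has not registered. If any are missing, running partprobe as root (the workaround from the first post) and repeating the check shows whether they came back.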
Here's some interesting new information.
I was able to work around the failure by messing around with the mdadm.conf.
If I listed the devices as just being the partitions, the above failure was occurring - ie, two partitions go AWOL, mdadm can't start.
If I listed the devices as /dev/sda1, /dev/sdb1, etc., then mdadm would take an awfully long time to start, as would Ubuntu (7.10). When I logged in, mdadm had not started the array and was steadily consuming memory; left unattended, it would soon bring the system to its knees by consuming everything. However, I did have all 8 partitions listed (not including the 9th drive with the system partition/swap/whatever). So if I killed the process and then changed the config back to use partitions for devices, mdadm would happily start. Of course, if I rebooted, I was back to the starting scenario, where it would fail to start because the two partitions were AWOL.
So my workaround was to boot in configuration 2, kill mdadm, change the config to configuration 1, restart mdadm, and then change the config back to configuration 2 for the next reboot.
So, that was my workaround, but I wasn't overly happy about it.
So recently (about now), after several months of this, I decided to get more hardcore about finding out what was going on. Looking in the syslog, I had a lot of messages about array md1 already having disks. A hell of a lot. Eventually it would stop and it would start doing some sort of mdadm unbind.
I'm a little unsure as to whether these last two partitions are somehow tagged a little differently than the others, or whether mdadm was getting enthused about starting the array a little too early.
Currently I think I've done *something* (sigh) because it's not quite acting the same and I'm actually having a little trouble starting the array at all.
Just a moment ago I had it starting up degraded (minus the two disks - it is raid 6) but now I might have wandered a little, for it's not even doing that, but complaining about a missing md1 superblock. Oh, and all my devices seem to have shuffled along a bit (mdadm -E /dev/sda1 tells me that it's sdb1, with /dev/sdf1 being blank).
I never thought about partprobing. Thanks.
So by partprobing and getting the sdg1 and sdh1 back into /dev and partitions, I am able to start mdadm fine with 8 disks, no apparent issues...
But of course the issue is still not resolved, I've just got a workaround again for missing partitions on boot!
Long story short, with the sort of spontaneous 'inspiration' that I've had, combined with lack of knowledge, interesting snippets on random forums, and bits of "it's telling me I'm wrong but I'm pretty sure I'm right", I'm actually a little surprised I haven't lost the entire array yet!
It's probably time to stop tempting fate and get a little more methodical.
Incidentally, one time I logged in fast and dropped to a terminal and saw my two partitions there when I wasn't expecting them there ... and shortly after they were removed. Hmmmmm.
I have the same issue: /dev/sdb1 and /dev/sdb2 disappear, on a RAID-1 array (md0). It started when I accidentally added /dev/sdb to /dev/md0 instead of /dev/sdb1 (mdadm --manage /dev/md0 --add /dev/sdb instead of --add /dev/sdb1).
dmesg (the kernel log) showed:
md: md0 stopped.
md: could not open unknown-block(8,17).
md: md_import_device returned -6
md: kicking non-fresh sdb from array!
The only thing that helped was erasing the RAID superblock on /dev/sdb (mdadm --misc --zero-superblock /dev/sdb), rebooting, recreating the partitions, and adding /dev/sdb1 to the array again.
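The steps above can be written out roughly as follows. This is only a sketch: the function print_recovery_steps is made up for illustration, and it prints the commands rather than running them, since zeroing a superblock on the wrong device is destructive. The --fail and --remove lines are an assumption on my part and only apply if the whole disk is still listed as a member of the array.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the recovery sequence.
# $1: array, e.g. /dev/md0
# $2: whole disk that was added by mistake, e.g. /dev/sdb
# $3: partition that should be the array member, e.g. /dev/sdb1
print_recovery_steps() {
    array=$1
    disk=$2
    part=$3
    # Assumption: only needed if the whole disk is still in the array.
    echo "mdadm --manage $array --fail $disk"
    echo "mdadm --manage $array --remove $disk"
    # Erase the stale RAID superblock from the whole disk.
    echo "mdadm --misc --zero-superblock $disk"
    echo "# reboot, then recreate the partitions on $disk (e.g. with fdisk)"
    # Add the partition, not the whole disk, back to the array.
    echo "mdadm --manage $array --add $part"
}
```

Calling print_recovery_steps /dev/md0 /dev/sdb /dev/sdb1 prints the commands for review. Check every device name against your own system before running any of them as root.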
I also solved mine the same way - I zeroed the superblock from the device (well, two devices, since I did it to two on a raid 6 array).