LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Hardware (http://www.linuxquestions.org/questions/linux-hardware-18/)
-   -   Partitions missing on startup after mdadm snafu (http://www.linuxquestions.org/questions/linux-hardware-18/partitions-missing-on-startup-after-mdadm-snafu-646547/)

KingPong 06-02-2008 09:08 PM

Partitions missing on startup after mdadm snafu
 
Hello,

I have a new OpenSuSE 10.3 (2.6.22.17-0.1-default) server that I configured a RAID-1 root partition on.

The second hard drive of the pair turned out to be faulty (SMART errors), so I replaced it, and added the new drive to the array. Unfortunately, instead of telling mdadm to add sdb2, I accidentally gave it sdb.

So, I failed the drive, repartitioned it with fdisk, and added sdb2 into the array. The array rebuilt with no errors and I thought all was well... until the next reboot.
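For reference, the replacement steps above in command form (a sketch only, shown dry-run so nothing executes by accident: the run() wrapper just echoes each command, so remove the echo to run them for real, and double-check device names against your own system):

```shell
# Dry-run wrapper: prints each command instead of executing it.
run() { echo "+ $*"; }

run mdadm /dev/md0 --fail /dev/sdb      # mark the wrongly-added whole disk as failed
run mdadm /dev/md0 --remove /dev/sdb    # remove it from the array
run fdisk /dev/sdb                      # repartition (type fd, Linux raid autodetect)
run mdadm /dev/md0 --add /dev/sdb2      # add the partition this time, not the whole disk
run cat /proc/mdstat                    # watch the rebuild progress
```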

The system booted successfully, but my RAID volume started in degraded mode because the second partition (sdb2) did not exist in /dev. I am able to get the partitions into /dev using partprobe, which is confusing. Why would the partitions not already be there?

Another person posted the exact same problem in the newbie forum, but the response didn't make much sense to me, and the OP did not indicate if/how his problem was fixed. I would post the link to that thread, but the forum will not let me until I've posted at least once. I'll provide it in a reply.

Could you please help me? Lots of details are below. If you need any others, please ask.

The kernel is seeing the partitions. Excerpt from dmesg:
Code:

sd 1:0:0:0: [sdb] 976773168 512-byte hardware sectors (500108 MB)
sd 1:0:0:0: [sdb] Write Protect is off
sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdb: sdb1 sdb2
sd 1:0:0:0: [sdb] Attached SCSI disk

fdisk also sees the partitions:
Code:

# fdisk -l /dev/sdb

Disk /dev/sdb: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x0005e7eb

  Device Boot      Start        End      Blocks  Id  System
/dev/sdb1              1          26      208813+  83  Linux
/dev/sdb2              27      60801  488175187+  fd  Linux raid autodetect

But for some reason, the device nodes are not being created:
Code:

# ls -l /dev/sdb*
brw-r----- 1 root disk 8, 16 May 31 16:28 /dev/sdb

# grep sdb /proc/partitions
  8    16  488386584 sdb

And mdadm can't start the array properly (from boot.msg):
Code:

Creating device nodes with udev
mdadm: failed to add /dev/sdb2 to /dev/md0: No such device or address
mdadm: /dev/md0 has been started with 1 drive (out of 2).

If I use partprobe to refresh the partition table on the live system, I can then add sdb2 to my RAID array and it will resync. This is good until the next reboot, at which point the partition is missing again, and I have to resync all over again.

Code:

# partprobe
# mdadm /dev/md0 -a /dev/sdb2
mdadm: re-added /dev/sdb2
# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [linear]
md0 : active raid1 sdb2[2] sda2[0]
      488175048 blocks super 1.0 [2/1] [U_]
      [>....................]  recovery =  0.1% (621824/488175048) finish=156.7min speed=51818K/sec
      bitmap: 19/466 pages [76KB], 512KB chunk

unused devices: <none>

lspci:
Code:

# lspci
00:00.0 Host bridge: Intel Corporation 82Q35 Express DRAM Controller (rev 02)
00:02.0 VGA compatible controller: Intel Corporation 82Q35 Express Integrated Graphics Controller (rev 02)
00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 02)
00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 02)
00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 02)
00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 02)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 02)
00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 02)
00:1c.4 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 5 (rev 02)
00:1c.5 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 6 (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
00:1f.0 ISA bridge: Intel Corporation 82801IR (ICH9R) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA AHCI Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
00:1f.6 Signal processing controller: Intel Corporation 82801I (ICH9 Family) Thermal Subsystem (rev 02)
0d:00.0 Ethernet controller: Intel Corporation 82573V Gigabit Ethernet Controller (Copper) (rev 03)
0f:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
11:08.0 IDE interface: Integrated Technology Express, Inc. Unknown device 8213
11:0b.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
11:0c.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)


KingPong 06-02-2008 09:09 PM

Link for the other post:
http://www.linuxquestions.org/questi...nd-dev-622612/

willow_of_oz 06-15-2008 09:57 AM

Okay, I'm in much the same boat as you.
The difference is I'm doing raid 6 on 8 drives.
I built the array with 6 drives and everything seemed good; then I added two more and grew the array onto them. All seemed good. I don't *think* I stuffed up on the last two and used the whole disks rather than the partitions.

To recap: two of the partitions are not showing up in /dev.
Running fdisk on the drives shows that the partitions are there, though, looking all Linux-raid-like and just like the other six visible ones.
/proc/partitions does not list the two partitions.

Here's some interesting new information.
I was able to work around the failure by messing around with the mdadm.conf.
If I listed the devices as just being "partitions", the failure above occurred: the two partitions go AWOL and mdadm can't start the array.
If I listed the devices explicitly as /dev/sda1, /dev/sdb1, etc., then mdadm would take an awfully long time to start, as would Ubuntu (7.10). When I logged in, mdadm had not started the array and was steadily consuming memory; left unattended it would shortly bring the system to its knees by consuming everything. However, I did have all 8 partitions listed (not counting the 9th drive with the system partition and swap). So if I killed the process and changed the config back to use partitions for the devices, mdadm could start happily. Of course, after a reboot I was back to the starting scenario, failing to start because the two partitions were AWOL. So my workaround was: boot with the explicit device list, kill mdadm, switch the config to partitions, restart mdadm, then switch the config back to the explicit list for the next reboot.
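For clarity, the two mdadm.conf variants being switched between would look roughly like this (a guess at the shape of the config from the description above, not the poster's actual file; device names as in this thread):

```
# Variant 1: let mdadm scan every block device listed in /proc/partitions
DEVICE partitions

# Variant 2: name the member partitions explicitly
DEVICE /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1
```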

So, that was my workaround, but I wasn't overly happy about it.
So recently (about now), after several months of this, I decided to get more hardcore about finding out what was going on. Looking in the syslog, I had a lot of messages about array md1 already having disks. A hell of a lot. Eventually it would stop and it would start doing some sort of mdadm unbind.
I'm a little unsure as to whether these last two partitions are somehow tagged a little differently than the others, or whether mdadm was getting enthused about starting the array a little too early.
Currently I think I've done *something* (sigh), because it's not quite acting the same and I'm actually having a little trouble starting the array at all.
Just a moment ago I had it starting up degraded (minus the two disks; it is RAID 6), but now it's not even doing that: it complains about a missing md1 superblock instead. Oh, and all my devices seem to have shuffled along a bit (mdadm -E /dev/sda1 tells me that it's sdb1, with /dev/sdf1 being blank).

I never thought about partprobing. Thanks.
So by running partprobe and getting sdg1 and sdh1 back into /dev and /proc/partitions, I am able to start mdadm fine with all 8 disks, no apparent issues...
But of course the issue is still not resolved, I've just got a workaround again for missing partitions on boot!

willow_of_oz 06-15-2008 10:15 AM

Long story short, with the sort of spontaneous 'inspiration' that I've had, combined with lack of knowledge, interesting snippets on random forums, and bits of "it's telling me I'm wrong but I'm pretty sure I'm right", I'm actually a little surprised I haven't lost the entire array yet!
It's probably time to stop tempting fate and get a little more methodical.

Incidentally, one time I logged in fast and dropped to a terminal and saw my two partitions there when I wasn't expecting them there ... and shortly after they were removed. Hmmmmm.

festr 01-30-2009 03:03 PM

I have the same issue with /dev/sdb1 and /dev/sdb2 disappearing, in combination with a RAID-1 md0. It started when I accidentally added /dev/sdb to /dev/md0 instead of /dev/sdb1 (mdadm --manage /dev/md0 --add /dev/sdb instead of --add /dev/sdb1).

In dmesg (kernel log):
Code:
md: md0 stopped.
md: bind<sdb>
md: could not open unknown-block(8,17).
md: md_import_device returned -6
md: bind<sda1>
md: kicking non-fresh sdb from array!
md: unbind<sdb>
md: export_rdev(sdb)

The only thing that helped was erasing the RAID superblock on /dev/sdb (mdadm --misc --zero-superblock /dev/sdb), then rebooting, recreating the partitions, and adding /dev/sdb1 back to the array.
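In command form, that fix is roughly the following (a sketch shown dry-run via an echoing run() wrapper; remove the echo to execute for real, and be absolutely sure of the device name, since --zero-superblock is destructive):

```shell
# Dry-run wrapper: prints each command instead of executing it.
run() { echo "+ $*"; }

run mdadm --misc --zero-superblock /dev/sdb   # wipe the stale md superblock on the whole disk
run reboot                                    # reboot so the kernel re-reads the disk cleanly
run fdisk /dev/sdb                            # after reboot: recreate the partitions
run mdadm --manage /dev/md0 --add /dev/sdb1   # add the partition back to the array
```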

KingPong 01-30-2009 09:54 PM

Quote:

Originally Posted by festr (Post 3426651)
The only thing that helped was erasing the RAID superblock on /dev/sdb (mdadm --misc --zero-superblock /dev/sdb), then rebooting, recreating the partitions, and adding /dev/sdb1 back to the array.

Do you mean that this fixed the problem for you? I "solved" my problem by swapping my sdb with a spare drive.

willow_of_oz 01-31-2009 04:06 AM

Quote:

Originally Posted by KingPong (Post 3426896)
Do you mean that this fixed the problem for you? I "solved" my problem by swapping my sdb with a spare drive.

I had the same issue.
I also solved mine the same way - I zeroed the superblock from the device (well, two devices, since I did it to two on a raid 6 array).


All times are GMT -5.