Old 06-05-2015, 05:11 PM   #1
alfino
LQ Newbie
 
Registered: Dec 2014
Posts: 10

Rep: Reputation: Disabled
Can an md RAID-1 device auto-rebuild?


Hi,

md RAID-1 on CentOS 6.6:
md0 /boot (200MB), md1 swap (8GB), md2 / (100GB)

To test the mirror's "bootability" with one drive missing, I powered off, removed one drive, powered on and booted, powered off again, re-connected that drive, and powered back up.

On the first boot after reconnecting the drive, the swap array (md1) is automagically rebuilt and active (up/running), BUT md0 (/boot) and md2 (/) are NOT.

How can I get md0 and md2 to do the same?

Last edited by alfino; 06-05-2015 at 05:48 PM.
 
Old 06-05-2015, 06:16 PM   #2
joec@home
Member
 
Registered: Sep 2009
Location: Galveston Tx
Posts: 291

Rep: Reputation: 70
Swap is nothing but one huge temporary file space that gets deleted when the server boots. Not much to rebuild after that.
 
Old 06-05-2015, 07:54 PM   #3
Fred Caro
Senior Member
 
Registered: May 2007
Posts: 1,007

Rep: Reputation: 167Reputation: 167
I don't know what version of mdadm CentOS 6.6 uses, but if you are putting /boot on a separate partition it might be an old one.
If you are booting from the RAID disk, what is the output of
Quote:
cat /proc/mdstat
As I understand it, at minimum you need to --add the reintroduced drive with mdadm, and GRUB needs to be installed on both drives (assuming there are only two).

Fred.
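
A minimal sketch of the --add Fred describes, assuming the temporarily removed disk came back as /dev/sda and md0/md2 show up degraded (device and partition names are purely illustrative):
Code:
# See which arrays are degraded: [U_] or [_U] means a member is missing
cat /proc/mdstat

# Re-add the returned disk's partitions to the degraded arrays
mdadm /dev/md0 --add /dev/sda1
mdadm /dev/md2 --add /dev/sda3
md then resyncs in the background; progress is visible in /proc/mdstat.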
 
Old 06-05-2015, 08:04 PM   #4
frostschutz
Member
 
Registered: Apr 2004
Distribution: Gentoo
Posts: 95

Rep: Reputation: 28
If nothing was ever written to the swap, it may be that its members were still considered clean and in sync despite one disk being temporarily removed.
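
One hedged way to check this is to compare the event counters in each member's md superblock; matching counts mean md still considers them in sync (device names here are illustrative):
Code:
mdadm --examine /dev/sda2 | grep -E 'Events|State'
mdadm --examine /dev/sdb2 | grep -E 'Events|State'

# Or look at the whole array
mdadm --detail /dev/md1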
 
Old 06-06-2015, 01:50 PM   #5
alfino
LQ Newbie
 
Registered: Dec 2014
Posts: 10

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Fred Caro View Post
...at minimum you need to --add the reintroduced drive with mdadm, and GRUB needs to be installed on both drives (assuming there are only two).
During the CentOS 6.6 install, I set up the partitions using the "custom" option and created the three arrays that way (create RAID partitions, then a RAID device, repeat). What's the issue with my partitioning scheme?

When asked where to put the boot loader, I chose /dev/md0 rather than /dev/sda or /dev/sdb because I wanted resiliency if one drive died. My tests booting with one drive disconnected (I tried both) worked flawlessly, except for the issue in the OP (if it even is one).
Code:
$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[2] sdb1[3]
      205760 blocks super 1.0 [2/2] [UU]

md1 : active raid1 sdb2[1] sda2[2]
      8389632 blocks super 1.1 [2/2] [UU]

md2 : active raid1 sdb3[1] sda3[2]
      102401024 blocks super 1.1 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>

$ mdadm --version
mdadm - v3.3 - 3rd September 2013
$
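
Side note on the output above: md2 carries an internal write-intent bitmap ("bitmap: 1/1 pages") while md0 and md1 do not; with a bitmap, md can resync only the regions that changed while a member was absent instead of rebuilding the whole mirror. If the other arrays should behave the same way, a sketch (an illustration, not a prescription):
Code:
mdadm --grow /dev/md0 --bitmap=internal
mdadm --grow /dev/md1 --bitmap=internal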
More background:
My initial attempt used Intel Rapid Storage (IRST) in the BIOS. That worked well, and so did the "disconnect-a-drive" test. It even behaved better than plain md: after I added the "disconnected" disk back to the array in the BIOS, all the md arrays were automagically rebuilt once Linux booted, without *any* intervention on my part.

Why then the switch to md? After the rebuild of the array, the system wouldn't boot, so I dropped IRST. To be fair, the BIOS was set to use *only* UEFI at the time, and my incomplete understanding of that may have contributed to the bad outcome (I was also using CentOS 7 then). I switched the BIOS back to the traditional legacy "Auto" CSM + UEFI mode and installed from the 6.6 boot CD using its "Legacy" boot loader rather than the UEFI one. Besides, the consensus is that IRST is no match for 100% Linux md.

Back to the topic at hand:
An ancillary issue appeared. Altering the "disconnect" test, I inserted a brand new (identical) sda drive instead of just reconnecting the original, then ran:
  • sfdisk -d /dev/sdb | sfdisk /dev/sda
  • mdadm --add
That's it. The array rebuilt and the system reboots fine. But booting off the new disk *alone* failed. "grub-install /dev/md0" fixed that, but subsequently removing sda gave a "Hard Disk Error" when booting from sdb (the original disk). In other words, the system now *only* boots from the "new" disk, despite both disks being cleanly part of the array.
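
For what it's worth, a read-only way to check which disks actually carry GRUB boot code in their MBR (GRUB legacy's stage1 embeds the string "GRUB" along with its error messages); the exact commands are just an illustration:
Code:
dd if=/dev/sda bs=512 count=1 2>/dev/null | strings
dd if=/dev/sdb bs=512 count=1 2>/dev/null | strings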

The solution was to go into the grub CLI and run "setup" on each of the drives (which I think is different from using just md0); after that, everything worked again, no matter which drive I left in the machine by itself.
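
For reference, the GRUB legacy (0.97, as shipped with CentOS 6) shell sequence for that per-drive "setup" usually looks something like the following; the (hd0,0) mapping assumes /boot is the first partition, as in the layout above:
Code:
grub> device (hd0) /dev/sda
grub> root (hd0,0)
grub> setup (hd0)

grub> device (hd0) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
grub> quit
Mapping each disk to (hd0) in turn means the boot code embedded on each drive refers to the disk it lives on, so either drive can boot the system by itself, which matches what was observed.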

(I suspect it has something to do with the third UUID introduced by the "new" disk I added in for the test.)

So there are really two issues, but I wanted to start with the one detailed in my OP and tackle that first.
 
Old 06-08-2015, 03:54 PM   #6
Fred Caro
Senior Member
 
Registered: May 2007
Posts: 1,007

Rep: Reputation: 167Reputation: 167
Alfino, I am fairly new to RAID myself, but it seems the grub CLI was able to write to each hard drive's MBR, which is what the BIOS reads when the machine boots; hence the "Hard Disk Error" before you corrected things with the grub CLI.

In an msdos partitioning setup, the BIOS needs to read the MBR to reach an up-to-date grub config file in /boot/grub. Thus grub has to be installed on every HDD that is required to boot the system on its own.

Anyone is quite free to correct me if I'm wrong!

Fred.
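
A sketch of "grub installed on every HDD" on CentOS 6's GRUB legacy, assuming the two-disk mirror from this thread (an alternative to the interactive grub-shell "setup" shown earlier; both end up writing stage1 to each disk's MBR):
Code:
# Put GRUB's boot code in the MBR of each disk so either one can boot alone
grub-install /dev/sda
grub-install /dev/sdb

# The device map GRUB legacy uses for (hd0), (hd1), ...
cat /boot/grub/device.map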
 
Old 06-08-2015, 07:11 PM   #7
alfino
LQ Newbie
 
Registered: Dec 2014
Posts: 10

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Fred Caro View Post
Alfino, I am fairly new to RAID myself, but it seems the grub CLI was able to write to each hard drive's MBR, which is what the BIOS reads when the machine boots; hence the "Hard Disk Error" before you corrected things with the grub CLI.

In an msdos partitioning setup, the BIOS needs to read the MBR to reach an up-to-date grub config file in /boot/grub. Thus grub has to be installed on every HDD that is required to boot the system on its own.
Recall though that I built the system with grub installed to:
  • /dev/md0
rather than to /dev/sda, /dev/sdb, or both.

And with grub set up that way on /dev/md0, I was able to boot off of either drive.


The problem only occurred *after* I completely replaced one of the drives (as a test), re-synced the array, and issued:
  • grub-install /dev/md0

to set grub back up correctly (at least, that's what I thought would happen).

That is when I ran into the secondary "Hard Disk Error" issue I described above.
 
  

