[SOLVED] How can I coax RHEL/CentOS/SL 7 into booting normally with degraded software RAID?
Distribution: Scientific Linux 6.1, Scientific Linux 7, Ubuntu, Parted Magic; formerly Mandrake 10.1, etc.
Posts: 7
I set up a new server (my first with this version of Linux). I installed a pair of 160 GB blank SATA HDDs (one Seagate and one WDC, but with exactly the same number of LBA sectors) in an old machine, and set out to install Scientific Linux 7.0 (rebranded RHEL) in a RAID 1 (software mirrored) configuration.
The first hiccup was that I couldn't figure out how to get the SL/RHEL installer (Anaconda) to set up the two drives for RAID1. So I booted from a PartedMagic CD and used it to do the partitioning.
I partitioned the two drives identically. Each drive has a big partition for RAID1+ext4 to be mounted at /, a small (currently unused) partition for RAID1+ext3 to be mounted at /safe, and a 3GB Linux Swap partition. I used fdisk to change the types of the RAID partitions on each drive to FD, and mdadm to build the RAID arrays:
mdadm --create --verbose /dev/md0 --raid-devices=2 --level=1 /dev/sda1 /dev/sdb1
mdadm --create --verbose /dev/md1 --raid-devices=2 --level=1 /dev/sda2 /dev/sdb2
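If it's useful to anyone repeating this, the assembly can be sanity-checked before rebooting into the installer; these are generic mdadm commands, nothing specific to my box:
cat /proc/mdstat            # both arrays should show state [UU] once the initial sync finishes
mdadm --detail /dev/md0     # member devices, sync progress, and array state
mdadm --detail /dev/md1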
Then I shut down, booted the SL DVD, and tried the install again. This time the installer recognized the RAID1 arrays, formatted them for ext4 & ext3, respectively, and installed smoothly.
At this point, everything seemed okay. I shut it down, started it again, and it booted fine. So far so good.
So then I tested the RAID1 functionality: I shut down the computer, removed one of the drives, and tried to boot it. I was expecting it to display some error messages about the RAID array being degraded, and then come up to the normal login screen. But it didn't work. Instead I got:
Welcome to emergency mode! After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" to try again
to boot into default mode.
Give root password for maintenance
(or type Control-D to continue):
The same thing happens regardless of which drive is missing.
That's no good! The purpose of the mirrored drives is to ensure that the server will keep on running if one of the drives fails.
Ctrl-D just gets me back to a repeat of the same "Welcome to emergency mode" screen. So does entering my root password and then "systemctl default".
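(Aside, for anyone else who lands here: from that emergency shell, a few generic commands will show which unit actually gave up; nothing below is specific to my setup.)
journalctl -xb | less                    # full log of the current boot attempt
systemctl list-units --state=failed      # which mount/device units failed
cat /proc/mdstat                         # whether the degraded array was assembled at all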
So then I tried an experiment. At the boot menu I pressed "e" to edit the kernel boot parameters, and changed "rhgb quiet" to "bootdegraded=true" and then booted. No joy.
That let me see more status messages flying by, but it didn't enable the machine to boot normally with a drive missing; it still stopped at the same "Welcome to emergency mode" screen. Here is what I saw with the Seagate drive removed and the WDC drive remaining; the last few lines looked like this ("...." marks where I got tired of typing):
[ OK ] Started Activation of DM RAID sets.
[ OK ] Reached target Encrypted Volumes.
[ 14.855860] md: bind<sda2>
[ OK ] Found device WDC_WD1600BEVT-00A23T0.
Activating swap /dev/disk/by-uuid/add41844....
[ 15.190432] Adding 3144700k swap on /dev/sda3. Priority:-1 extents:1 across:3144700k FS
[ OK ] Activated swap /dev/disk/by-uuid/add41844....
[ TIME ] Timed out waiting for device dev-disk-by\x2duuid-a65962d\x2dbf07....
[DEPEND] Dependency failed for /safe.
[DEPEND] Dependency failed for Local File Systems.
[DEPEND] Dependency failed for Mark the need to relabel after reboot.
[DEPEND] Dependency failed for Relabel all file systems, if necessary.
[ 99.299068] systemd-journald[452]: Received request to flush runtime journal from PID 1
[ 99.3298059] type=1305 audit(1415512815.286:4): audit_pid=588 old=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1
Welcome to emergency mode! After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to reboot, "systemctl default" to try again
to boot into default mode.
Give root password for maintenance
(or type Control-D to continue):
So it appears that installing on RAID1 mirrored drives just doubles the chance of a drive failure bringing down the server (since there are two drives instead of one). That is not what I was hoping to achieve with mirrored drives.
Does anyone know how to make it boot & run "normally" (with a degraded RAID1 array) when a hard disk drive fails?
Two other notes:
1. I'm new to RHEL/SL/CentOS 7, so at the "Software Selection" screen, during the SL installation, I had to do some guessing. I chose:
"General Purpose System" +
FTP Server,
File and Storage Server,
Office Suite and Productivity,
Virtualization Hypervisor,
Virtualization Tools, and
Development Tools
2. I'm seeing some apparently innocuous errors: ATAx: softreset failed (device not ready)
The "x" depends on which drives are installed. I get more of those errors with two drives installed than with only one.
Whole-disk RAID means you have to be careful. Since the individual disks aren't labeled, they look blank to most software. It's better to do software RAID on partitions.
Distribution: Red Hat Enterprise Linux, Mac OS X, Ubuntu, Fedora, FreeBSD
Posts: 89
Quote:
Originally Posted by smallpond
Whole-disk RAID means you have to be careful. Since the individual disks aren't labeled, they look blank to most software. It's better to do software RAID on partitions.
Careful how? You don't need to use partitioning software on partition-less disks, so I think it's a moot point. On the plus side, one of the benefits of a partition-less disk is that you can easily resize it on the fly. For example, to resize the root filesystem in the previous example you can simply run:
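Assuming the root filesystem lives on an LVM logical volume whose physical volume is the bare disk (device and volume names below are placeholders, not from an actual box), the online grow is roughly:
pvresize /dev/sdb                              # pick up the new size of the grown disk/LUN
lvextend -r -l +100%FREE /dev/vg_sys/lv_root   # grow the LV and its filesystem in one online step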
You don't even need an outage window, because it's an online, on-the-fly resize, and any junior-level Linux admin is capable of running that command. There's also less risk of human error; the last thing you want is a junior admin manually mucking around with partition tables.
Another example: let's say you want to migrate your system to larger LUNs. That's easily accomplished with:
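Again assuming LVM on bare-disk PVs, with /dev/sdc standing in for the new, larger LUN:
pvcreate /dev/sdc            # initialise the new LUN as a physical volume
vgextend vg_sys /dev/sdc     # add it to the volume group
pvmove /dev/sdb /dev/sdc     # migrate every extent off the old LUN, online
vgreduce vg_sys /dev/sdb     # finally drop the old LUN from the volume group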
Thank you, that's an excellent thought. Even if what I did had worked, it still might have left the system vulnerable to a failure if a drive developed a bad block within the swap partition. It's obviously better to use a swap file on the RAID1 partition; I don't know what I was thinking.
So I deleted the swap partition entries from /etc/fstab and added a 3 GB swap file on the ext4 file system, on md0.
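In case anyone wants the recipe, a swap file on the md0 root can be created roughly like this (the path /swapfile and the exact size are just illustrative):
dd if=/dev/zero of=/swapfile bs=1M count=3072   # 3 GB file on the ext4 root (on md0)
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile  none  swap  defaults  0 0' >> /etc/fstab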
That worked fine when both drives were present. But if I removed one drive, I was right back at the emergency mode prompt again.
However, when I checked the log with "journalctl -xb" I noticed it was complaining about my second array (/dev/md1, with the ext3 filesystem), not the main ext4 filesystem. So I commented out that line in /etc/fstab and tried again, and this time it booted properly, from the degraded RAID1 array!
Apparently the boot problem wasn't due to my main /dev/md0 RAID1 array; it was caused by the other partitions: the two swap partitions and the /dev/md1 RAID1 array.
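A gentler alternative to commenting the line out entirely, if I understand systemd correctly, is to mark non-essential mounts with nofail so a missing device no longer drags the boot into emergency mode; something like this (the UUID is a placeholder):
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /safe  ext3  defaults,nofail,x-systemd.device-timeout=10  0 2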
After I shut down and reinstalled the missing drive, I started the machine again, and it was still running on just one drive. So I used "mdadm --add" to add the missing drive back, and its state went to "spare rebuilding" for a while, and then to "active."
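For the record, the re-add and rebuild can be kicked off and watched with commands along these lines (device names assume the partition layout from my first post):
mdadm --add /dev/md0 /dev/sdb1    # return the replaced disk's partitions to their arrays
mdadm --add /dev/md1 /dev/sdb2
watch -n 5 cat /proc/mdstat       # recovery progress; the new member shows as "spare rebuilding"
mdadm --detail /dev/md0           # state returns to "active"/"clean" once the resync completes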
In other words, it's working perfectly.
I thank you both very much for your helpful advice!