Slackware with root on RAID1: mdadm vs. hardware

thenktor · 11-02-2018, 05:13 AM

Hi,

I want to setup a Slackware server. It's main purpose is:
1. acting as docker host for several containers
2. acting as file server

Although installing Slackware with the necesseary packages (and restoring the config files from a backup) is a quite fast and straightforward task (I guess less than 3h) I'm interested in using RAID1 for the root file systems. But I wonder if mdadm or hardware RAID is the preferable way to go.

What I've thought about so far:
Hardware RAID:
* Used LSI SAS can be bought for cheap on eBay, so buying 2 of them for having a spare part also won't be a problem.
* Last time I've used a hardware RAID it was quite simple to use. IIRC I just needed to configure the RAID in the controller bios, boot installer as usual and set things up. Linux won't even notice, that it is running on a RAID
* To watch RAID status some manufacturer specific tools are need, which may not work correctly on Slackware. Maybe this is also possible with SMART?
* Controller won't have battery backup, so write cache has to be disabled. Is it a performance problem with SSDs?

mdadm:
* Standard tools, good documentation, works everywhere
* Needs some more care when installing
* Can LILO still boot, when a drive fails/is removed?

Which way would you chose?

chemfire · 11-02-2018, 06:54 AM

LILO can still boot; others might have different experiences or know a better way. I used to do raid1 on my work station back in the day when drivers were less reliable and media for making quality frequent backups at home was more costly than today.

What I always did was create a non-raid boot partition on each drive.

I would mount the second drives boot partition somewhere like /boot.2nd or similar.

Copy the contents of /boot to /boot.2nd; anytime you change the kernel

Install lilo to both the drives boot records; change lilo conf such that the kernel it points to is that of the second drive before you install lilo the second time. You can also make boot entries for both copies of the kernel and just switch the order; Or not even worry about doing that if you don't need the be sure the system can reboot unattended and are willing to manually pick the second boot entry from the lilo menu if the primary drive fails.

Lilo has switch to indicate you what device you want to install it on, so you can put together scrips to do most of this easily.

The kernel, whichever copy starts, will be able to assemble the raid volumes on the 2nd, 3rd, ... Nth partitions of the metadata data even if they are degraded (missing one drive)

- If you get an EFI system this would probably be even easier with ELILO as you'd just need to install that twice once, so the EFI system knows about both and then you can just change the config file on the respective boot partitions.

3rensho · 11-02-2018, 07:09 AM

I am using a RAID1 set (mdadm) for Slackware64-current and it works fine. I use grub for booting.

Mark Pettit · 11-02-2018, 08:33 AM

OK - consider this - you want RAID for failure from recovery. But what if the computer itself fails? You want to be able to remove the hard disks, drop them into another machine and boot ... But not if you used hardware RAID. Moral of the story - go with mdadm.

upnort · 11-02-2018, 11:22 AM

First rule of RAID is RAID is for redundancy and is not a substitute for a backup strategy.

RAID is for "business continuity" only.

With a hardware controller, all aspects of configuring the array are performed from the controller firmware. An operating system is not involved. A hardware controller presents a "virtual disk" to the operating system. Typically an operating system cannot break the veil of the virtual disk and only sees /dev/sda. A virtual disk means an operating system does not know about the controller.

The size of the virtual disk can be increased by changing to a RAID type that uses striping. RAID 1 does not support striping. The virtual disk of a RAID 1 is the size of the smallest disk. Typically with RAID 1 the two disks are the same size and from the same vendor.

Preparing disks on a controller varies with the size of the disks and array. Also on the type of initialization. A short initialization will be faster but a full initialization will wipe the disks. Basically writing zeroes to force disk firmware to map bad sectors. A short init usually is less than 15 minutes or so and a full init can take several hours. You could use dd to perform your own zero wipe of disks and then in the controller config perform a short init.

Breaking the veil of the controller from within an operating system requires special software. For example, something like megacli will expose the controller configuration. Or some vendor specific software. To my knowledge there are no open source RAID firmware software. Megacli is popular but is closed source.

Hardware RAID stores array metadata on both the controller and disks. If a controller fails, generally if the replacement controller is from the same vendor and controller family, the controller will recognize the metadata on the disk. The controller firmware will treat disks as "foreign" but allow importing the meta data to the controller. All bets are off if the replacement controller is from a different vendor. If the replacement controller is from the same vendor but too many generations in between, then the meta data on the disks might not be recognized.

When a disk fails with a hardware controller, usually there is a pecking order of steps to replace. Basically, remove the disk from the array, configure the disk as physically unavailable, then actually remove the disk. Inserting the replacement disk is the reverse. The controller will see the replacement disk as a foreign disk and the controller needs to be told what to do. Once the disk is no longer considered foreign, the controller will start rebuilding the array. Since RAID is about business continuity, replacing a disk "hot" while the operating system is live is preferred. Otherwise a shutdown is required to configure the replacement disk through the controller firmware.

Many hardware controllers support hot spares. A hot spare is a spare disk that when one disk in the array fails, the controller automatically uses to start rebuilding the array. With RAID 1 that means three disks.

Although typically faster, you do not need to buy SAS drives. SAS is compatible with SATA but not vice-versa. If you use RAID 1 then you do not have to get finnicky about using disks designed for RAID.

SMART can penetrate the veil of the controller to some degree, at least with respect to SMART attributes. The syntax for accessing disks on a hardware controller is a tad funky.

Stay away from really old controllers. They work fine but require obtuse software to access from within the operating system. Old controllers are unlikely to support SSD or SATA III.

Software RAID is managed only from with the operating system. The metadata is stored on the disks only because there is no controller.

I do not know how well Linux software RAID supports hot spares.

I am not a RAID guru and don't play one on TV. That said, at work I maintain a handful of Dell R710 servers using Dell H700 and 6i hardware controllers. We have two Supermicro systems but I don't recall the controller types. We use megacli and smartmontools to query health status of the disks.

We have one test system running Linux software RAID 1. We are planning a new office server and that system will be Linux software RAID 1. I have no experience with drive failures with software RAID -- something that is on my to-do list to learn before installing the new office server.

None of the servers at work run Slackware.

The RAID principles are nonetheless the same. I can't share experiences with running Slackware on hardware controllers, but I see no reason why anything should be different.

One way or another you'll want email alerts to monitor the array. I do that with shell scripts and cron jobs. The shell scripts use both megacli and smartctl.

Which would I choose? Depends. For a home or small business server probably software RAID. If I wanted experience and expand skills or wanted to run a home lab then a controller. You won't see any speed differences between the two approaches because RAID 1 is a simple mirroring system.

Most home servers do not have a critical need for business continuity. For me, for many home servers RAID is overkill because of the complexity. I do not use RAID at home. Instead I use a good backup strategy that I test regularly.

The second rule of RAID is never depend upon RAID for backups.

Always have a good backup plan in place.

I hope that helps!

TracyTiger · 11-02-2018, 12:11 PM

I've been using LILO booted LUKS encrypted software RAID1 on multiple machines for years. It works great for me on hard disks and SSD. All disks/SSD including root are encrypted and part of a RAID, except /boot is not encrypted but is still part of RAID.

My experience is that the MBR boot record is written automatically (with a messages stating such) on all members of the RAID. I've verified this by disconnecting all but one device at a time and it always boots without problems.

I've not had to use the process chemfire described above. It may be that older software required it and newer versions don't.

For my use as servers and business applications, the performance is good and for my uses I don't notice any slowing due to the extra processing of my setup. I use medium performance machines that get swapped out every 6 or 8 years.

In another lifetime I used hardware RAID on a variety of manufacturers machines (data center equipment) which worked fine (with spare hardware). I find software RAID easier to use and I don't have to deal with firmware updates/compatibility from years ago. For my small setups with battery UPS I'm not concerned about the lack of battery backed up memory on the disk controller card that was commonly found with hardware RAID. As stated by others, RAID is not BACKUP. Perhaps hardware RAID is less complicated and more transparent these days.

Previously I found it necessary to put the LVM option ("-L") in the mkinitrd command when building initrd for RAID to work. This may have changed in -current or updates to 14.2.

EDIT: Software RAID tip # 2 & 3 - Let all RAID members finish building before installing software/data on the RAID. This may take many hours depending upon disk sizes. "cat /proc/mdstat" to check or use a mdadm command.

Increase minimum speed on RAID build speed for faster build. "echo 100000 > /proc/sys/dev/raid/speed_limit_min"

chemfire · 11-02-2018, 03:45 PM

I have not had a software raid system on the boot drive sine like Slackware 11.0; so its very very possible things have changed.

twy · 11-02-2018, 04:29 PM

I have used mdadm RAID on slackware64 for about the last 7 years without problems. About 8 years ago, I was using LSISAS-1068E RAID1 (firmware) with two 2x1TB WD RE (raid edition) disks that have the proper time-limited error recovery (TLER) feature for RAID use. The firmware worked fine for RAID1, but I never had a failure to see how it recovers.

After a year, I decided to buy two more 1TB WD RE disks and configure them as a 2-disk mdadm RAID5. Each disk was partitioned using gdisk (GPT partitions) with a small boot partition and the remaining space for the root disk. I left 8MB (or more) of unpartitioned space at the beginning of each disk, end of each disk, and between the boot and root partitions. The partitioning is the same for each disk, using sgdisk to copy the partitoning of one disk to another disk. Leaving the space at the end of the disk helps protect in case a replacement is slighly smaller, and to make sure there is room at the end of disk for metadata.

The boot partitions are put into RAID1 with metadata version 0.90, which I think was required for the RAID1 to be understood by LILO's mdadm RAID1 boot support. In lilo.conf, it looks like this:
append=" vt.default_utf8=1 "
boot = /dev/md0
lba32
raid-extra-boot = mbr-only
compact

The root partitions were put into RAID5 metadata version 1.2 with 64K chunk size at /dev/md1. 64K chunk size has worked well for me, since it strikes a balance of good rewrite speed and less load on the disks if you are making many small writes.

LUKS is installed over md1 for full root encryption.
LVM2 is install over LUKS/md1 to make an actual logical volumes for root device, a swap device, and some unallocated LVM2 blocks in case I want to make another disk device for something like a KVM vm.

Options in /etc/mkinitrd.conf allow the initrd, at boot time, to understand this kind of configuration and prompt for the LUKS key, start LVM2, and use the root and swap devices. The swap device is not very good (very slow) if it becomes active, but it does work and is maybe better than getting really out of memory.

Options to tune2fs (using ext4 fs on LVM2 root device and on the plain RAID1 boot device) can set "stride" and "stripe_width" (in 4K fs blocks) to help improve performance with the RAID5. These options need careful calculation to set them properly.

Options in /etc/rc.d/rc.local:
# how much dirty data to wait for before flushing
echo xxx > /proc/sys/vm/dirty_background_bytes
# how much cache for raid5/6... should be some multiple m of the FULL stripe size
# that includes the parity data.
echo xxx > /sys/block/md1/md/stripe_cache_size
# read-ahead from the RAID1 disks
/sbin/blockdev --setra xxx /dev/sd[abc...]
# read-ahead from the RAID5/6.. should be some multiple m of the data-only
# stripe size... should match the stripe_cache_size, but only for the data part
/sbin/blockdev --setra xxx /dev/md1

More info on these parameters and details can be found on the web, I guess.

So, slackware64 supports booting and installing on this kind of disk config using an initrd (mkinitrd).

After running on this kind of two-disk RAID5 for a while, I added the other two 1TB disks to it as spare disks. Then, I used mdadm to grow+reshape the 2-disk RAID5 into a 4-disk RAID6 while online. Later, I bought four more 1TB WD RE and added them all as spares to the RAID6, then again used mdadm to grow the 4-disk RAID6 into an 8-disk RAID6.

After growing+reshaping, you have to adjust some of the parameters above for cache and in tune2fs. Well, (maybe first) you also have to resize the LUKS device using cryptsetup to let it take in the new space in the underlying RAID device (the default resize action). Then, you also have to use lvm's "pvresize" to take in the enlarged underlying LUKS disk to grow the volume group (vg). Then, you use lvm's "lvextend" to take in more of the vg free space into the LVM root device. Um, then you use resize2fs to resize the ext4 root filesystem. It may seem to be a scary sequence of commands, but it all works!

As you add disks, the RAID1 can also take in more boot partitions as spare disks, which can also be added to the RAID1 as active disks in an n-disk RAID1 (not only 2-disk RAID1). This way, all of the disks are automatically holding a copy of the boot disk.

Basically, if you would like to have the ability to add disks to your raid, you can just start with a two-disk RAID5 instead of a 2-disk RAID1. The LUKS on top encrypts everything except the RAID1 boot disk. LVM2 on the root disk allows you some more flexibility, such as maybe separate a /home logical volume or other breakdown of the root disk structure.

The configuration is a little bit technical, but not too bad. Again, I have run like this for 7 years with no problems. A disk failed before, and I pulled it out and put in a replacement, which I used sgdisk to copy over the partitioning and then used mdadm to add boot and root partitions as spares into the md0 and md1. The rebuilds ran (seeing cat /proc/mdstat) and everything recovered fine. RAID6 can run online even if two disks fail. The performance of this setup is at least as good as a normal disk, or has been good enough for my needs without noticing any problem.

About once a month, it is a good idea to run:
echo check > /sys/block/md1/md/sync_action
on the RAID5/6 to let it check for any disk errors (run dmesg after to see kernel messages)
Run, smartctl -a /dev/sdx, to see error counts. If any disk is failing with reallocated sectors, then probably better to go ahead to replace it before total failure.

Once again, you must create a basic initrd with mkinitrd and with correct settings in /etc/mkinitrd.conf to use mdadm/luks/lvm2/ext4 correctly. Making an initrd has to be done for each new kernel you install - best to generate the initrd-tree-xxx and initrd.gz-xxx for each kernel version xxx, and have lilo use kernel image /boot/vmlinuz-xxx with initrd /boot/initrd-gz-xxx. When you run lilo, with /dev/md0 RAID1 device as boot disk, lilo will write out to the MBR on each RAID1 member disk with code to load the linux kernel on the member disk. So, the boot is protected too.

So, that is the idea. Maybe it missed some details, but that is mainly it. Do your own further research on this and good luck.

lazardo · 11-26-2018, 11:05 PM

I've used software raid since dark ages (raidtools, mid-90s) on every home and work system that runs linux and has more than one disk. Here's what works without too much trouble:

* raid0 (stripe set) == fast & dangerous. Used for scratch/crash&burn, not as useful with solid-state media
* raid1 (mirror) == fast read, slow write, redundancy, simple. Used for boot partition w lilo
* raid10 (mirror) == 2x read, 1.5x write, redundancy, less simple. This is the 2 disk model using '--layout=f2' (see md(4) for details). This is the one to use unless you need raid0 or raid1

More trouble than the above:
* raid5 == slower performance, better redundancy, bigger blobs. More complex, long init and rebuild times, not recommended in most cases.
* raid6 == raid5 + even better redundancy and complexity and longer init, rebuild times. Recommended if you need big storage blobs ala real storage servers, not for desktop/laptop.
* All other variants are essentially historical artifacts in modern environments (this includes hardware controllers).

Non-obvious:

* mdadm is deceptively non-trivial but very rewarding for performance, failure recovery, [disk] portability and migration. That being true, learning curve just to get to 'comfortable' is steep.
* raid initialization happens in the background which means as soon as the array is configured it can be used with a small performance hit. Similar thing happens with ext4.
* A nice starter kit for 2 disks would be: 1 small raid1 partition for /boot, other raid10_f2 partitions as needed, lilo with:

boot = /dev/md1
raid-extra-boot = "/dev/sda,/dev/sdb"
compact
large-memory
lba32

root = /dev/md2

* see https://raid.wiki.kernel.org/index.php/Linux_Raid
* see https://www.thomas-krenn.com/en/wiki...ery_and_resync

* bonus: mdadm supports background scrubbing and also has a write intent bitmap that makes for rebuild times measured in minutes rather than hours.

Cheers,

koloth · 11-30-2018, 05:11 AM

Keep in mind that SW RAID always carries a performance penalty.
Although I never has any issues with Linxu raid on Slackware or Red Hat I tend to prefer HW RAID if available if only for pure performance reasons.
RAID is definitely not backup and should not be used as such. If you are worried about total system failure then a good backup strategy will save you, not SW RAID.
Even an image with a tool like clonezilla and a decent service backup will do.
My opinion, Go with HW RAID

Gerard Lally · 11-30-2018, 06:24 AM

Quote:

Originally Posted by koloth

Keep in mind that SW RAID always carries a performance penalty.

Whatever about write performance, mdadm RAID1 read performance can be faster, if anything, and certainly shouldn't be slower.

lazardo · 12-05-2018, 03:08 PM

Here's my laptop and its RAID10 big performance hit.

Code:

~/Desktop$ sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

~/Desktop$ uptime; sudo /usr/bin/time dd if=/dev/md5 of=/dev/zero bs=1G count=32; uptime
 12:54:16 up  3:09,  0 users,  load average: 0.29, 0.40, 0.28
32+0 records in
32+0 records out
34359738368 bytes (34 GB, 32 GiB) copied, 34.3847 s, 999 MB/s
0.00user 9.88system 0:34.38elapsed 28%CPU (0avgtext+0avgdata 4201360maxresident)k
67110176inputs+0outputs (1major+1113minor)pagefaults 0swaps
 12:54:50 up  3:10,  0 users,  load average: 0.68, 0.49, 0.31

~/Desktop$ fdisk -l /dev/sd{a,b}
Disk /dev/sda: 238.5 GiB, 256060514304 bytes, 500118192 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xdbc21b04

Device     Boot    Start       End   Sectors  Size Id Type
/dev/sda1  *        2048    206847    204800  100M fd Linux raid autodetect
/dev/sda2         206848  42149887  41943040   20G fd Linux raid autodetect
/dev/sda3       42149888  84092927  41943040   20G fd Linux raid autodetect
/dev/sda4       84092928 499329023 415236096  198G  5 Extended
/dev/sda5       84094976 499329023 415234048  198G fd Linux raid autodetect
Disk /dev/sdb: 238.5 GiB, 256060514304 bytes, 500118192 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x49563bd8

Device     Boot    Start       End   Sectors  Size Id Type
/dev/sdb1  *        2048    206847    204800  100M fd Linux raid autodetect
/dev/sdb2         206848  42149887  41943040   20G fd Linux raid autodetect
/dev/sdb3       42149888  84092927  41943040   20G fd Linux raid autodetect
/dev/sdb4       84092928 499329023 415236096  198G  5 Extended
/dev/sdb5       84094976 499329023 415234048  198G fd Linux raid autodetect

~/Desktop$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] 
md1 : active raid1 sdb1[1] sda1[0]
      102336 blocks [2/2] [UU]
      
md2 : active raid10 sdb2[1] sda2[0]
      20971264 blocks 128K chunks 2 far-copies [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md3 : active raid10 sdb3[1] sda3[0]
      20971264 blocks 128K chunks 2 far-copies [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md5 : active raid10 sdb5[1] sda5[0]
      207616768 blocks 128K chunks 2 far-copies [2/2] [UU]
      bitmap: 0/2 pages [0KB], 65536KB chunk

unused devices: <none>

~/Desktop$ uname -a
Linux msi 4.9.74 #44 SMP Sat Jan 20 23:51:59 PST 2018 x86_64 Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz GenuineIntel GNU/Linux

~/Desktop$ cat /etc/sl*n
Slackware 14.2

~/Desktop$ arch
x86_64

~/Desktop$ cat /proc/cpuinfo | tail -20
cpu MHz		: 800.000
cache size	: 6144 KB
physical id	: 0
siblings	: 8
core id		: 3
cpu cores	: 4
apicid		: 7
initial apicid	: 7
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs		:
bogomips	: 5186.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

rworkman · 12-05-2018, 11:31 PM

Personal preference, ultimately - you should use whichever solution you are most comfortable being responsible for its reliable operation.

That said, I *strongly* prefer mdadm; there have been quite a few occasions in which I as able to pull a drive from an array on one machine and insert it into a completely different machine without any concern about whether I could read the data off. That's worth a small performance penalty to me, if one indeed exists (I'm not convinced that it does, for what it's worth).

kikinovak · 12-06-2018, 02:59 PM

Quote:

Originally Posted by Mark Pettit

OK - consider this - you want RAID for failure from recovery. But what if the computer itself fails? You want to be able to remove the hard disks, drop them into another machine and boot ... But not if you used hardware RAID. Moral of the story - go with mdadm.

Remove one of the disks, put it into another machine, boot up a live system, build a dummy software RAID array with just one disk on the fly and then mount it. Works like a charm.

Cheers,

Niki

lazardo · 03-31-2019, 02:16 AM

Quote:

* raid10 (mirror) == 2x read, 1.5x write, redundancy. This is the 2 disk model using '--layout=f2' (see md(4) for details). This is the one to use unless you need raid0 or raid1

Forgot to mention there's a time cost for 2-disk raid10-f2, in that the initial resync takes 2x longer. Generally more useful for smaller (<=1GB), high performance + interractive workspace than 4TB archives. If you're ok with overnight resyncs then no worries.

An internal bitmap will protect against most subsequent long resyncs.