Linux - Server
This forum is for the discussion of Linux software used in a server-related context.
I am looking for suggestions to provide myself some bootstrap training.
I soon will have access to a Dell R710 with an H700 RAID controller with five 1 TB disks. I want to learn more about monitoring and responding to various RAID events. Basically I want to write my own lab training lesson plan.
I have only worked with RAID controllers already in production and never have faced failures or degradation events. Although by no means a guru, I am familiar with the megacli command and have written some simple megacli script wrappers. There are oodles of megacli articles online. I am not interested in the megacli command as much as I am interested in simulating common types of degradation and fixing the problem.
Just looking for a list of real-world things to learn.
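Since you already write megacli wrappers, one natural lab artifact is a health poller you run from cron during each exercise, so you get a timestamped record of when the array went Degraded and when the rebuild finished. A minimal sketch, assuming the CLI is installed as `megacli` (yours may be `MegaCli64` or `storcli`) and that `-LDInfo -Lall -aALL` prints one `State : Optimal` line per virtual drive — verify both against your controller before trusting it:

```shell
#!/bin/sh
# Sketch of a RAID health poller for a lab notebook. Assumptions: the
# CLI binary is "megacli" (may be MegaCli64 or storcli on your box) and
# "-LDInfo -Lall -aALL" prints one "State : Optimal" line per virtual
# drive -- check both on your controller first.

# Count "State : ..." lines on stdin whose value is not Optimal.
count_non_optimal() {
    awk -F: '/^State/ && $2 !~ /Optimal/ {n++} END {print n+0}'
}

if command -v megacli >/dev/null 2>&1; then
    bad=$(megacli -LDInfo -Lall -aALL -NoLog | count_non_optimal)
    if [ "$bad" -gt 0 ]; then
        echo "$(date -Is) WARNING: $bad virtual drive(s) not Optimal"
        # Snapshot the controller event log for the lab notebook.
        megacli -AdpEventLog -GetEvents -f /tmp/raid-events.log -aALL -NoLog
    fi
fi
```

The parsing lives in its own function so you can sanity-check it against saved megacli output before pointing it at live hardware.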
AFAIK the Dell RAID shows up as one single drive in Linux. I am not sure if there are applications to access the RAID controller from within Linux.
If you install VMWare on the Dell the RAID is seen as a single drive and I am sure there is no way to access the controller from VMWare or from a guest. Not even when using the Dell version of ESXi.
You communicate through iDRAC with the RAID controller for configuration and monitoring.
It seems to be Dell's intention to "set and forget" the RAID. If you have a defective disk, you pull it out and insert a new one. It will rebuild.
I must advise against RAID 5. Rebuilding very large arrays (a few TB) can take longer than is acceptable to run degraded.
Quote:
AFAIK the Dell RAID shows up as one single drive in Linux. I am not sure if there are applications to access the RAID controller from within Linux.
If you install VMWare on the Dell the RAID is seen as a single drive and I am sure there is no way to access the controller from VMWare or from a guest. Not even when using the Dell version of ESXi. You communicate through iDRAC with the RAID controller for configuration and monitoring. It seems to be Dell's intention to "set and forget" the RAID. If you have a defective disk, you pull it out and insert a new one. It will rebuild.
Agree with this. While there are utilities that you CAN use to monkey around with a RAID array, it's really best (in terms of a hardware RAID solution), to let the controller do it. You can get....interesting...results otherwise. You should still be able to see the individual drives with smartctl and other utilities, though, as far as I remember. But when I build a HW RAID, I'll usually monitor it with SNMP, which will return all the goodies.
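For the SNMP route, here's a hedged sketch. `1.3.6.1.4.1.674` is Dell's registered enterprise OID subtree, but which tables exist under it depends on which agent (OMSA vs. iDRAC) is actually answering, and the hostname and community string below are placeholders:

```shell
#!/bin/sh
# Sketch: poll a Dell box's SNMP agent for storage trouble. The host and
# community string are placeholders; 1.3.6.1.4.1.674 is Dell's registered
# enterprise OID, but the tables beneath it depend on the agent (OMSA vs.
# iDRAC) answering the query.
HOST=${1:-idrac.example.lan}
DELL_OID=1.3.6.1.4.1.674

# Reduce a walk to a one-word verdict: ALERT if any line mentions a
# degraded/failed/rebuilding state, OK otherwise.
scan_for_trouble() {
    if grep -qiE 'degraded|failed|rebuild'; then echo ALERT; else echo OK; fi
}

if command -v snmpwalk >/dev/null 2>&1; then
    snmpwalk -v2c -c public -t 2 -r 1 "$HOST" "$DELL_OID" 2>/dev/null |
        scan_for_trouble
fi
```

Paired with a trap receiver (or just this in cron), that's enough to notice a degrade event without ever logging in.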
Quote:
I must advise against RAID 5. Rebuilding very large arrays (a few TB) can take longer than is acceptable to run degraded.
I've had mixed results. If you have a controller with a decent amount of RAM, and the disk isn't getting hammered, you can get a fairly quick rebuild, but it's still going to be a while. But at least your system isn't down during that time, and even if it takes a week or so to rebuild, your system is still up. And the chances of a second drive failing in that window are pretty small as well.
Well, um, thanks, I guess. I was not asking for opinions about troubleshooting or theory; I was asking for ideas on how to conduct some self-training with a RAID controller. For example: pull a drive to degrade the array, replace the drive, and monitor the array rebuilding. I am looking for various ways to degrade an array and learn how to respond. The point is that I have basic experience monitoring arrays but no experience handling actual failures.
Quote:
You should still be able to see the individual drives with smartctl and other utilities
With many controllers smartctl will pierce the veil to see individual drives, but fails to provide information about the array. The megacli command works great to query a supported RAID controller and fill that void.
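For the record, smartmontools can address individual disks behind a MegaRAID-family controller (which includes the PERC H700) with `-d megaraid,N`. A sketch, where the device IDs 0-4 and `/dev/sda` are assumptions you'd match against your `megacli -PDList` output:

```shell
#!/bin/sh
# Sketch: query SMART health/attributes for each physical disk behind an
# LSI/PERC controller via smartctl's megaraid passthrough. Device IDs 0-4
# and /dev/sda are assumptions -- match them to your -PDList output.

# Build the smartctl invocation for one controller device ID.
smart_cmd() {
    printf 'smartctl -d megaraid,%s -H -A /dev/sda' "$1"
}

if command -v smartctl >/dev/null 2>&1; then
    for id in 0 1 2 3 4; do
        cmd=$(smart_cmd "$id")
        echo "== $cmd =="
        eval "$cmd" || true   # -H exits non-zero when a disk reports trouble
    done
fi
```

Watching reallocated-sector counts per physical disk this way is a decent complement to the array-level view megacli gives you.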
Quote:
I must advise against RAID 5. Rebuilding very large arrays (a few TB) can take longer than is acceptable to run degraded.
Well, with large disks rebuilding any RAID takes a long time. I have access to one system with 1 TB drives using RAID 1 with one hot spare, and I have seen that system take most of a day to rebuild the array. My cynical opinion is that RAID 5 tends to be pushed because it sells more hard drives. To be fair, a 3-disk RAID 5 provides an extra disk of usable capacity compared to a 2-disk RAID 1, yet can only survive one disk failure, just like RAID 1. Although striping improves overall throughput, I am not fond of the whole striping thing: unnecessary complexity for most users. Yes, backups are required regardless, but to me RAID 1 is simpler to maintain and recover. That all said, I have an opportunity with this Dell R710 to learn more about RAID, and that was my hope with this thread.
Quote:
Well, um, thanks, I guess. I was not asking for opinions about troubleshooting or theory; I was asking for ideas on how to conduct some self-training with a RAID controller. For example: pull a drive to degrade the array, replace the drive, and monitor the array rebuilding. I am looking for various ways to degrade an array and learn how to respond. The point is that I have basic experience monitoring arrays but no experience handling actual failures.
I would strongly suggest you don't just 'pull a drive' to degrade the array, unless you have hot-swap drives. However, for testing, powering the system off and pulling a drive will get the array degraded, and let you look.
There *MAY* be tools that give you the kind of visibility you'd get from an mdadm command on software RAID, but that depends on your controller. With something like RAID 5 or 6, the system won't even notice a failed drive and will continue as normal; unless you poke through the logs or have SNMP set up, you won't notice either. Putting the replacement drive in is similarly invisible...the controller is doing the grunt work there.
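Speaking of mdadm: one cheap way to rehearse the whole fail/remove/add cycle before ever touching the H700 is software RAID on loop devices, where a mistake costs nothing. A dry-run sketch — it only prints each step unless you set RUN=1 as root, and `/dev/md99` plus the `/dev/loop9x` names are arbitrary lab choices:

```shell
#!/bin/sh
# Rehearse degrading and rebuilding an array with mdadm on loop devices.
# Dry-run by default: every step is printed; set RUN=1 (as root) to
# actually execute. /dev/md99 and /dev/loop90-92 are arbitrary lab names.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# 1. Back three 512 MB image files with loop devices.
for i in 0 1 2; do
    run truncate -s 512M "/tmp/raid$i.img"
    run losetup "/dev/loop9$i" "/tmp/raid$i.img"
done

# 2. Build a 3-disk RAID 5 and watch the initial sync.
run mdadm --create /dev/md99 --level=5 --raid-devices=3 \
    /dev/loop90 /dev/loop91 /dev/loop92
run cat /proc/mdstat

# 3. Degrade it, then rebuild -- the same drill you'd run on hardware.
run mdadm --manage /dev/md99 --fail /dev/loop91
run mdadm --manage /dev/md99 --remove /dev/loop91
run mdadm --manage /dev/md99 --add /dev/loop91
run mdadm --detail /dev/md99

# 4. Tear down.
run mdadm --stop /dev/md99
for i in 0 1 2; do run losetup -d "/dev/loop9$i"; done
```

When you run it for real, `cat /proc/mdstat` during step 3 shows the rebuild percentage climbing, which is exactly the behavior worth learning to recognize before doing it on the PERC.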
Quote:
With many controllers smartctl will pierce the veil to see individual drives, but fails to provide information about the array. The megacli command works great to query a supported RAID controller and fill that void.
It can, but it may not (as you say) depending on controller. LSI and Adaptec have utilities for their hardware RAID controllers that will get you further than the standard Linux utilities.
Quote:
Quote:
Well, with large disks rebuilding any RAID takes a long time. I have access to one system with 1 TB drives using RAID 1 with one hot spare, and I have seen that system take most of a day to rebuild the array. My cynical opinion is that RAID 5 tends to be pushed because it sells more hard drives. To be fair, a 3-disk RAID 5 provides an extra disk of usable capacity compared to a 2-disk RAID 1, yet can only survive one disk failure, just like RAID 1. Although striping improves overall throughput, I am not fond of the whole striping thing: unnecessary complexity for most users. Yes, backups are required regardless, but to me RAID 1 is simpler to maintain and recover. That all said, I have an opportunity with this Dell R710 to learn more about RAID, and that was my hope with this thread.
Yes, it does take a long time...but honestly, who cares? Because:
The system is up
No data is lost
This happens in the background
For me, even if it takes several days, it's a non-issue. Because as you say, you have backups...and you're playing the odds too. What are the chances that two drives in one system/array will fail within a week of each other? RAID 5 and 6 are my go-tos, followed by RAID 50/60. I despise RAID 1 because I've been bitten by the corrupted mirror numerous times. Yes, the drives are mirrored...but if one gets corrupted, it'll do nothing but corrupt the data on the second drive, leaving you dead in the water. Not to mention the many times where you'd have to power the system down, put the mirror into the primary slot, get a new drive, power up, and then wait xxx time for the re-mirror.
Have only had one instance of RAID5 going squirrely, and that was because of a flaky HP RAID controller (HP...they put the "J" in quality...)