RAID grow time?

DavidDiggs · 11-08-2007, 12:28 AM

I'm looking to make a home server starting with 6x1TB hard drives.

The motherboard I'm looking at has room for 12 SATA drives, which gives me plenty of room to grow.

My question: if I have a 5TB RAID5 array (or a 4TB RAID6 array), how long will it take to grow the array if I add a single drive?

macemoneta · 11-08-2007, 10:23 AM

About 3 seconds, if you use LVM2.

complich8 · 11-08-2007, 10:50 AM

And how do you propose implementing raid5 using lvm again? Correct me if I'm wrong, but afaik LVM provides an abstraction layer on top of the disk, but it has no RAID capabilities. You need mdadm for that.

A friend of mine recently added another pair of 500 gig drives to his previously 6x500 disk array (so 2.5TB -> 3.5TB), and it took his system about 36 hours to resize the array. I've heard of results like ... 3x400->4x400 in 22 hours. I've also heard of faster results, like 3x320->4x320 in 7 hours. It all depends on the disks, the controllers, the disk activity and the data.

The bigger the array the longer it'll take to resize, since it's basically got to safely rearrange the entire array, which entails copying the entire array a row of blocks at a time, first to empty space, then to the new array space (recalculating the new parity along the way).

For something in the multiple terabyte range, if the array's full and the controllers are about average, I'd estimate 36-72 hours of rebuild time.
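
As a rough sanity check on estimates like these, reshape time can be approximated as array size divided by the sustained reshape rate. This is only a sketch: the function name reshape_hours and the 25 MB/s rate are illustrative assumptions, not measurements from any of the arrays mentioned above, and a real reshape rarely holds a constant rate.

```shell
#!/bin/sh
# Rough reshape-time estimate: size / sustained rate.
# Assumes a constant effective rate (an assumption, not a measurement).
reshape_hours() (
    size_gb=$1       # data the reshape has to touch, in GB
    rate_mb_s=$2     # assumed effective reshape rate, in MB/s
    seconds=$((size_gb * 1000 / rate_mb_s))
    echo $((seconds / 3600))
)

# Hypothetical example: 5 TB at an assumed 25 MB/s effective rate.
reshape_hours 5000 25
```

With those assumed inputs the result is 55 hours, inside the 36-72 hour range. On Linux md the rate is also bounded by the values in /proc/sys/dev/raid/speed_limit_min and speed_limit_max.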

Granted, resize time and downtime are two different things ... afaik you can grow a raid "hot" (ie: mounted, active).

macemoneta · 11-08-2007, 01:57 PM

LVM2 provides both RAID0 and RAID1 functionality, but what I was referring to is using multiple RAIDed sets and adding the md devices to LVM2 as physical volumes. You should consider using smaller (not more than 250GB) RAID units, and aggregating the space with LVM2. RAID recovery on 6TB will probably take a couple of days otherwise.

complich8 · 11-08-2007, 03:17 PM

... I'd like to hear how exactly you propose laying something like that out... like, in a real, technically informative way.

macemoneta · 11-08-2007, 10:12 PM

For your 6x1TB drives, 4 partitions per drive, for example:

sda1, sda2, sda3, sda4
sdb1, sdb2, sdb3, sdb4
sdc1, sdc2, sdc3, sdc4
sdd1, sdd2, sdd3, sdd4
sde1, sde2, sde3, sde4
sdf1, sdf2, sdf3, sdf4

Create 4 RAID5 arrays:

md0: sda1, sdb1, sdc1, sdd1, sde1, sdf1
md1: sda2, sdb2, sdc2, sdd2, sde2, sdf2
md2: sda3, sdb3, sdc3, sdd3, sde3, sdf3
md3: sda4, sdb4, sdc4, sdd4, sde4, sdf4
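
The four arrays above could be created with mdadm along these lines. This is a sketch, assuming the partitions already exist with the names listed; the run helper is my own addition and only prints each command unless DRY_RUN=0 is set, so nothing touches the disks by accident.

```shell
#!/bin/sh
# Dry-run helper (an assumption of this sketch): prints each command
# unless DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# One RAID5 array per partition slot: md0 from partition 1, md1 from partition 2, ...
create_arrays() {
    for part in 1 2 3 4; do
        run mdadm --create "/dev/md$((part - 1))" --level=5 --raid-devices=6 \
            "/dev/sda$part" "/dev/sdb$part" "/dev/sdc$part" \
            "/dev/sdd$part" "/dev/sde$part" "/dev/sdf$part"
    done
}

create_arrays
```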

Each md becomes a pv for LVM2:
pvcreate /dev/md0 /dev/md1 /dev/md2 /dev/md3

Create a volume group from the physical volumes:
vgcreate raidpool /dev/md0 /dev/md1 /dev/md2 /dev/md3

You can now add logical volumes with lvcreate. When you add an additional array you can add it as a physical volume, then extend the volume group with vgextend. You can use the added space to add additional logical volumes, or grow existing volumes into the space with lvextend.
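
Those steps might look roughly like this. The volume name media, the sizes, and the new array /dev/md4 are placeholders, and the filesystem is assumed to be ext2/ext3; the run helper only prints commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run helper (an assumption of this sketch): prints each command
# unless DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Carve a logical volume out of the pool (name and size are placeholders).
run lvcreate -L 500G -n media raidpool

# Later: a new array (hypothetical /dev/md4) joins the pool.
extend_pool() {
    new_md=$1
    run pvcreate "$new_md"
    run vgextend raidpool "$new_md"
}
extend_pool /dev/md4

# Grow the existing volume into the new space, then grow the filesystem.
run lvextend -L +200G /dev/raidpool/media
run resize2fs /dev/raidpool/media   # assumes ext2/ext3 on the volume
```

Note that the new md device still has to be built from somewhere, either a new set of disks or spare partitions, before it can be handed to pvcreate.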

It's not as complex as it seems. If you have a recent Fedora, you can use the system-config-lvm GUI to make the management easier.

complich8 · 11-09-2007, 12:39 AM

the thing is ... that's sort of ... pointless.

See, if a disk fails, you're still degrading the _entire_ volume group when it fails, and you're still rebuilding the _entire_ volume group when you replace the disk. Just that instead of recovering a single MD, you're rebuilding four of them. The minimum unit of failure in an array isn't a partition, it's a disk -- you can't really say "oh, /dev/sdd3 failed, I'll swap it out" -- all you can say is "oh, /dev/sdd3 failed, I'll swap /dev/sdd out". I mean, you could leave your failed disk in place and rebuild the array on another partition on a new disk, yes, but you'd still be relying on a failing disk, and ultimately have to replace it and rebuild the whole array.

The idea makes sense in the enterprise world, where each of those slices is actually a disk, because then the model fits a minimal failure nicely. So if I had an array of 4 trays worth of 10 disks each, for example, I'd raid5 all the disk 1's, all the disk 2's, all the disk 3's, etc., then use lvm with the md's as my pv's in a single vg with striped lv's on top of it. If a disk fails, I replace the disk and rebuild the md it failed out of, and lvm doesn't care. If a controller fails, I'm rebuilding the whole array (and again, lvm doesn't care). But if I'm adding another tray, then I'm still rebuilding the entire array.

In the home environment, it's less likely that you're going to go out and add another array worth of disks, and more likely that you're going to say "hmm, I'm out of space, I'll add another 1TB disk to the array". With mdadm, afaik growing an array is an online operation, so while it'll take a while (and longer if you're actively using the disk), there's not a whole lot of real drawback to adding disks and growing the array as you need it to grow.
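
For reference, the online grow with mdadm is roughly two commands plus a filesystem resize. This is a sketch: the new partition /dev/sdg1 and the device count are hypothetical, the filesystem is assumed to be ext2/ext3 sitting directly on the array, and the run helper only prints commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run helper (an assumption of this sketch): prints each command
# unless DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

grow_raid5() {
    md=$1; new_member=$2; total_devices=$3
    # The new partition joins the array as a spare first.
    run mdadm --add "$md" "$new_member"
    # Raising the device count starts the (long) reshape.
    run mdadm --grow "$md" --raid-devices="$total_devices"
}

grow_raid5 /dev/md0 /dev/sdg1 7

# Progress is visible in /proc/mdstat. Once the reshape has finished,
# grow the filesystem into the new space.
run resize2fs /dev/md0   # assumes ext2/ext3 directly on the array
```

If the array is an LVM physical volume rather than a bare filesystem, pvresize plays the role of resize2fs here.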

LVM and raid are a good match for certain purposes. But applying them without thinking out what you're trying to accomplish and whether you're actually accomplishing it with the way you're applying them is counterproductive. LVM is a very good hammer, but often the piece of metal you're interested in embedding in the wall is a screw.

Electro · 11-09-2007, 01:20 AM

I suggest putting two drives in RAID-1 using mdadm and then adding the pair to LVM2 or EVMS. Repeat the process for all three pairs. This provides more redundancy than RAID-5, so you get added safety if a hard drive fails, at a cost of space, and you do not have to worry quite as much about backups. Backing up should still be part of your setup, though.

I recommend RAID-6 instead of RAID-5 if you want more space than the setup that I suggested.
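
To put numbers on the space trade-off for the 6x1TB case, here is a small sketch. The function name usable_tb and the layout labels are my own; the formulas are the usual ones (half the disks for mirrored pairs, one disk of parity for RAID-5, two for RAID-6).

```shell
#!/bin/sh
# Usable capacity for a few layouts, in whole TB.
usable_tb() (
    layout=$1; disks=$2; disk_tb=$3
    case $layout in
        raid1-pairs) echo $(( disks / 2 * disk_tb )) ;;
        raid5)       echo $(( (disks - 1) * disk_tb )) ;;
        raid6)       echo $(( (disks - 2) * disk_tb )) ;;
        *) echo "unknown layout: $layout" >&2; exit 1 ;;
    esac
)

for layout in raid1-pairs raid5 raid6; do
    echo "$layout: $(usable_tb "$layout" 6 1) TB"
done
```

For six 1TB drives that comes to 3 TB for the mirrored pairs, 5 TB for RAID-5 and 4 TB for RAID-6.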

complich8 · 11-09-2007, 01:34 AM

I think RAID6 is a better solution than that for more than 3 or 4 drives though.

With paired RAID-1's, if a disk fails, you're fine, but if both disks in the pair fail, you're hosed. So the number of disks that have to fail for your day/week/month to be ruined is two. If you take it as a given that two disks are going to fail at random, your odds are better for three raid-1's than two raid-5's (40% chance of a crappy week for raid5's, 20% chance of a crappy week for raid1's).

With RAID6, your survivability is two disks, regardless. Any two disks in the array fail, and you're still only degraded/recoverable. IE: given two random disk failures, your chance of a crappy week is still zero.

In other words, while your RAID-1's are more redundant, the RAID6 is more resilient to moderate disasters (ie: two disks failing, noting that if 3 disks fail at random, the raid5's/raid6 are definitely screwed, and you're at like a 40% chance of being ok on the raid1 sets ... but if that third disk dies, you were probably going to have a really crappy week regardless).
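
The percentages above can be checked by brute-force enumeration over six disks. A sketch, assuming the disks are numbered 0-5, the RAID-1 pairs are (0,1) (2,3) (4,5), and the two RAID-5 sets are (0,1,2) and (3,4,5); the function names are my own.

```shell
#!/bin/sh
# With disks numbered 0..5, integer division gives the array a disk
# belongs to: disk / 2 for the RAID-1 pairs, disk / 3 for the RAID-5 sets.

# Percent of two-disk failures that land in the same pair/set (data loss).
two_failure_loss_pct() (
    group_size=$1
    total=0; fatal=0
    for a in 0 1 2 3 4 5; do
        for b in 0 1 2 3 4 5; do
            [ "$a" -lt "$b" ] || continue
            total=$((total + 1))
            if [ $((a / group_size)) -eq $((b / group_size)) ]; then
                fatal=$((fatal + 1))
            fi
        done
    done
    echo $((fatal * 100 / total))
)

# Percent of three-disk failures the RAID-1 pairs survive
# (every failed disk in a different pair).
three_failure_raid1_ok_pct() (
    total=0; ok=0
    for a in 0 1 2 3 4 5; do
        for b in 0 1 2 3 4 5; do
            for c in 0 1 2 3 4 5; do
                [ "$a" -lt "$b" ] && [ "$b" -lt "$c" ] || continue
                total=$((total + 1))
                if [ $((a / 2)) -ne $((b / 2)) ] && [ $((b / 2)) -ne $((c / 2)) ]; then
                    ok=$((ok + 1))
                fi
            done
        done
    done
    echo $((ok * 100 / total))
)

echo "raid1 pairs, 2 failures: $(two_failure_loss_pct 2)% loss"
echo "raid5 sets,  2 failures: $(two_failure_loss_pct 3)% loss"
echo "raid1 pairs, 3 failures: $(three_failure_raid1_ok_pct)% ok"
```

That gives 3 fatal combinations out of 15 for the pairs (20%), 6 out of 15 for the RAID-5 sets (40%), and 8 survivable combinations out of 20 for three failures on the pairs (40%).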

macemoneta · 11-09-2007, 10:12 AM

Quote:

Originally Posted by complich8

the thing is ... that's sort of ... pointless.

See, if a disk fails, you're still degrading the _entire_ volume group when it fails, and you're still rebuilding the _entire_ volume group when you replace the disk.

It's not pointless. Let's say that a disk fails. All the md devices associated with that drive are degraded. You replace the disk. The time to resync the 6TB array is 48 hours. Resync occurs one md at a time, so md0 will be synced in 12 hours. If you put your most critical data on md0 its risk window for secondary failure has been reduced from 48 to 12 hours. More segmentation of the array means faster return to redundancy for the first md device.
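
The arithmetic behind that, as a sketch: it assumes equal-sized segments that resync one after another at a constant rate, and the function name hours_until_redundant is just for illustration.

```shell
#!/bin/sh
# Hours until the Nth md device to resync is redundant again, assuming
# equal-sized segments resynced sequentially at a constant rate.
hours_until_redundant() (
    total_hours=$1; segments=$2; position=$3   # position 1 = first md to resync
    echo $((total_hours * position / segments))
)

for pos in 1 2 3 4; do
    echo "md$((pos - 1)): redundant again after $(hours_until_redundant 48 4 "$pos") h"
done
```

So md0 is back after 12 hours, while the last device still waits the full 48.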

In addition, let's say the sync is interrupted - the system fails. Some number of the partitions have already been synced, which acts as a checkpoint restart when the system recovers - only the remaining md devices will need to have their sync restarted.

The use of LVM2 addresses your initial question, allowing expansion of the available space without the long RAID grow process.

It's not pointless; there's a reason these steps are used in enterprises, where downtime is an important measurement. That's why you use RAID to begin with - downtime avoidance.

complich8 · 11-09-2007, 01:08 PM

Quote:

Originally Posted by macemoneta

If you put your most critical data on md0 its risk window for secondary failure has been reduced from 48 to 12 hours. More segmentation of the array means faster return to redundancy for the first md device.

The assumption here is that your most critical data is, in fact, on md0. But you're using LVM to abstract all the md devices to a single logical volume. You can't make that assumption.

And as far as I know, when a vg loses a member pv, every lv with extents on that pv ends up in a bad state. I haven't experimented with this very much, but I understand this to mean that when your second disk fails after md0 is rebuilt but before md1 is rebuilt, you've still got a corrupt logical volume and you've still got data loss. What's worse, it seems like you've got a more interactive rebuild, such that you're going to have to wait until the first array's done rebuilding to start the second -- meaning sitting there babysitting the recovery, or having significant gaps between volumes (implying an increased exposure window).

Or does mdadm have some queueing mechanism of which I'm unaware?

macemoneta · 11-09-2007, 02:03 PM

This gets more into data policy management than data management, but if you have data of differing importance, you set up volume groups for the priority your policy defines to reflect the return to redundancy time.

For example,

vgcreate critical /dev/md0
vgcreate important /dev/md1 /dev/md2
vgcreate lowpriority /dev/md3

You can then allocate logical volumes from the volume group that defines the policy associated with the data.
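
For instance, something like the following. The volume names and sizes are made up for illustration, and the run helper only prints commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run helper (an assumption of this sketch): prints each command
# unless DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Each logical volume comes out of the volume group that matches its policy.
allocate_by_policy() {
    run lvcreate -L 50G  -n thesis     critical
    run lvcreate -L 400G -n photos     important
    run lvcreate -L 200G -n recordings lowpriority
}

allocate_by_policy

# List which md devices back each logical volume.
run lvs -o +devices
```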

When using md devices for PVs in LVM, LVM will not see a RAID failure - that occurs at a different level. The only time the PV would be impacted is when a multi-drive failure occurs in a RAID1 or RAID5 (i.e., the data is no longer accessible). So the VG is never "lost", and the defined LVs are not impacted. There is no babysitting involved; the md recovery is automatic once the physical device is replaced, and the LVM2 configuration does not need any recovery or intervention.

The loss of redundancy does not impact the data availability of an md device. You still have read/write access to the data. The risk is that a secondary failure might occur between the primary failure and the return to full redundancy. There are multiple ways to address that, depending on the availability requirements: everything from higher levels of redundancy (RAID10, RAID50, RAID6), to the use of segmented RAID devices as described above, to the creation of a new array restored from backup - which must exist as a fallback in any case, as RAID provides no archival capability.

Since you are asking about a home server, the higher levels of RAID redundancy are neither cost effective nor justifiable. Relying solely on backup means higher risk and longer recovery. Segmenting the array involves no additional cost, but comes with a higher management overhead. Whether that's justifiable in your environment is your call. However, if you have 6TB of data you are keeping online, that exceeds the typical configuration and justifies the additional diligence in my mind.

complich8 · 11-09-2007, 03:50 PM

Quote:

Originally Posted by macemoneta

This gets more into data policy management than data management, but if you have data of differing importance, you set up volume groups for the priority your policy defines to reflect the return to redundancy time.

For example,

vgcreate critical /dev/md0
vgcreate important /dev/md1 /dev/md2
vgcreate lowpriority /dev/md3

But that's not what you suggested at all. You said "make a bunch of small partitions, then throw them all in one VG called raidpool and make an lv on top of them". My claim is that in such a setup, segmenting using arrays of partitions nets you zero gain. Changing the problem's conditions changes the question.

Quote:

When using md devices for PVs in LVM, LVM will not see a RAID failure - that occurs at a different level. The only time the PV would be impacted is when a multi-drive failure occurs in a RAID1 or RAID5 (i.e., the data is no longer accessible).

This is true whether you have one md in the vg, or a dozen. Either way, if you have a raid5 and you lose a second disk in it while you're recovering from the first disk failure, you lose data. And if one of the md's in your near-capacity lv is hosed, your lv is also hosed -- at least inasmuch as you lose data from it.

Quote:

Relying solely on backup means higher risk and longer recovery.

Home users don't make backups. Facts of life. Especially not for big chunks of data like video files.

Quote:

Segmenting the array involves no additional cost, but comes with a higher management overhead. Whether that's justifiable in your environment is your call.

But again, my claim is that segmenting the array using the schema you specified, only to dump it all together into a single logical volume gains you nothing over not segmenting it at all. You're redefining the problem to fit your answer and redefining your answer to fit the redefined problem, but my initial claim still stands -- if you're going to make it into a single logical volume and your goal is to resist disk failures, then you gain nothing in the direction of your goal by partitioning your disks into tiny slices and making raids of the slices. Because if a partition fails, the whole disk has failed.

Quote:

However, if you have 6TB of data you are keeping online, that exceeds the typical configuration and justifies the additional diligence in my mind.

You may need to update your definition of "typical", given how commonplace it's becoming for people to build HD PVRs. Capping using mpeg2 (for realtime encoding capability), you're looking at something like 80 Mbps of data ... which is like 36GB/hr. At that (excessive) rate, you'd only be able to cap about 140 hours of 1080i tv on a 5tb array. For a more realistic example, my roommate's Heroes hd captures come in at about 18 gigs per episode -- which means roughly 280 episodes in 5 tb at the bitrate he's capturing in. Which is a fairly reasonable expectation for a PVR device, especially if you're retaining files for a long time or buffering for a binge-watch at a later date.
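
The bitrate arithmetic, as a sketch in integer shell math using decimal units (1 GB = 1000 MB); the function names are illustrative.

```shell
#!/bin/sh
# Capture bitrate in Mbps -> storage consumed in GB per hour.
gb_per_hour() (
    mbps=$1
    # megabits/s * 3600 s/h / 8 bits per byte / 1000 MB per GB
    echo $((mbps * 3600 / 8 / 1000))
)

# Whole hours of recording that fit in a given capacity.
recording_hours() (
    capacity_gb=$1; rate_gb_per_hour=$2
    echo $((capacity_gb / rate_gb_per_hour))
)

rate=$(gb_per_hour 80)
echo "80 Mbps is about $rate GB/hr"
echo "hours on a 5 TB array: $(recording_hours 5000 "$rate")"
```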

macemoneta · 11-09-2007, 04:05 PM

I'm trying to answer questions as they arise in as general a fashion as possible. No one can create a configuration without knowing everything the original poster knows.

Is the 6TB a porn collection? A collection of ripped CDs/DVDs? TV videos? Irreplaceable digital photos, home videos, doctoral thesis and supporting data? Without knowing what is going to be stored and the subjective importance assigned, a specific recommendation is meaningless.

Not backing up 6TB of data means that the data has little or no value. That's fine if it's a news server where the data will repopulate, but I wouldn't want to spend a chunk of my life recreating a doctoral thesis or say goodbye to the only record of my children.

You don't seem to be adding to the discussion, just playing devil's advocate to no specific purpose. Please stop.

complich8 · 11-09-2007, 04:38 PM

Quote:

Originally Posted by macemoneta

You don't seem to be adding to the discussion, just playing devil's advocate to no specific purpose. Please stop.

Nope. Just saying that your initial, glib suggestion is, in fact, wrong and didn't answer the original poster's question. And your subsequent answer is similarly flawed. In neither case did you address the actual question. Ever.

I did, however, answer the question. Back there in post number 3, with several examples.

The rest of this discussion is pointless and entirely off-topic. Admittedly, this is my fault, because I engaged you thinking you knew something about an LVM feature that I hadn't heard about, rather than just ignoring your bogus response and answering the question directly.