Linux - Server: This forum is for the discussion of Linux software used in a server-related context.
I need to implement a filesystem with the following criteria:
1) Needs to grow (over time) to 100 TB.
2) Needs to grow/shrink dynamically while the filesystem remains online/active.
The reason for #1 is an application limitation: it stores all its data in a single location and doesn't allow me to scale out to multiple directories (or filesystems).
The reason for #2 is that the 100 TB will take a while to collect/implement, meaning I will be adding/removing volumes over time as this grows. I'm expecting to have to shrink in order to replace physical volumes after failures, and as I move to larger drives over time I will want to remove the smaller ones.
Thanks, but a networked filesystem won't provide the throughput we are aiming for.
Currently, I have 20 TB of SSD, which is giving me approximately 2 GB/s of throughput.
We are hoping to implement the physical layer using 10K SAS drives, with enough of them to create RAID-6 groups and stripe across the groups to gain performance on spinning media. Going to 100 TB of SSD is just not in the budget, so it will likely be a mix of SSD and 10K SAS drives. We're planning on using CentOS's built-in MD/LVM to manage the physical volumes and to allow devices to be moved as we upgrade performance over time.
This is for a computational system, an HP DL980 with 64 physical cores and 512 GB RAM.
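As a rough sketch of that MD/LVM layout (device names, group sizes, and stripe parameters below are all hypothetical, not from the actual build):

```shell
# Two RAID-6 groups built from 10K SAS drives (hypothetical device names)
mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/sd[b-g]
mdadm --create /dev/md1 --level=6 --raid-devices=6 /dev/sd[h-m]

# Turn the groups into LVM physical volumes and stripe a
# logical volume across both for throughput
pvcreate /dev/md0 /dev/md1
vgcreate vg_sim /dev/md0 /dev/md1
lvcreate -n lv_data -i 2 -I 512 -l 100%FREE vg_sim   # -i 2 = stripe across 2 PVs
mkfs.xfs /dev/vg_sim/lv_data
```

One design note: striping the LV across PVs buys sequential throughput, but it also means later extent migrations need a destination with a matching stripe layout.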
I'd suggest LVM with whatever filesystem you prefer. XFS has advantages, but most others can be grown and shrunk on a logical volume (I know ext4 & reiser can be), and you can always add additional drives to the logical volume as you need more space.
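For example, growing (and, for ext4, shrinking) a filesystem on a logical volume looks roughly like this (VG/LV names and mount point are hypothetical; `-r` tells LVM to resize the filesystem along with the LV):

```shell
# Grow the LV and the filesystem together; online for both ext4 and XFS
lvextend -r -L +10T /dev/vg_sim/lv_data

# Shrink (ext4 only, and only while unmounted; XFS cannot shrink)
umount /data
lvreduce -r -L -5T /dev/vg_sim/lv_data
mount /dev/vg_sim/lv_data /data
```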
Last edited by Timothy Miller; 01-04-2016 at 03:06 PM.
There must be some trick with ext4, because it caps out at 16 TB. I think some 64-bit tools are needed, but I'm not sure how to go about that.
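For what it's worth, the 16 TB cap applies to ext4 filesystems created without the 64bit feature; with e2fsprogs 1.42 or later you can create past it and then grow online (device name below is hypothetical):

```shell
# Create with the 64bit feature so the 16 TB limit does not apply
mkfs.ext4 -O 64bit /dev/vg_sim/lv_data

# After extending the underlying LV, grow the filesystem while mounted
resize2fs /dev/vg_sim/lv_data
```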
Is XFS allowed to both grow and shrink while online?
I always use XFS on my big arrays; it works well. I don't believe it can be shrunk, though. However, why do you need to shrink? You could just build the filesystem at full size and let it run. You said it's only going to be in service a few months, correct?
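That matches my understanding: XFS grows online but has no shrink support at all. The grow step itself is simple (LV name and mount point hypothetical):

```shell
# Extend the underlying LV, then grow XFS while it stays mounted
lvextend -L +10T /dev/vg_sim/lv_data
xfs_growfs /data        # operates on the mount point
```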
Did I read this correctly that all files will be stored in one directory? Many filesystems become very slow with large numbers of files in one directory as well. Something else to look at.
The program is a mathematical simulator (N-body) that wasn't designed to run at numbers this large; that's why it wasn't designed to split up its data. I don't have the source, so redoing the internals isn't an option. It stores mostly large files, so there won't be that many of them; they're just large because we are trying to scale the simulator well past what its initial design was for. I expect a handful of files, each multiple TB in size.
As for shrinking: we expect to have to remove existing LUNs and replace them over time, as the expected runtime of the simulation is well over a year, possibly two. Naturally, I could kill and restart the simulator and it would pick up where it left off, since it stores the intermediate results at each step. However, since the steps themselves will be pretty long at this size, we want to be able to grow and shrink while online, so that we can replace smaller drives with larger drives over time, and move from HDD to SSD as budget allows to improve speed/performance.
With ZFS you can have, for example, mirror pairs. When you need more space you replace one disk with a larger one, wait for the resilver, then replace the second, and the pool has automatically grown. I believe you can take pairs out as well, but I'm not 100% sure there.
You can of course do the same with groups of RAIDZ2 disks as well instead of mirror pairs ... as you wish.
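A sketch of that replace-and-grow cycle (pool and device names here are hypothetical):

```shell
# Let the pool grow automatically once every disk in a vdev is larger
zpool set autoexpand=on tank

# Replace one side of a mirror with a larger disk, wait for resilver
zpool replace tank sdb sdn
zpool status tank            # wait until resilvering completes

# Then replace the other side; the mirror vdev expands on its own
zpool replace tank sdc sdo
```

One caveat on taking pairs out: as far as I know, removing a whole top-level vdev (an entire mirror pair) is not something ZFS on Linux supported at the time, so the vdev layout should be planned up front.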
With any hardware RAID you can hot-swap drives while online, and as long as it's a redundant style (anything but RAID 0) you can do so without shrinking the array. Pull out the bad drive, put in the new one, rebuild, move to the next. Once all drives are larger, the array can be grown, and then you can expand the filesystem to match, though I don't know about growing it while online.
All that said, I still don't understand why you can't just build the array once to meet your size and speed requirements and then leave it alone. Why do you need to keep futzing with it during the run? You can easily hit that size/speed today; there's no need to compromise. With a 24-drive array you'll probably only lose one drive in the next 18 months, hardly enough to worry about. Just buy a couple of spares and swap them in as needed.

Besides, how many developments do you honestly expect in the HDD/SSD world over the next 18 months that would warrant replacing/rebuilding the entire array? I expect the largest HDDs to grow from 8 TB to 10 TB, and 2 TB 2.5" SSDs to drop slightly in price. Neither is a game changer for your application. Development in the storage world is slowing down significantly: HDDs are pretty much stagnant, and SSDs are getting faster but not a lot larger. 18 months won't bring any big changes. If you can't hit your speed/size/budget requirements today, you won't be able to hit them in 18 months either; you'll just complicate matters trying to screw with it while it's running.
Ignore shrinking/growing while online, just focus on building a reliable, fast array that can do what you need it to do for the entire project, and then keep an eye on maintenance.
Just my $.02, from somebody who has been doing this (or something very similar to it) professionally for the last decade.
Last edited by suicidaleggroll; 01-05-2016 at 08:58 AM.
Mainly the reason is money. We can't wait to build the full 100 TB before starting. Right now, I only have 20 TB available. Because of the very long runtime, we want to get started now rather than waiting 6-9 months until we can get all 100 TB online in its final form.
The storage will be evolving from cheap/crappy disks to enterprise SSD, but we can't do that all at once. We need to get started soon.
As for the RAID, I'm not sure how you can do that (changing the number/sizes of disks) on a live RAID group. I know I can replace a drive with an identical one, but that's not the goal; the goal is to both enlarge and upgrade as we move forward. When we start, we have some 6 TB WD Red drives, but we will be moving to 2 TB SSDs and 2.5" 10K SAS drives. The storage needs to evolve in both size and shape over time. I'm not aware that you can do anything other than "replace" an existing drive in a RAID group. We were expecting to create new RAID groups and remove old ones as they age/fail, rather than replacing the older drives with identical ones.
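With MD/LVM, that retire-a-whole-group workflow is possible online via pvmove. A sketch (device and VG names are hypothetical, and note that moving a striped LV requires a destination with enough matching free extents):

```shell
# Build the new RAID group and add it to the volume group
pvcreate /dev/md2
vgextend vg_sim /dev/md2

# Migrate all extents off the old group while the filesystem stays mounted
pvmove /dev/md0 /dev/md2

# Retire the old group
vgreduce vg_sim /dev/md0
pvremove /dev/md0
```

After the vgreduce, the old /dev/md0 array can be stopped and its drives pulled, all without unmounting the filesystem.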
You can certainly swap larger drives into a RAID. You won't gain anything immediately, but once all of the drives have been swapped, the RAID volume can be expanded to the new size, followed by the partition, followed by the filesystem. You can also add drives to the array on the fly, depending on the array type: http://ask.adaptec.com/app/answers/d...axview-storage
I've never done this, and I would be seriously hesitant if the data on the drives wasn't backed up, but apparently it's not difficult.
This whole approach screams "bad idea" to me, though. You're planning on continually expanding, rebuilding, modifying, swapping, and expanding again an array that is live, in use, under heavy I/O, and has no backup. Are you sure you can't afford the ~$11k it would take to build a 120 TB system off the bat (plus CPU/memory, but I don't know your requirements there)? If you can't afford that for another 6-9 months, you might reconsider what you're doing here and think about scaling back. I mean, how can you expect to build 100 TB out of 2 TB SSDs and 10K SAS drives any time even remotely soon on that kind of budget?
Last edited by suicidaleggroll; 01-05-2016 at 10:17 AM.