LinuxQuestions.org (/questions/)
-   Linux - Server (https://www.linuxquestions.org/questions/linux-server-73/)
-   -   Storage approach for large project (https://www.linuxquestions.org/questions/linux-server-73/storage-approach-for-large-project-4175493822/)

grob115 02-05-2014 11:28 AM

Storage approach for large project
 
Hi, I'm planning a project to provide e-mail and photo upload features; the start-up requirement is about 50TB of storage, growing to about 150TB within roughly 6 months.

Additional requirements include the following:
1) Ability to access the same storage from multiple web servers for load balancing on the web front. I guess I really need a file-level storage solution rather than a block-level one, so iSCSI is out because only one server can safely mount a given iSCSI target at a time (unless you layer a cluster filesystem on top).
2) Ability to grow a customer's quota at a much later time. So maybe a customer is allocated 1GB of storage today and a year from now I need to be able to expand that to 5GB for him. The implication is the data will likely end up spread across different storage nodes. I guess I'm referring to thin provisioning here (see the sketch below).
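
For illustration, this is roughly what thin provisioning looks like at the block layer with LVM (a minimal sketch with hypothetical volume group and customer names; whether the quota ends up being enforced here or at the filesystem/application layer is a separate decision):
Code:

# Create a 10TB thin pool in an existing volume group "vg0" (hypothetical name)
lvcreate -L 10T --thinpool custpool vg0
# Hand a customer a 1GB thin volume today...
lvcreate -V 1G --thin -n cust0001 vg0/custpool
# ...and grow the same volume to 5GB a year later (then grow the filesystem on it)
lvextend -L 5G /dev/vg0/cust0001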

Have looked at the following 3 options. Can someone provide me with some comments on which route I should take? Thanks.

NAS
Something like a 12x4TB unit running RAID6 would provide about 40TB of usable space. I think this is going to give the densest TB per U on the rack and is possibly the simplest option to start with, but I'm not sure it is a sensible long-term solution.

Hadoop
I know nothing about this technology other than that I'm supposed to be able to keep adding server boxes to expand. But the complexity of setting it up and the need for 10Gbps switches is a bit too much to invest in right off the bat.

OpenStack Storage
Info from the following appears to paint a similar picture to Hadoop (i.e. no need for RAID, and you expand by plugging in additional boxes).
https://www.openstack.org/software/openstack-storage/

Habitual 02-05-2014 01:44 PM

I have used glusterfs and it wasn't painful. :)

grob115 02-05-2014 10:12 PM

Habitual, thanks. I'm going to read up on GlusterFS. Can you give some background on what other storage solutions you have used and what made you end up choosing GlusterFS? I've also come to realize that Windows Storage Server 2012 now supports thin provisioning and data de-duplication (space reclaim). So now we have the following list to choose from:
NAS
Hadoop
OpenStack Storage
GlusterFS
Windows Storage Server

Does anyone have experience in this space who can comment further?

Habitual 02-06-2014 09:12 AM

Quote:

Originally Posted by grob115 (Post 5112515)
Habitual, thanks. I'm going to read up on GlusterFS. Can you give some background on what other storage solutions you have used and what made you end up choosing GlusterFS?

Well, being a Cloud Hosting Provider and in the Managed Servers business, storage is not a problem for us and we have a few solutions. I installed GlusterFS just to become acquainted with its implementation of a distributed filesystem.

All the assets that we manage are Virtual.
"NAS" - everything is a "NAS" these days, it's a rather generic term, IMO for large scale storage solutions.

Our next foray into your listed systems is OpenStack, but I believe we are going to have RackSpace roll it out for us and leave us to the administration end of it.
Windows? What's that?
We have a few Windows clients, most using AWS and their needs are minimal, so we utilize S3 storage (backups and snapshots) for those requirements.

GlusterFS fits your requirement here easily:
Quote:

Originally Posted by grob115 (Post 5112153)
So maybe a customer is allocated 1GB of storage today and a year from now I need to be able to expand to 5GB for him.
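
For example, GlusterFS directory quotas can be raised in place (a minimal sketch; the volume name "myvol" and the per-customer directory layout are hypothetical):
Code:

# Enable quota on the volume, cap a customer's directory at 1GB,
# then raise the same cap to 5GB later without moving any data
gluster volume quota myvol enable
gluster volume quota myvol limit-usage /customers/cust0001 1GB
gluster volume quota myvol limit-usage /customers/cust0001 5GB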

Here are some references for OpenStack, if you don't have them already:
http://docs.openstack.org/trunk/open..._decision.html
http://docs.openstack.org/trunk/open...hitecture.html

I hope that's helpful.

grob115 02-09-2014 04:44 AM

Okay I have been doing a lot of homework. Here's what I have come up with.
NAS
Hardware: QNAP TS-EC1679U-RP with 16GB RAM and E3-1225 v2 3.2 GHz processor.
Price: $11,063.00 for 16 x 4TB Hitachi Ultrastar Enterprise Edition disks.
Capacity: 64TB in a 3U space (i.e. 21TB/U), expandable up to 576TB in 27U with additional 16-bay units.

Pros
1) Mountable as iSCSI targets.
2) Running in RAID 6 mode across all 16 drives means I'm only using 12.5% of the storage for parity. Also, the data survives the failure of any 2 drives.

Cons
1) As I add more units, the read / write performance doesn't increase to cope with the additional data.
2) Adding expansion units consumes rack space without adding any computational resources.

GlusterFS
Hardware: PowerEdge R720xd with 16GB RAM and E5-2603 v2 1.80GHz (note: a newer CPU, but lower clocked).
Price: $10,954.76 for 2 * 300GB 10K RPM SAS + 12 * 4TB 7.2K RPM SATA + 2*10Gbps NIC
Capacity: 48TB in a 2U space (i.e. 24TB/U), with room to keep scaling capacity, performance, and computational resources by adding nodes.

Pros
1) Room to keep scaling capacity, performance, and computational resources by adding nodes.
2) Slightly more space-efficient than the QNAP NAS route.
3) No headache managing RAID or having to ensure disk models are compatible with the NAS.


For a model such as the PowerEdge R720xd, I could technically build a Distributed-Replicated volume out of the 12 disks inside a single unit to start, because I won't have more than 48TB of data initially, so it doesn't make sense to buy several R720xd boxes at the beginning. So I'd imagine I need to do something like the following?

Create 3 volumes with LVM. Here are some example commands for the first volume; do the same for the second and third volumes with the remaining disks. Note I'm a bit rusty on this, so let me know if the commands are wrong.
Code:

# Initialise the four raw disks as LVM physical volumes (no need to partition
# or pre-format them; the filesystem goes on the logical volume, not the raw disks)
pvcreate /dev/sda /dev/sdb /dev/sdc /dev/sdd

# Group the four disks into one volume group
vgcreate vg1 /dev/sda /dev/sdb /dev/sdc /dev/sdd

# Carve one logical volume spanning the whole group to use as the Gluster brick
lvcreate -l 100%FREE -n brick1 vg1

# Put a filesystem on the logical volume and mount it as the brick
mkfs.ext4 /dev/vg1/brick1
mkdir -p /export/brick1
mount /dev/vg1/brick1 /export/brick1

Create the GlusterFS volume from the 3 bricks, following a guide here.
Code:

[root@server1 ~]# gluster volume create test-volume replica 2 server1:/export/brick1 server1:/export/brick2 server1:/export/brick3
Multiple bricks of a replicate volume are present on the same server. This setup is not optimal.
Do you still want to continue creating the volume?  (y/n) y
Creation of volume test-volume has been successful. Please start the volume to access data.
[root@server1 ~]# gluster volume start test-volume
Starting volume test-volume has been successful
[root@client ~]# mount.glusterfs server1:/test-volume /mnt
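
To cover the multiple-web-server requirement, each web server would then mount the same volume with the native FUSE client; a persistent mount might look like this (a sketch, with /var/www/uploads as a hypothetical mount point):
Code:

# /etc/fstab entry on each web server (needs the glusterfs-fuse client installed)
server1:/test-volume  /var/www/uploads  glusterfs  defaults,_netdev  0  0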

Then, when I expand by adding another PowerEdge R720xd, I need to do the following, after creating another 3 bricks on the second box.
Code:

[root@server1 ~]# gluster peer probe server2
[root@server1 ~]# gluster volume add-brick test-volume server2:/export/brick1 server2:/export/brick2 server2:/export/brick3
[root@server1 ~]# gluster volume rebalance test-volume start
Starting rebalance on volume test-volume has been successful
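
After the add-brick and rebalance, it's worth checking that the brick list and layout look right (same volume name as above assumed):
Code:

[root@server1 ~]# gluster volume rebalance test-volume status
[root@server1 ~]# gluster volume info test-volume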

Questions
1) With a RAID6 setup it doesn't matter which specific 2 disks fail. But with GlusterFS and replica=2, each piece of data is stored in exactly 2 locations (say server1:/dev/sda and server2:/dev/sdd). If we happen to be unlucky enough to lose both server1:/dev/sda and server2:/dev/sdd, data loss will occur?
2) With RAID6 I get 10 disks' worth of usable capacity from a 12-disk box, but with GlusterFS and replica=2 I get only 6 disks' worth from the same 12-disk box? If so, this appears to be quite inefficient.
3) Is there a limit to the number of bricks I can add to a GlusterFS volume?

voleg 02-09-2014 06:29 AM

Check out DDN Object Storage. It is probably what you are looking for, without reinventing the wheel.

grob115 02-09-2014 10:59 AM

Quote:

Originally Posted by voleg (Post 5114380)
Check out DDN Object Storage. It is probably what you are looking for, without reinventing the wheel.

Okay, but somehow I have a feeling that this is going to be a lot more expensive than an $11,063.00 64TB QNAP NAS or a $10,954.76 48TB PowerEdge R720xd. I'll be pleased to be told that this is not true, however.

Whenever I look at how Hadoop, GlusterFS, and OpenStack work, it appears they achieve redundancy by literally replicating data instead of computing parity. Take Hadoop's default replica count of 3 as an example: this means that if you have a 100PB setup, you are really only able to store about 33PB of data?! I don't understand how companies like LinkedIn, Facebook, etc. use this kind of setup and waste so much space. But then again, I'm not sure I have understood this right.
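
For what it's worth, the raw arithmetic does come out roughly like that (a back-of-the-envelope sketch using the 100PB figure above and a hypothetical 16-disk RAID6 group for comparison):
Code:

# Usable capacity out of 100PB raw
raw=100
echo "3-way replication: $(( raw / 3 )) PB usable"            # ~33 PB
echo "RAID6, 16-disk groups: $(( raw * 14 / 16 )) PB usable"  # ~87 PB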

jpollard 02-09-2014 12:50 PM

Most of the large sites (well, supercomputer centers with large storage) use tape to provide archives - usually stored redundantly (two or even three copies).

Gluster is used for speed - it provides very high read/write capacity. In the setup I worked with, users moved files to a fairly large secondary storage tier (150-200TB) for backup; the data was then archived to tape (700TB+) automatically by an HSM and the disk space released for re-use. This provides long-term storage, but still fairly quick access to any file (normally 10-15 seconds to retrieve a specific file, though it does slow down if a LOT of historical data is retrieved at once).

The gain with the HSM is that you don't have to waste money on high-speed storage - most of the data is historical and only retrieved occasionally. In the cases where I was working, this was weather data. When the scientists were validating new or updated models, they would re-run predictions, check them against historical data for correctness, and continue to the present (and then make new predictions to be checked when reality showed up).

The problem with multi-TB disks and RAID5/6 is that the time it takes to rebuild a failed drive can be longer than the MTBF to the next failure. It can take several days to a week to rebuild a 20TB RAID array. At least with RAID6 you can have two disks fail before it becomes catastrophic. The last NAS I used with RAID6 only had 150GB disks, and recovery of one disk (while things stayed online) would take two days. So we used multiple stripes of RAID6 - a single filesystem would be over 16TB, but each RAID group was only about 4TB.
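
As a rough illustration of why rebuild windows get scary with today's drive sizes (a back-of-the-envelope sketch; the 4TB size and ~100MB/s sustained rebuild rate are assumptions, and real rebuilds under production load are slower):
Code:

# Best-case time to rewrite one 4TB drive at ~100MB/s sustained
echo $(( 4 * 10**12 / (100 * 10**6) / 3600 )) hours    # ~11 hours, before any real-world load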

The HSM approach is to mirror the data. Tapes are relatively cheap and have better long-term retention, so mirroring worked better. RAID configurations were tried, but restoring data meant everything had to be available at the same time - it was better to fragment the file and store each fragment on two or more different tapes. Getting the file back then only required reading each fragment once to reassemble it.

One fairly well known HSM user is the Internet Archive (https://archive.org/). Data that has been recently archived is available immediately. But if you ask for some really old archives, it may take a while (disk space has to be allocated, and doing that may require archiving more recent data to tape first). Once retrieved, though, the next access is quick.

btmiller 02-09-2014 02:16 PM

You might want to take a look at ZFS as well. Depending on the performance requirements, a few disk enclosures with SSDs in front of them for read cache (ZFS calls this the L2ARC) and for the ZIL might be a good architecture to consider. If you're looking for a fully supported solution, Nexenta has a product that can be deployed on commodity hardware, which is highly cost effective. The drawbacks are similar to the QNAP solution you mention above: you only have one head in front of the storage, so it limits you in terms of performance. However, if your IOPS requirements aren't too crazy, it could be a way to go.
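
For illustration, a pool along those lines might be laid out like this (a minimal sketch with made-up device names: one raidz2 data vdev, an SSD as L2ARC read cache, and a mirrored SSD pair as the log device for the ZIL):
Code:

# Hypothetical devices: sdb-sdg data disks, sdh cache SSD, sdi/sdj log SSDs
zpool create tank raidz2 sdb sdc sdd sde sdf sdg \
    cache sdh \
    log mirror sdi sdj
# lz4 compression is cheap and usually a win for this kind of workload
zfs set compression=lz4 tank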

grob115 02-09-2014 06:43 PM

jpollard, thanks, but tape won't do because I need all the data to be online at the same time.
btmiller, thanks for the referral. Any idea what Nexenta costs?

Also, I always hear people talk about NetApp. What is special about them? Are they as economical as the QNAPs?

jpollard 02-09-2014 10:18 PM

The problem with all of these storage devices is cost. A NetApp of the size you specify will cost a lot, because you won't be able to use just one... you would be using a cluster of between 3 and 5 of their largest units just for the primary storage. The nice part of NetApp is that you can start small and add units.

And don't discount tape. You still need backup. Even if you use a cluster, you would need at least two such clusters to provide a backup... Usually a cluster provides its own redundant storage, so if you need 3 units for the capacity, you will need an additional 3 for the backup; if you need 5 for the primary, you will need an additional 5 for the backup. And if you need it all online, you will have to have at least two data centers, on different power grids...

Also, don't overlook the UPS required. You cannot depend on commercial power directly - that would be a sure way to make the disks age much faster than normal, and it also causes multiple simultaneous failures.

And it is MOST unusual to require ALL data to be online ALL the time (especially with images). That usually only applies to database index files... HSM makes the data available - usually referred to as "near line" - because you don't do anything special to the filesystems using it. Just access the file normally; the system automatically takes care of locating it, retrieving it, and making it available.

Tape is far cheaper than purely disk-based storage units. It serves as a backup as well, and with proper tuning it is almost as fast as disk.

Another useful point is that the power required is a lot less than half that required for disks. The only things requiring power in an HSM are the robot and the tape transports. In the systems I've worked with, a single silo held 9,000 tapes, of which only the tape transports needed power. That would roughly correspond to your "online" version with 9,000 disk drives, ALL of which would have to be powered. In addition, the tapes only take wear and tear during actual I/O operations - the disks take wear and tear all the time, whether the data is being accessed or not.

If you are looking at commercial units, check with Oracle (or maybe http://www.sam-fs.com/ - though I'm not familiar with how the Oracle/Sun merger handles things) - SAM-FS/SAM-QFS handles huge disk storage and manages automatic tape support as well.

grob115 02-10-2014 10:31 AM

@jpollard I am actually already doing backups to a different geographical zone, but yes, you are correct about the costs of backups. I'm not resistant to tape; I just have no experience with it and I'm not sure whether an HSM solution would fit.

Guys, I found an interesting read on how Facebook does it.
https://www.facebook.com/note.php?note_id=76191543919

Note they are also using RAID6 within each box.

jpollard 02-10-2014 03:04 PM

Quote:

Originally Posted by grob115 (Post 5115004)
@jpollard I am actually already doing backups to a different geographical zone, but yes, you are correct about the costs of backups. I'm not resistant to tape; I just have no experience with it and I'm not sure whether an HSM solution would fit.

Guys, I found an interesting read on how Facebook does it.
https://www.facebook.com/note.php?note_id=76191543919

Note they are also using RAID6 within each box.

That LOOKS like an HSM, though it uses rather odd terminology. Copying non-deleted data to reclaim storage is normal - it minimizes the activity on the data, and it is usually done when space starts to run out or on a schedule. I do know XFS supports an HSM underneath (when I saw it, it was called CXFS when the HSM hooks were active); I don't know if they are using that though (I kind of doubt it - they have a tendency toward NIH syndrome).

grob115 02-19-2014 10:44 AM

Hi, a couple of good reads... I like the first one best, and it appears FreeNAS is the one I should look at in more depth.
http://catn.com/2012/05/11/openindia...or-vs-freenas/
http://www.overclock.net/t/1307866/f...indiana-nappit
http://forums.overclockers.com.au/sh....php?t=1078674
http://storagebod.typepad.com/storag...ud-part-2.html

I also came across the concept of LTFS, which lets you access files archived to tape as if they were files on disk.
jpollard, is this what you have been trying to tell me? I tried looking online for the cost of a solution like this but found no pricing information anywhere. If I want to store 1PB of data on tape, what kind of costs would I be looking at?

jpollard 02-19-2014 02:53 PM

LTFS is just a way to store files on a tape. It does provide a directory, but access is always at tape speed, and is serial.

HSM uses disk as cache storage for data that resides (or will reside) on tape. Access time for HSM depends on whether the data is already on disk - if it is, access time is about the same as disk. If the data resides only on tape, then the access time is the time it takes to get the data back onto disk.

LTO-5 tapes are around $30 or less and hold 1.5TB per tape natively (up to 3TB compressed). A transport is around $2,700. I do not know what the librarian expenses are for a vault. A PB would be a little under 1000 LTO-5 tapes, or about $30,000 for the tapes; 4 transports would add around $11,000. Dell has a product, but for HSM use you would have to check what software/filesystem is available. IBM and SGI (Rackable) have commercial units, and there are others.

The silos I'm familiar with were StorageTek (bought by Sun, then by Oracle), so I'm not sure what they are offering now, nor the current prices. An STK silo had room for 4-8 transports and 7,000-9,000 tapes (with the option to chain silos together, up to a maximum of 8 silos). These are the large silos, roughly the size of a 12'x12'x8' room. Once the tapes are in the silo, almost no human interaction is required (sometimes the robot arm drops a tape though, so an operator does have to be able to go in and pick it up - and usually call for service at the same time...).

Right now, for commodity offerings, I'd suggest checking with Dell and some other vendors about HSM solutions. I believe even NetApp has some HSM solutions.

