LinuxQuestions.org
ROCK This forum is for the discussion of ROCK Linux.

Old 12-18-2014, 04:03 PM   #1
AlexGg
LQ Newbie
 
Registered: Dec 2014
Posts: 8

Rep: Reputation: Disabled
How to use compute-nodes as NAS?


Does anyone have a clue how to use all compute nodes as a NAS with a kind of virtual "shared" disk? I saw that the menu (after insert-ethers) has a NAS Appliance option. Does anyone use that?
Also, I heard that Hadoop is a great candidate for Rocks. Has anyone played with it?
Lastly, Lustre looks interesting for a shared FS too.
Any information that helps would be appreciated.
 
Old 12-20-2014, 09:49 PM   #2
ttk
Member
 
Registered: May 2012
Location: Sebastopol, CA
Distribution: Slackware64
Posts: 965
Blog Entries: 27

Rep: Reputation: 1354
I like using GlusterFS for Linux NAS. It's free, trivial to get working on Linux, performs better than NFS, and allows for very flexible distributed data storage.

http://www.gluster.org/

No need for special kernel support or dedicated filesystems. Just create subdirectories anywhere you want data to be stored, tell gluster those subdirectories are "bricks", turn one or more bricks into a "volume", and mount the volume (via the native FUSE client, or re-exported over NFS or CIFS) on all of your compute nodes.
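A minimal sketch of that workflow, assuming two nodes named node1 and node2 (the hostnames, brick path, and volume name are hypothetical; exact syntax can vary by GlusterFS release):

```shell
# On each node, make a directory to serve as a brick:
mkdir -p /data/brick1

# From node1, join node2 into the trusted storage pool:
gluster peer probe node2

# Create a volume from the two bricks and start it:
gluster volume create myvol node1:/data/brick1 node2:/data/brick1
gluster volume start myvol

# On each compute node, mount the volume (native FUSE client shown):
mount -t glusterfs node1:/myvol /mnt/myvol
```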
 
Old 12-21-2014, 10:41 PM   #3
AlexGg
LQ Newbie
 
Registered: Dec 2014
Posts: 8

Original Poster
Rep: Reputation: Disabled
Thank you!

How does it use each node's HDD?
 
Old 12-22-2014, 01:58 PM   #4
ttk
Member
 
Registered: May 2012
Location: Sebastopol, CA
Distribution: Slackware64
Posts: 965
Blog Entries: 27

Rep: Reputation: 1354
It depends on the kind of volume you make from the bricks. The upcoming release will support Reed-Solomon erasure coding, but in the meantime you are limited to "mirrors" (like RAID1), "stripes" (like RAID0), and "stripes of mirrors" (like RAID10).

So if you have four compute nodes, A, B, C, and D, and on each of them you mkdir /data and tell gluster to make /data a "brick", you will have four data bricks: A-/data, B-/data, C-/data, and D-/data.

If you create a stripe volume "/space1" from all four, then some data will be written to A-/data, other data to B-/data, other data to C-/data, and yet other data to D-/data. Thus, if you write 100MB to file /space1/foo, then about 25MB of "foo" will be written to each brick. Reading "foo" will read different data from each brick concurrently. If you lose one brick, though, you will lose the entire volume.
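A striped volume like that could be created along these lines, using the hypothetical A-D bricks from above:

```shell
# Four-way stripe: each file is split across all four bricks.
gluster volume create space1 stripe 4 A:/data B:/data C:/data D:/data
gluster volume start space1
```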

If you create a mirror volume "/space2" from all four bricks, then the same data will be written to all four bricks. This makes for slow writing, but fast reading, and you can lose any number of bricks and still not lose your volume, as long as you have at least one.
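The mirrored variant is the same command with replica instead of stripe (again using the hypothetical bricks from above):

```shell
# Four-way mirror: every file is written in full to all four bricks.
gluster volume create space2 replica 4 A:/data B:/data C:/data D:/data
gluster volume start space2
```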

If you create a stripe-of-mirrors volume "/space3", where (A-/data, B-/data) is one mirror and (C-/data, D-/data) is the other, then some data will be written to A and B, while other data will be written to C and D. In this case, when you write 100MB to file /space3/foo, 50MB of foo will be written to A-/data, the same 50MB will be written to B-/data, and the other 50MB will be written to both C-/data and D-/data. In this way reads and writes are fairly fast, and you can lose either A or B, and either C or D, without losing the volume.
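In GlusterFS terms this is a striped-replicated volume. With replica 2, bricks are grouped into mirror pairs in the order listed, so (A, B) and (C, D) become the two mirrors (hypothetical names again):

```shell
# Two-way stripe over two two-way mirrors, RAID10-style:
gluster volume create space3 stripe 2 replica 2 A:/data B:/data C:/data D:/data
gluster volume start space3
```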

Also, if new nodes come online and you want to create bricks on them and add them to your volume, GlusterFS allows for this. If, for instance, you have a 21-brick volume organized as seven three-brick mirrors in a seven-mirror stripe, you could add another three-brick mirror to make an eight-mirror stripe, and then tell GlusterFS to "rebalance" the stripe (which may impact performance, but does not require the volume to be taken offline). In this way you can increase the amount of space available in the volume over time.
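The grow-and-rebalance step described above might look like this, assuming a replica-3 volume named bigvol and three new hypothetical nodes E, F, and G:

```shell
# Add one more three-brick mirror to the volume, then spread existing
# data across the enlarged stripe while the volume stays online:
gluster volume add-brick bigvol E:/data F:/data G:/data
gluster volume rebalance bigvol start
gluster volume rebalance bigvol status
```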

Each brick's data is stored in the directory it was made from (/data, /var/brick, or wherever you like).

The next release of GlusterFS will support Reed-Solomon redundancy, for RAID5-like and RAID6-like organizations of bricks into volumes. I haven't tried it yet, but I look forward to doing so.
 
1 member found this post helpful.
Old 12-22-2014, 02:06 PM   #5
AlexGg
LQ Newbie
 
Registered: Dec 2014
Posts: 8

Original Poster
Rep: Reputation: Disabled
Wow, great. Thank you for the detailed explanation; I will play with that. So suppose some node is offline (e.g. for maintenance): how will GlusterFS act then in a mixed configuration?
 
Old 12-22-2014, 03:21 PM   #6
ttk
Member
 
Registered: May 2012
Location: Sebastopol, CA
Distribution: Slackware64
Posts: 965
Blog Entries: 27

Rep: Reputation: 1354
When you take a node offline, the volume will continue to be usable as long as it has enough redundancy (via mirroring). There is a command-line tool ("gluster") which can tell you which bricks are missing from a volume. Remote users accessing files on the volume can keep accessing them as if nothing were wrong.
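For example, with a hypothetical volume named myvol:

```shell
gluster volume status myvol      # shows which bricks are online or missing
gluster volume heal myvol info   # lists files awaiting re-sync on mirrors
```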

See this blog entry for an example of the process for taking a brick out of a mirror and putting another one in its place:

http://blog.angits.net/serendipity/a...-is-replicated
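The operation that post describes boils down to something like the following (volume and node names are hypothetical, and the exact replace-brick syntax has changed between releases, so check gluster help on your version):

```shell
# Swap a dead mirror brick for a fresh one:
gluster volume replace-brick myvol oldnode:/data newnode:/data commit force
# Then let self-heal copy the surviving mirror's data onto the new brick:
gluster volume heal myvol full
```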
 
Old 12-22-2014, 03:34 PM   #7
/dev/random
Member
 
Registered: Aug 2012
Location: Ontario, Canada
Distribution: Slackware 14.2, LFS-current, NetBSD 6.1.3, OpenIndiana
Posts: 319

Rep: Reputation: 112
There is also Ceph + ZFS

Pros:
RAIDZ2
Deduplication
Self-Healing (no more bit rot)

Cons:
Needs a lot of RAM
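As a sketch of the ZFS side, a deduplicating RAIDZ2 pool might be set up like this (pool name and device names are hypothetical):

```shell
# Double-parity (RAIDZ2) pool across five disks, with dedup enabled.
# Dedup's in-memory tables are the main reason ZFS wants lots of RAM.
zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
zfs set dedup=on tank
zpool scrub tank   # periodic scrubs detect and repair silent corruption
```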
 
Old 01-09-2015, 12:24 PM   #8
AlexGg
LQ Newbie
 
Registered: Dec 2014
Posts: 8

Original Poster
Rep: Reputation: Disabled
I am going to play with BeeGFS (formerly FhGFS, from Fraunhofer). Has anyone used it with Rocks?
 
  

