Hi James,
In short, no. I eventually dumped GlusterFS. It's a fantastic idea, but for me the performance was shockingly bad and the client-side caching was seriously inadequate.
In their defence, however, GlusterFS can be mounted on multiple hosts, and in that scenario extensive caching would be a very bad idea. It just so happened that in my scenario I actually wanted extremely heavy caching, and while Gluster had translators for that sort of thing, they were seriously limited.
If I had the time and skills I would have written or extended the translators to do what I needed, since I think the concept of the product is awesome.
I did manage to get the level of caching I needed by re-exporting GlusterFS with NFS and then setting actimeo=<high number> when the NFS share was mounted. This did what I needed, but 'Stale NFS file handle' errors, which are apparently related to the use of the FUSE library, made it completely useless (you would basically end up with random files being inaccessible for random periods).
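For reference, the re-export looked something like this (the paths, network range and timeout value here are just placeholders, not the exact ones I used):

    # /etc/exports on the box holding the GlusterFS FUSE mount
    # fsid= is needed when re-exporting a FUSE filesystem via kernel NFS
    /mnt/gluster    192.168.0.0/24(rw,sync,fsid=1,no_subtree_check)

    # on the consuming box, mount with a long attribute cache timeout
    mount -t nfs glusterbox:/mnt/gluster /mnt/cached -o actimeo=3600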
I tried to avoid the FUSE library by using the patched unfs+booster supplied by the GlusterFS team. This overcame the 'Stale NFS file handle' issues, but for some reason all caching stopped working on the NFS client when mounting a share exported with unfs. After lots of reading I found a suggestion that a userspace NFS daemon cannot pass real inode numbers through properly, and that unfs uses some form of compressed path string instead, which apparently breaks the client-side cache.
At this point I gave up and went back to the drawing board.
I have since implemented a solution that is performing pretty well, and one that (to me at least) is surprisingly simple.
My requirements were:
- Safe storage (i.e. two separate copies of the data on two physically separate servers; mirroring across two disks in the same server was not good enough)
- One filesystem spanning many disks, with the ability to add disks as required to expand the available space, and to retire them without losing data
- Sufficient caching to accommodate my heavy filesystem scanning
My solution was (I think) incredibly simple. Here is what I did.
I started with two servers, each with a large, matching-sized hard disk partition to be used for storing data.
I used AoE (ATA over Ethernet) to export the physical disk partition from each server.
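On each storage server the export is a one-liner using vblade (the userspace AoE exporter); the shelf/slot numbers, interface and partition names below are just examples:

    # storage server 1: export /dev/sdb1 as AoE shelf 0, slot 0, over eth0
    vblade 0 0 eth0 /dev/sdb1 &

    # storage server 2: export its partition as shelf 1, slot 0
    vblade 1 0 eth0 /dev/sdb1 &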
I then enabled AoE on a third box, which I call the master box (basically, the box where the filesystem is to be mounted). By doing this and letting AoE scan the network, the two disks exported from the other servers appeared as local devices.
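On the master box that just means loading the aoe module and letting it discover the exports (the device names under /dev/etherd/ follow the shelf.slot numbers chosen above):

    modprobe aoe
    aoe-discover           # from the aoetools package
    ls /dev/etherd/        # the exported disks show up as e0.0, e1.0, ...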
I then used the standard Linux software RAID tool (mdadm) to make a RAID 1 mirror from the two disks, which shows up as something like /dev/md0.
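A rough sketch of that step, using the example AoE device names from above:

    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/etherd/e0.0 /dev/etherd/e1.0
    cat /proc/mdstat       # watch the initial mirror resync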
Then I set up LVM and used /dev/md0 as my first PV.
After getting my LV set up, I created an ext3 filesystem on it and mounted the result under /data.
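Something along these lines, with made-up VG/LV names (datavg/datalv):

    pvcreate /dev/md0
    vgcreate datavg /dev/md0
    lvcreate -n datalv -l 100%FREE datavg
    mkfs.ext3 /dev/datavg/datalv
    mkdir -p /data
    mount /dev/datavg/datalv /data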
Now, Linux has its standard page cache for local disk access and uses any spare RAM to cache whatever it can (evicting roughly the least-recently-used data when memory is needed), so we just threw lots of RAM at the master server and got quite an extensive cache. (Even with 1GB of RAM it did what we wanted with the test data we had.)
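You can watch the page cache soak up the spare RAM while the scans run:

    free -m                        # the 'cached' column grows as files are read
    grep -i cached /proc/meminfo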
The rest is pretty simple. To extend the storage space we just add two more servers with disks, RAID-1 them with mdadm to make /dev/md1, mark that device as an LVM PV and use it to extend the VG and LV. Then resize2fs grows the ext3 filesystem to take up the extra space, all without even having to unmount the filesystem.
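The whole grow operation, with the assumed names from above and shelves 2 and 3 for the new pair of servers, looks roughly like this:

    # new servers export their disks with vblade, then on the master:
    aoe-discover
    mdadm --create /dev/md1 --level=1 --raid-devices=2 \
        /dev/etherd/e2.0 /dev/etherd/e3.0
    pvcreate /dev/md1
    vgextend datavg /dev/md1
    lvextend -l +100%FREE /dev/datavg/datalv
    resize2fs /dev/datavg/datalv   # grows ext3 online, no unmount needed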
It is also possible with LVM to move the contents of one device onto another empty device of equal or greater size. This lets us retire old servers as new ones are added, where required.
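Retiring a pair of servers is just a matter of emptying their PV and dropping it from the VG (assuming the names above, and that there is enough free space on the remaining PVs):

    pvmove /dev/md0                # migrate its extents onto the other PVs
    vgreduce datavg /dev/md0
    pvremove /dev/md0
    mdadm --stop /dev/md0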
I considered striping, but given the way our data is written and accessed, it ends up distributed nicely across the disks anyway, so we get a reasonably good striping effect for free.
There is no reason why you could not just export it with NFS and mount it elsewhere, on multiple systems if needed.
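For example (the export network and hostnames are placeholders):

    # /etc/exports on the master box
    /data    192.168.0.0/24(rw,sync,no_subtree_check)

    exportfs -ra

    # on any client
    mount -t nfs master:/data /data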
Obviously, one major drawback this method has compared to Gluster is a single point of failure and a single point of congestion: the master box. That is exactly what I wanted in this case, but it might not be what you want.
Anyway, that was my epic journey. I hope it is of some use to you, or some other weary person who consumed far too much coffee trying to find a solution for something like this.
I would just like to reiterate to anyone reading that GlusterFS is a fantastic product and, while it did not quite suit me, is very much worth a good look.
They have a version of GlusterFS in beta with an NFS server built in. It might be worth trying out when it's released; I tried the beta but it kept crashing within seconds for me.