LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Server (https://www.linuxquestions.org/questions/linux-server-73/)
-   -   server side file system replication over WAN (https://www.linuxquestions.org/questions/linux-server-73/server-side-file-system-replication-over-wan-840056/)

kuntergunt 10-23-2010 06:05 PM

server side file system replication over WAN
 
I need synchronised filesystems on two servers connected via a VPN tunnel over a WAN. Both servers run Samba with replicated settings. I've been looking for quite a while for a solution that also replicates the files served by the servers. It should replicate in both directions, close to realtime, without requiring a high-bandwidth connection.

I have found many solutions for clustered filesystems or replication: Coda, Lustre, InterMezzo, DRBD, OpenAFS, Ceph, Hadoop, GlusterFS, Unison and others. None of them seems to meet my needs; some are HPC clustering solutions only, others are focused on client caching and offline editing.

There is a promising candidate, XtreemFS, but it is still under development; so far the latest version only supports read replication.

Does anyone have a suggestion or solution?

edkirk 12-04-2010 02:02 PM

I've been looking for something similar for a while, although I don't wish to use a distributed filesystem.

I have a ZFS server at each of two sites and want data to be synchronised both ways (changes are made at both sites). It's only a 1 Mbit/s connection, so accessing large files over the link is not practical; there need to be local copies, asynchronously replicated. I tried an rsync set-up, but that wasn't suitable. I was going to try a Unison-based system, as this seemed the only vaguely suitable tool available (even though it's not under active development).
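For what it's worth, a minimal Unison profile for two-way sync over a slow link could look roughly like this. Hostnames and paths are placeholders, and the preference values are a sketch rather than a tested setup:

```
# ~/.unison/share.prf -- hypothetical example profile
root = /tank/share
root = ssh://siteb.example.com//tank/share

# on conflict, prefer the newer replica (last change wins)
prefer = newer
# run unattended, e.g. from cron
batch = true
# propagate modification times
times = true
```

Run with `unison share` on a schedule; over a slow link the delta transfer keeps the traffic down, but it is still asynchronous, not realtime.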

Did you get anywhere with your system?

archtoad6 12-06-2010 04:59 AM

Quote:

Originally Posted by kuntergunt (Post 4137209)
I have found many solutions for clustered filesystems or replication: Coda, Lustre, InterMezzo, DRBD, OpenAFS, Ceph, Hadoop, GlusterFS, Unison and others. None of them seems to meet my needs; some are HPC clustering solutions only, others are focused on client caching and offline editing.

That's a lot of time w/ Google or Wikipedia for each of your readers to look up each of them. How about providing links, so we can better understand what you want & why these are not appropriate?

kuntergunt 12-12-2010 12:23 PM

my needs and a list of possible solutions
 
I have already spent quite some time on investigation. It is hard to find details on whether most of these solutions meet (some of) my requirements or not. I have found endless feature lists, but most of them (except DRBD and XtreemFS) have no use cases. They don't say what they are good for and what they cannot do. The term "distributed filesystem" has a lot of flavors and attributes: performance/load balancing, fault tolerance, replication, multi-master/single-master, low-bandwidth support, POSIX compliance, kernel integration/FUSE, whether client software needs to be installed, ... (many more).

What I would like to have, if I can get it:
If I open a file at one of 2 or more subsidiaries, I want it to open immediately (local copy).
If I open it for writing, it should be locked at all other locations that are online (read access still available).
If I close the file after editing, it should be replicated to the other servers in the background (maybe locked on all servers until replication is finished).
If one server goes offline and there is no replication, all files are still available for read/write access on any server.
If the server goes online again, an automatic sync of the changed files occurs. If changes on both sides are detected, either the last change wins or a copy of each changed version is stored (similar to Dropbox).

Right now I am thinking about using Unison with scheduled replication, or waiting for XtreemFS to support write replication.
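The conflict handling I want (last change wins, with a copy of the losing version kept) can be sketched roughly like this. This is a hypothetical Python illustration, not part of any of the tools below; the `reconcile` function, the `.conflict` suffix and the mtime-based policy are all my own assumptions:

```python
import shutil
from pathlib import Path

def reconcile(local: Path, remote: Path, last_sync: float) -> None:
    """Merge one file after a server comes back online.

    Sketch: if both sides changed since `last_sync`, that is a
    conflict; the newer version wins and the losing version is kept
    as a ".conflict" copy, similar to Dropbox's conflicted copies.
    """
    local_changed = local.stat().st_mtime > last_sync
    remote_changed = remote.stat().st_mtime > last_sync

    if local_changed and remote_changed:
        # Conflict: the last change wins, the other side is preserved.
        loser = local if local.stat().st_mtime < remote.stat().st_mtime else remote
        shutil.copy2(loser, loser.with_suffix(loser.suffix + ".conflict"))
        winner = remote if loser is local else local
        shutil.copy2(winner, loser)
    elif remote_changed:
        shutil.copy2(remote, local)   # pull the remote change
    elif local_changed:
        shutil.copy2(local, remote)   # push the local change
```

A real implementation would also have to handle deletions, renames and clock skew between the sites, which this sketch ignores.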

my investigations (unsorted):

Intermezzo
development stopped

PVFS
load balancing, parallel I/O, used for HPC

XtreemFS
distributed WAN filesystem, still under development, write replication not yet implemented

Coda
distributed filesystem for road warriors, under development, needs client software

Lustre
load balancing, parallel I/O, used for HPC

rsync
available on every Unix system, asynchronous (not realtime), not bidirectional

DRBD
works like a software mirror over the network, only one active node, max. 2 nodes, supports asynchronous mode over low bandwidth

OpenAFS
distributed filesystem with server (Linux, Unix) and clients (e.g. Windows, Linux), needs client software installed

Ceph
distributed filesystem for HPC, data distributed like a stripeset for load balancing

Hadoop
written in Java, single master, multiple slaves, no kernel integration

GlusterFS
HPC, distributed data for data centers, supports replication

MooseFS
distributed file system for data centers, single master server

Unison
bidirectional synchronization tool, asynchronous, uses the rsync algorithm
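Since Unison builds on the rsync algorithm, it only ships changed blocks over the slow link; the trick is a cheap rolling checksum that can be slid along a file one byte at a time. A toy Python version of the weak checksum, as an illustration of the idea rather than rsync's actual code (the block size and 16-bit modulus follow the rsync paper's description):

```python
def rolling_checksums(data: bytes, block: int = 4):
    """Toy version of rsync's weak rolling checksum.

    a = plain sum of the bytes in the window, b = position-weighted
    sum; both mod 2**16. Sliding the window one byte updates both
    sums in O(1) instead of rescanning the whole block.
    """
    M = 1 << 16
    a = sum(data[:block]) % M
    b = sum((block - i) * data[i] for i in range(block)) % M
    sums = [(a << 16) | b]
    for k in range(len(data) - block):
        out, new = data[k], data[k + block]
        a = (a - out + new) % M
        b = (b - block * out + a) % M
        sums.append((a << 16) | b)
    return sums
```

The receiver indexes these cheap checksums for its blocks; the sender slides the window over its copy and only when a weak checksum matches does it compute a strong hash, so unchanged blocks never cross the link.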

archtoad6 12-12-2010 09:37 PM

Thanks for that list, it's more than I hoped for.

Now let's hope it helps & encourages other LQ-ers to help you find your answers.

kuntergunt 05-09-2011 02:04 PM

Another way to get data replicated is to use a service like Ubuntu One. It uses a userspace daemon to synchronise local changes to the Ubuntu One service, which is backed by Amazon S3 storage. In the other direction, changes are replicated from the service to the client via events. The whole replication is in near realtime; no delay or polling is needed. The Ubuntu One service is only free up to a certain amount of data (2 GB); beyond that it is $30 per 20 GB per year.
The transmission is encrypted, but the storage is not, so you should consider encrypting the data yourself. See https://wiki.ubuntu.com/UbuntuOne/Security
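One simple way to do that is to encrypt files before they enter the synchronised folder, e.g. with gpg. This is only a sketch; the sync folder, passphrase file and file names below are placeholders for a real setup:

```shell
# Hypothetical sketch: symmetric-encrypt a file before it enters the
# synced folder. SYNC_DIR and the passphrase file stand in for a real
# "~/Ubuntu One" setup; adjust to taste.
SYNC_DIR="${SYNC_DIR:-$HOME/Ubuntu One}"
mkdir -p "$SYNC_DIR"
printf 'change-me\n' > "$HOME/.sync-passphrase"
SRC="report.odt"
printf 'demo contents\n' > "$SRC"   # placeholder document
gpg --batch --yes --pinentry-mode loopback \
    --symmetric --cipher-algo AES256 \
    --passphrase-file "$HOME/.sync-passphrase" \
    --output "$SYNC_DIR/$SRC.gpg" "$SRC"
```

The obvious downside is that the other site needs the passphrase and an extra decrypt step, so this trades convenience for keeping plaintext off the third-party storage.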

