Linux - Software
This forum is for Software issues.
Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.
I have around 7 million files totaling 1.4 TB on one Linux server that I need to mirror onto another. I'll do the initial copy using cp -a onto a USB hard drive, but I then need to keep the mirror up to date. The servers are linked via our site-to-site VPN with about 10 Mb/s throughput and 50 ms ping.
(The files, by the way, are the BackupPC 4.0 pool, holding the backups for our workstations. Fortunately BackupPC 4.0 doesn't use hardlinks like the older versions did.)
The amount of data that's new or changed could be very variable. Usually it won't be much, but every once in a while there could be hundreds of gigs to shift, possibly in a single file.
Requirements:
Changes only need to propagate one way.
It's OK for the transfer to be scheduled; it doesn't need to be real time. (I can use run-one to ensure duplicate transfer processes don't get started.)
It needs to be possible to interrupt the transfer, start it again later, and minimise repeated work (so it doesn't get stuck in an endless loop).
Ideally it can cope with the mirror source changing during the process, though if required I can ensure it stays unchanged over the weekend.
Either server can initiate the process.
Encryption is not required (since the VPN encrypts the data over the internet).
Bandwidth limiting would be good. (But if not native, I can use trickle for that.)
Source is Debian 9, destination is Ubuntu 18.04; I would prefer to use software from the repos.
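The run-one requirement above can also be covered with flock from util-linux, which is in both distros' repos. A minimal sketch (the lock path and the echoed placeholder stand in for the real transfer command):

```shell
#!/bin/sh
# Hypothetical wrapper: flock -n exits non-zero if another instance
# already holds the lock, so cron can't start overlapping syncs.
LOCK=/tmp/mirror-sync.lock

OUT=$(flock -n "$LOCK" -c 'echo sync-would-run-here' || echo skipped-previous-run-active)
echo "$OUT"
```

In a crontab you would wrap the actual transfer command in the flock call; -n makes a second invocation fail fast instead of queueing behind the first.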
I would have just gone with rsync, but I've heard reports of it struggling with millions of files, so I wondered if people had any other suggestions? I use unison for two-way syncs but I've found it to be temperamental.
I haven't heard of such problems but would be interested if you can verify or debunk that rumor.
Or if the volume really is a problem, try rsync on several smaller subsets of the data.
There's an OpenRsync in the works over at the OpenBSD project. It should be fully interoperable with the original rsync but is a clean-room re-implementation. I'm not sure how far along they are with it, though, or whether others have been able to port it to other systems yet.
I would rsync in sections of the tree. So if you have a folder structure like
Code:
Root/
    WS1/
    WS2/
    WS3/
Then write a job to enumerate the folders under root and run them as individual jobs. That way you have smaller jobs and the ability to detect failure in a more structured manner.
rsync has always had problems with massive numbers of files. The solution is snapshots, rather than having to read a bazillion inodes from disk. I long ago went with btrfs, but then I don't have a production environment to look after, and I'm anal about backups.
Snaps only track changes, and they are static (in the sense of "point-in-time"), so you can back them up at your leisure then delete them to recover the space. This is how I use them. With btrfs you can even send only the difference between two snaps (at the source) to save time/data. This is an *old* concept in the enterprise world.
I've even seen drivers at the VFS block level that will do similar for non-snapshot enabled filesystems (ext?, XFS ...) but have never tested them.
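For illustration, one incremental btrfs send/receive cycle might look like the sketch below. The paths, subvolume layout, and mirror-host name are all hypothetical, and the commands are only collected and printed here, since running them needs root and real btrfs filesystems on both sides:

```shell
#!/bin/sh
# Sketch of btrfs snapshot-based replication. Commands are recorded
# and printed rather than executed.
PLAN=""
run() { PLAN="${PLAN}+ $*\n"; }

POOL=/mnt/pool/backuppc          # live data, on a btrfs subvolume (hypothetical)
PREV=/mnt/snapshots/mirror-prev  # baseline snapshot from the last sync
CUR=/mnt/snapshots/mirror-cur

# 1. Take a read-only, point-in-time snapshot of the live data.
run btrfs subvolume snapshot -r "$POOL" "$CUR"
# 2. Send only the difference between the two snapshots over the VPN.
run "btrfs send -p $PREV $CUR | ssh mirror-host btrfs receive /mnt/snapshots"
# 3. Rotate: drop the old baseline; the current snapshot becomes it.
run btrfs subvolume delete "$PREV"
run mv "$CUR" "$PREV"

printf '%b' "$PLAN"
```

Because the snapshot is static, the source can keep changing during the transfer, which addresses the "source changing during the process" requirement directly.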
Quote:
... The solution is snapshot, rather than having to read a bazillion inodes from disk. I long ago went with btrfs, but then I don't have a production environment to look after, and I'm anal about backups...
I've read that OpenZFS can do some kind of replication.
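If both machines ran OpenZFS, the equivalent incremental cycle would use zfs send/receive along these lines. The dataset name tank/backuppc and the host are hypothetical, and the commands are printed rather than executed since they need a real ZFS pool:

```shell
#!/bin/sh
# Sketch of OpenZFS incremental replication. Commands are recorded
# and printed rather than executed.
PLAN=""
run() { PLAN="${PLAN}+ $*\n"; }

# 1. Snapshot the current state of the dataset.
run zfs snapshot tank/backuppc@mirror-cur
# 2. Send only what changed since the previous snapshot.
run "zfs send -i tank/backuppc@mirror-prev tank/backuppc@mirror-cur | ssh mirror-host zfs receive -F tank/backuppc"
# 3. Rotate snapshots so the next run has a baseline to diff against.
run zfs destroy tank/backuppc@mirror-prev
run zfs rename tank/backuppc@mirror-cur tank/backuppc@mirror-prev

printf '%b' "$PLAN"
```

As with btrfs, the diff is computed from snapshot metadata rather than by walking millions of inodes, which is what makes this approach scale.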