Linux - Software
This forum is for Software issues.
Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.
I have around 7 million files totaling 1.4 TB on one Linux server that I need to mirror onto another. I'll do the initial copy using cp -a onto a USB hard drive, but I then need to keep the mirror up to date. The servers are linked via our site-to-site VPN with about 10 Mb/s throughput and 50 ms ping.
(The files, by the way, are the BackupPC 4.0 pool, holding the backups for our workstations. Fortunately BackupPC 4.0 doesn't use hardlinks like the older versions did.)
The amount of data that's new or changed could be very variable. Usually it won't be much, but every once in a while there could be hundreds of gigs to shift, possibly in a single file.
Requirements:
Changes only need to propagate one way.
It's OK for the transfer to be scheduled; it doesn't need to be real time. (I can use run-one to ensure duplicate transfer processes don't get started.)
It needs to be possible to interrupt the transfer, start it again later, and minimise repeated work (so it doesn't get stuck in an endless loop).
Ideally it can cope with the mirror source changing during the process, though if required I can ensure it stays unchanged over the weekend.
Either server can initiate the process.
Encryption is not required (since the VPN encrypts the data over the internet).
Bandwidth limiting would be good. (But if not native, I can use trickle for that.)
Source is Debian 9, destination is Ubuntu 18.04; I would prefer to use software from the repos.
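The run-one requirement above can also be covered with flock from util-linux, which is in both distros' repos. A minimal sketch (the lock path and the echoed placeholder stand in for the real transfer command):

```shell
#!/bin/sh
# Hypothetical wrapper: flock -n exits non-zero if another instance
# already holds the lock, so cron can't start overlapping syncs.
LOCK=/tmp/mirror-sync.lock

OUT=$(flock -n "$LOCK" -c 'echo sync-would-run-here' || echo skipped-previous-run-active)
echo "$OUT"
```

In a crontab you would wrap the actual transfer command in the flock call; -n makes a second invocation fail fast instead of queueing behind the first.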
I would have just gone with rsync, but I've heard reports of it struggling with millions of files, so I wondered if people had any other suggestions? I use unison for two-way syncs but I've found it to be temperamental.
I haven't heard of such problems but would be interested if you can verify or debunk that rumor.
Or if the volume really is a problem, try rsync on several smaller subsets of the data.
There's an OpenRsync in the works over at the OpenBSD project. It should be fully interoperable with the original rsync but is a clean-room re-implementation. I'm not sure how far along they are with it, though, or whether others have been able to port it to other systems yet.
I would rsync in sections of the tree. So if you have a folder structure like
Code:
Root/
    WS1/
    WS2/
    WS3/
Then write a job to enumerate the folders under root and run them as individual jobs. That way you have smaller jobs and the ability to detect failure in a more structured manner.
rsync has always had problems with massive numbers of files. The solution is snapshots, rather than having to read a bazillion inodes from disk. I long ago went with btrfs, but then I don't have a production environment to look after, and I'm anal about backups.
Snaps only track changes, and they are static (in the sense of "point-in-time"), so you can back them up at your leisure then delete them to recover the space. This is how I use them. With btrfs you can even send only the difference between two snaps (at the source) to save time/data. This is an *old* concept in the enterprise world.
I've even seen drivers at the VFS block level that will do similar for non-snapshot enabled filesystems (ext?, XFS ...) but have never tested them.
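For illustration, one incremental btrfs send/receive cycle might look like the sketch below. The paths, subvolume layout, and mirror-host name are all hypothetical, and the commands are only collected and printed here, since running them needs root and real btrfs filesystems on both sides:

```shell
#!/bin/sh
# Sketch of btrfs snapshot-based replication. Commands are recorded
# and printed rather than executed.
PLAN=""
run() { PLAN="${PLAN}+ $*\n"; }

POOL=/mnt/pool/backuppc          # live data, on a btrfs subvolume (hypothetical)
PREV=/mnt/snapshots/mirror-prev  # baseline snapshot from the last sync
CUR=/mnt/snapshots/mirror-cur

# 1. Take a read-only, point-in-time snapshot of the live data.
run btrfs subvolume snapshot -r "$POOL" "$CUR"
# 2. Send only the difference between the two snapshots over the VPN.
run "btrfs send -p $PREV $CUR | ssh mirror-host btrfs receive /mnt/snapshots"
# 3. Rotate: drop the old baseline; the current snapshot becomes it.
run btrfs subvolume delete "$PREV"
run mv "$CUR" "$PREV"

printf '%b' "$PLAN"
```

Because the snapshot is static, the source can keep changing during the transfer, which addresses the "source changing during the process" requirement directly.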
Quote:
... The solution is snapshot, rather than having to read a bazillion inodes from disk. I long ago went with btrfs, but then I don't have a production environment to look after, and I'm anal about backups...
I've read that OpenZFS can do some kind of replication.
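If both machines ran OpenZFS, the equivalent incremental cycle would use zfs send/receive along these lines. The dataset name tank/backuppc and the host are hypothetical, and the commands are printed rather than executed since they need a real ZFS pool:

```shell
#!/bin/sh
# Sketch of OpenZFS incremental replication. Commands are recorded
# and printed rather than executed.
PLAN=""
run() { PLAN="${PLAN}+ $*\n"; }

# 1. Snapshot the current state of the dataset.
run zfs snapshot tank/backuppc@mirror-cur
# 2. Send only what changed since the previous snapshot.
run "zfs send -i tank/backuppc@mirror-prev tank/backuppc@mirror-cur | ssh mirror-host zfs receive -F tank/backuppc"
# 3. Rotate snapshots so the next run has a baseline to diff against.
run zfs destroy tank/backuppc@mirror-prev
run zfs rename tank/backuppc@mirror-cur tank/backuppc@mirror-prev

printf '%b' "$PLAN"
```

As with btrfs, the diff is computed from snapshot metadata rather than by walking millions of inodes, which is what makes this approach scale.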