Question: Incremental backups of large files that change frequently
I wish to implement this backup strategy on others' computers, or if not this exact strategy, something functionally equivalent. I have run into a potential problem, though. These other users are not so computer savvy, and they tend to have really huge email stores. For example, in Thunderbird they have "Inbox", "Sent", and "Trash" files that together total over 2 GB. Only a very small portion of those huge files changes daily, but they do change daily, and since standard backup strategies are file-based, you end up with these huge files being backed up in their entirety day after day, consuming a great deal of disk space.
Other than training users to keep cleaner/smaller mail stores, is there a backup strategy/solution that handles this dilemma? Apart from email, the rest of the users' data is quite static, and backups are easy for that. But for these users, pretty much computer = email, so that's what needs to be backed up.
I was toying with the idea of using some kind of source code control system on their huge email files, since things like rcs and sccs store only the file changes in their revisions (for text files). I'd have to find some way to break the original file out from its revisions so that those would be stored in separate files. I'd then have cron "check in, then check out" their huge email stores to source control before each backup, and have the backup script exclude the original huge email store files and back up their source-control equivalents instead. Utilizing source code control is just my way of being lazy and not writing my own "diff" scripting strategy.
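For what it's worth, the "store only the changes" idea can be sketched without pulling in rcs at all. Mbox files like Thunderbird's mostly grow by appending new mail (though status-flag updates do rewrite bytes in place, in which case this falls back to a full copy). This is a rough illustration only — the function name and backup layout are invented, not any real tool:

```shell
# Sketch only: back up just the newly appended tail of an append-mostly file.
# If the old prefix no longer matches (e.g. the folder was compacted),
# fall back to a fresh full copy.
backup_tail() {
    src=$1 dest=$2
    mkdir -p "$dest"
    new_size=$(stat -c %s "$src")
    if [ -f "$dest/size" ]; then
        old_size=$(cat "$dest/size")
        old_hash=$(cat "$dest/hash")
        cur_hash=$(head -c "$old_size" "$src" | sha256sum | awk '{print $1}')
        if [ "$new_size" -ge "$old_size" ] && [ "$cur_hash" = "$old_hash" ]; then
            # Prefix unchanged: today's increment is just the appended bytes.
            # Zero-pad the offset so deltas concatenate in glob order on restore.
            tail -c +"$((old_size + 1))" "$src" > "$dest/delta.$(printf '%012d' "$old_size")"
            echo delta
        else
            # File was rewritten: start over with a full copy.
            rm -f "$dest"/delta.*
            cp "$src" "$dest/full"
            echo full
        fi
    else
        cp "$src" "$dest/full"
        echo full
    fi
    echo "$new_size" > "$dest/size"
    head -c "$new_size" "$src" | sha256sum | awk '{print $1}' > "$dest/hash"
}
```

Restoring is then just `cat full delta.* > Inbox`. Real source control (or a real diff-based backup tool) handles non-append changes far more gracefully; this only shows why the append-mostly growth pattern makes the problem tractable.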
I figure somebody has probably implemented some function like this and that would save me the time of re-inventing the wheel. Does anybody know of such a thing?
Alternately, I could give up on trying to parse their large email stores, just back them up in their grandiose magnitude each day, but limit each user to XXX amount of backup storage space. For users with clean/small email stores, they may get months and months of incremental backups. For the "dirty" users, they may only get a few days.
Any ideas on how to implement this, or any alternate/better strategies? Thanks in advance.
There is a program called rdiff-backup that would be ideally suited for this. The history of each file is stored as a series of reverse diffs from the current state. The program (a set of Python scripts, actually) has unfortunately not seen any active development since 2009, but it works quite well. The outstanding bugs mainly have to do with keeping the archive on a Windows filesystem, and with breaking (when restoring) sets of hard-linked files into two or more separate groups if names have been added to the set over time. About the only operational shortcoming is that there is no way to go into, say, an archive that has been updated daily for two years and keep only the monthly increments for the older dates.
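For anyone landing on this thread later, a typical rdiff-backup session looks something like this. The paths here are throwaway temp directories purely for illustration (a real setup would point at the mail directory and a dedicated archive), and the sketch exits quietly if rdiff-backup isn't installed:

```shell
# Example rdiff-backup session (illustration only; temp paths, tiny fake mbox).
command -v rdiff-backup >/dev/null 2>&1 || exit 0

work=$(mktemp -d)
mkdir "$work/mail"
printf 'first message\n' > "$work/mail/Inbox"

# Each run stores the current state plus reverse diffs of older states,
# so a big mbox costs roughly one full copy plus small per-run diffs.
rdiff-backup "$work/mail" "$work/archive"

sleep 1    # increments are named by timestamp; keep the sessions distinct
printf 'second message\n' >> "$work/mail/Inbox"
rdiff-backup "$work/mail" "$work/archive"

# Inspect history, restore a past (here: current) state, prune old history.
rdiff-backup --list-increments "$work/archive"
rdiff-backup --restore-as-of now "$work/archive/Inbox" "$work/restored-Inbox"
rdiff-backup --remove-older-than 1Y "$work/archive"
```

`--restore-as-of` also takes relative dates such as `10D` or an absolute date, which is exactly the "recover the state from some arbitrary past day" requirement.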
(I've heard that close examination of the Python code has been associated with increased risk of cancer of the eyeballs in the state of California.)
Last edited by rknichols; 09-19-2013 at 06:23 PM.
Reason: Add caution about examining the source
Having a central IMAP mailstore, where each mail corresponds to a file (e.g. maildir), makes it easier not only to back up the mail, but also to restore it. The users can continue to use Thunderbird, Outlook, etc., and freely switch among them.
That would be nice, but I am not talking about a corporate setup here. These users are "my father-in-law", "my sister", "my parents", etc. Just personal PCs (I have installed Linux for them, to make my job of "free remote support" easier!). Their email providers are their individual ISPs, and are all POP. The thought is good though ... thanks!
There is a program called rdiff-backup that would be ideally suited for this.
I was not aware of this program. I did a brief look into it and as you say, it may be just what I'm looking for. I will have to research it in more depth and do some testing with it when I get a little more free time. Thanks for the pointer!
I am wondering if there is an email client that uses POP3/SMTP to interface with the ISP's mail server, but stores the emails locally as individual files.
E.g., Thunderbird stores multiple emails in large files such as "Inbox", "Sent", etc. Is there a client that, instead of having, say, an "Inbox" file, has an "Inbox" directory, which in turn contains individual files for the various emails? A structure like that would make local backups trivial.
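Even without switching clients, you can get that one-file-per-message layout at backup time: in mbox format every message starts with a "From " separator line (occurrences of "From " inside message bodies are escaped to ">From "), so a short awk script can split a folder into individual files. A rough sketch — the function name and output naming are invented for illustration:

```shell
# Sketch: split an mbox folder into one file per message, keyed on the
# "From " separator lines that begin each message.
split_mbox() {
    mbox=$1 outdir=$2
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        /^From / { n++; out = dir "/msg." sprintf("%05d", n) }
        n        { print > out }
    ' "$mbox"
}
```

Run nightly before the backup, this would turn one 2 GB "Inbox" into thousands of small, mostly unchanging files that any file-based incremental backup handles well.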
rsync addresses the problem of transmission, but neither rsync nor rsnapshot solves the problem of storage when you have very large files that each change just a little each day and you need to be able to recover the state they were in on some arbitrary date in the past.
Here's a link to a pretty good comparison between rsnapshot and rdiff-backup.
Last edited by rknichols; 09-21-2013 at 08:32 PM.
Reason: Add link to comparison article