Old 03-27-2009, 04:41 AM   #1
Cyberman
Member
 
Registered: Aug 2005
Distribution: Debian Stable
Posts: 218

a better backup method?


I wasn't sure whether this belonged in Server or Programming, so I decided to put it here. I'd also like to say it would be ideal if things could be done in bash.

Now, I have an idea. It came from being highly annoyed with backup methods. I know there are some good ones, but I think they could be better, since symbolic links give us a cheap way to point at files that already exist in an earlier backup.

One of my issues with incremental backups is that they back up a file at its new location even if the file itself didn't change, which wastes space. Sure, the file might not have changed, but its location did; thus, the backup program, such as sbackup, thinks it's a new file and adds it to the incremental backup tar.

For example:

/home/blahblahblah/jackhandy.mpeg
was put into the full backup and the file was 2GB.

the next week it was moved to...
/home/blahblahblah/jackhandyfiles/jackhandy.mpeg

And yet typical backup methods put this file into the incremental backup again. That's annoying and wasteful.
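To see why, here's roughly how it plays out with GNU tar's incremental mode (just a sketch using the example paths above):

Code:
# Full backup: the snapshot file records what exists where.
tar --create --file=full.tar --listed-incremental=backup.snar /home/blahblahblah

# Move the 2GB file to a new directory; its contents don't change.
mkdir /home/blahblahblah/jackhandyfiles
mv /home/blahblahblah/jackhandy.mpeg /home/blahblahblah/jackhandyfiles/

# Incremental backup: tar tracks paths, not content, so the "new"
# path gets the full 2GB stored all over again.
tar --create --file=incr1.tar --listed-incremental=backup.snar /home/blahblahblah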

Anyone see a problem with that? I think there could be an improvement.

It would be ideal if the program checked the file's checksum/properties against files of the same name and simply linked to the copy already stored in a previous full or incremental backup.
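As a rough bash sketch of that check (everything here is invented for illustration: index.txt is a made-up index mapping checksums to paths already stored in earlier backup sets):

Code:
#!/bin/bash
# dedupe-check.sh - copy a file into the incremental backup only if its
# content isn't already stored; otherwise record a symlink to the old copy.
# index.txt lines look like: "<sha256>  <path inside an earlier backup>"

file="$1"            # file found at a new location since the full backup
index="index.txt"    # invented checksum index built during earlier runs
touch "$index"

sum=$(sha256sum "$file" | awk '{print $1}')
old=$(awk -v s="$sum" '$1 == s { print $2; exit }' "$index")

if [ -n "$old" ]; then
    # Same checksum: content already exists in an older backup,
    # so store only a symlink at the file's new location.
    mkdir -p "incremental/$(dirname "$file")"
    ln -s "$old" "incremental/$file"
else
    # Different checksum: genuinely new content, copy it and log it.
    cp --parents "$file" incremental/
    echo "$sum  $file" >> "$index"
fi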

So, I created a general outline of how backup methods could be improved. Tell me if any of you understand what I'm getting at and think backup methods should work like this in the future:

Code:
1) Full backup
2) Incremental backup
3) Restoration from last incremental backup
4) Restoration from any point

2) Incremental backup

Incremental backup attributes:
1. Logs every file that has changed since the last full backup.
   a. Logs if a file is no longer there.
   b. Logs if a file has moved.
      b1. Checks whether the moved file is the same file.
      b2. If the moved file is not the same file (different checksum),
          it is copied into the incremental backup and its new
          checksum is logged.
      b3. If the moved file is the same file (same checksum), only a
          symlink is created at its new location, pointing to the old
          copy in an earlier incremental or full backup. This prevents
          the file from being copied again, which would increase
          storage requirements.

Ways to make the checksum process easier:
1. Only log checksums for files over a certain size, such as 1MB or
   100KB (see the one-liner below).

3) Restoration from last incremental backup

1. The directory tree is created according to what the most recent
   version of the tree should look like.
2. Symlinks are replaced by the actual files they link to (sketched
   below).
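For the checksum shortcut above, something like this (GNU find and sha256sum assumed) would hash only files over 1MB:

Code:
find /home -type f -size +1M -exec sha256sum {} + > checksums.log

And here's a rough bash sketch of restoration step 2, replacing every symlink in the restored tree with the file it points to (assuming the links resolve to real files inside the older backup sets):

Code:
#!/bin/bash
# restore-links.sh - walk the restored tree and replace each symlink
# with the actual file it points to in an older backup set.

restore_root="$1"   # tree produced from the latest incremental backup

find "$restore_root" -type l | while read -r link; do
    target=$(readlink -f "$link")    # resolve to the old stored copy
    if [ -e "$target" ]; then
        rm "$link"
        cp "$target" "$link"         # materialize the real file
    else
        echo "warning: dangling link: $link" >&2
    fi
done

(cp -rL or rsync -aL would do much the same in one pass.)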

Last edited by Cyberman; 03-27-2009 at 04:45 AM.
 
Old 03-27-2009, 10:32 AM   #2
MensaWater
LQ Guru
 
Registered: May 2005
Location: Atlanta Georgia USA
Distribution: Redhat (RHEL), CentOS, Fedora, CoreOS, Debian, FreeBSD, HP-UX, Solaris, SCO
Posts: 7,831
Blog Entries: 15

No - I think backups are "point in time" snapshots, so it makes perfect sense to me that a moved file should be backed up again, because its "fully qualified path name" is completely different from what it was before.

YOU may know where every single file on your system is at a given point, but I doubt many others want to keep that detail in their head.

You could of course avoid this hassle by not moving your files around all the time.

By the way, there is a thing in the world called "data deduplication" (see Data Domain, for example). The idea is that it keeps track of bytes and hashes, so it only backs up data that is truly unique from backup to backup (and usually compresses as well). Using this technology you might get something like 80-to-1 compression on backups of sparse files. This kind of solution makes sense especially for multiple servers with the same OS, as it will not copy all the same OS files for every one of them - it will only create pointers for the ones after the first.
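The same pointer idea can be faked in bash with a content-addressed store - a toy sketch, not how Data Domain actually works (that dedupes at the block level; this one only dedupes whole files), and it assumes the store and the snapshot live on the same filesystem so hard links work:

Code:
#!/bin/bash
# Toy whole-file deduplication: each unique file body is stored once
# under its hash; a backup is just a tree of hard links into the store.

store="/backups/store"          # one copy per unique file content
snap="/backups/$(date +%F)"     # today's backup tree
src="/home/blahblahblah"

mkdir -p "$store"

find "$src" -type f | while read -r f; do
    sum=$(sha256sum "$f" | awk '{print $1}')
    blob="$store/$sum"
    [ -e "$blob" ] || cp "$f" "$blob"   # store content only once
    dest="$snap/${f#/}"
    mkdir -p "$(dirname "$dest")"
    ln "$blob" "$dest"                  # pointer, not another copy
done

Moving a file between runs then costs only a new hard link instead of another 2GB copy.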
 
  

