LinuxQuestions.org


d072330 02-20-2012 06:30 PM

rsync Backups
 
Is there a way with rsync to copy a large directory to multiple USB drives?

I have a directory that is 8.2 TB in size. We need to get it to a client, and the client only wants 2 TB drives.

So my question is: can rsync fill up the first USB drive and then, once I have manually swapped in the second drive (or rsync has asked me to), pick up where it left off and continue copying from that point?

Clear as mud, I hope so!

jhwilliams 02-21-2012 02:54 PM

Just one idea. Please check this against the man page and test it before you try to implement.

The general idea is to make a list of everything you want to copy, then start copying it. While copying, note every file that has copied OK, and exclude those files on subsequent copies.

First, create a payload list.

Code:

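rsync -an --out-format='%n' /src/ /dest/ 2>/dev/null | \
    grep -v '/$' | \
    tee payload.txt

The trailing slash on /src/ keeps the listed paths relative to the source directory, which is what --files-from expects. The grep drops directory entries so that only files are listed, and --out-format='%n' prints whole file names, including any that contain spaces.

Now, repeat this command by hand, running it once per drive, starting with a new empty one:

Code:

touch completed.txt
rsync -a --out-format='/%n' \
      --files-from=payload.txt \
      --exclude-from=completed.txt \
      /src/ /dest/ \
      2>/dev/null >> completed.txt

The leading / in the output format anchors each exclude pattern at the top of the transfer. Note that rsync prints a name when it starts on a file, so the last few entries in completed.txt may name files that did not finish when the drive filled up. Check those before moving on to the next drive.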

But my real solution for you is to inspect the source directory by hand and break the content up by subdirectories in a way that makes sense. For example, you might have /src/engineering_files, /src/hr_docs, /src/ceo_data. Just sync each tree to a separate disk by hand.
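
To decide how to divide the subdirectories between drives, it helps to know how big each one is. A minimal sketch, assuming a POSIX shell with du and sort; the function name subdir_sizes is made up for illustration:

```shell
#!/bin/sh
# subdir_sizes SRC
# Hypothetical helper: print "<size in KB><TAB><path>" for each top-level
# subdirectory of SRC, largest first.
subdir_sizes() {
    # The trailing /*/ makes the shell expand only directories
    du -sk "$1"/*/ | sort -rn
}
```

Running subdir_sizes /src gives one line per subdirectory, so you can add the sizes up by eye and pick groups that fit on a 2 TB drive.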

Also, tar is good at creating multi-volume archives. There might be some solution with tar that works for you.

A.Thyssen 02-22-2012 01:40 AM

It seems to me that this is almost identical to the old problem of packing files into size-limited news and mail messages.

Have a look at the various SHAR (shell archive) programs (there were lots of them), which not only packaged files, but also split them into groups with a total size limit on each group. Some let you extract specific files from one specific group. Files that were too big were split into smaller segments over multiple 'messages'.

I am sure that software is still around.


For another method you could try the RAR archiver, which can generate a large split archive.
It can also recover files when you have only a few of the 'pieces'.


Please be sure to let us know whatever solution you do come up with!
It has a lot of relevance, not just to USB sticks, but to CD and DVD data storage as well.


ASIDE: this is actually known as the 'bin packing' problem, and it has been shown to be NP-hard. That is, there is no known method that always finds the one 'perfect' packing in polynomial time. However, today's computers are fast enough, and simple heuristics good enough, that this is typically no barrier in any 'practical' situation.
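
One common heuristic for this kind of packing is 'first-fit decreasing': sort the items largest first, then put each one on the first drive that still has room. It is not guaranteed to find the best packing, but it usually gets close. A rough sketch, assuming input lines of the form size in KB, a tab, then a path (which is what du -sk prints); the function name pack_drives is an assumption for illustration, not part of any existing tool:

```shell
#!/bin/sh
# pack_drives CAPACITY_KB
# Hypothetical helper: reads "size_kb<TAB>path" lines on stdin and prints
# "drive_number<TAB>size_kb<TAB>path", using first-fit decreasing.
pack_drives() {
    sort -rn | awk -F '\t' -v cap="$1" '
        {
            size = $1 + 0
            # An item bigger than a whole drive can never be placed
            if (size > cap + 0) {
                print "too big for one drive: " $2 > "/dev/stderr"
                next
            }
            # Find the first drive that still has room for this item
            placed = 0
            for (d = 1; d <= ndrives; d++) {
                if (used[d] + size <= cap + 0) { placed = d; break }
            }
            # No existing drive has room, so start a new one
            if (!placed) placed = ++ndrives
            used[placed] += size
            print placed "\t" $1 "\t" $2
        }'
}
```

For example, du -sk /src/* | pack_drives 1900000000 would suggest a drive number for each top-level entry while leaving some headroom on a 2 TB drive.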

d072330 02-22-2012 11:12 AM

Will keep you posted. Currently I am working on a Perl script. If it works, is there a good place to post something like this so the masses can have it? Is there anything else that needs to be done to the script before putting it to general use (e.g. putting GNU info in it)?

sag47 02-22-2012 11:53 AM

If it's a script you can simply post it in CODE tags in a reply post here in this thread.

A.Thyssen 02-22-2012 08:39 PM

You can always upload it to a site like a public 'dropbox' folder, then post a link here.

NOTE: I like using CPAN for more complex things, but they are module oriented, without a proper place for scripts that don't need modules!

d072330 02-27-2012 11:53 AM

My Perl script is working like a champ so far. Have a few more tweaks then I will post here.

d072330 01-29-2013 04:28 PM

Here is the code that I forgot to post. As always, there is probably another way of doing this, but this is the way I did it.

Quote:

#!/usr/bin/perl
#########################################################
# This script will take user input for source and #
# up to 2 destinations and then create an array of the #
# source and destinations and push to @diff array. Once #
# the @diff array is populated it will then use rsync #
# to copy the data from source to destination1 and then #
# when destination1 is full it will roll over to #
# destination2. #
# #
# Update History: #
# 21-Feb-2012 - +added script usage #
# 22-Feb-2012 - +added comments to the script #
# 23-Feb-2012 - +added change ownership to destinations #
# 24-Feb-2012 - -removed change ownership #
# 24-Feb-2012 - -removed elsif statement for moving to #
# next disk when first disk is full #
# +changed ls -l to ls -lk to get 1024 KB #
# -removed nagios alert for now #
# 25-Feb-2012 - +added @space2, $info2 for $ddir2 #
# 27-Feb-2012 - +added rsync command with no logging #
# #
#########################################################

#--------------------------#
# Script settings #
#--------------------------#
use diagnostics;
use strict;
use warnings;

#--------------------------#
# Script Usage #
#--------------------------#
if (@ARGV != 3) {
print "\n\n";
print "usage: rsync_copy.pl <source drive> <destination drive1> <destination drive2>\n";
print "example: rsync_copy.pl /home/user /mnt/drive1 /mnt/drive2\n";
exit;
}

#--------------------------#
# Global Variables #
#--------------------------#
my $rsync = "/usr/bin/rsync -avp --progress";
my $bdir = "<put your directory here>";
my $sdir = $ARGV[0];
my $ddir1 = $ARGV[1];
my $ddir2 = $ARGV[2];
my $logdir = "<your directory here>";
my $ofile1 = "rsync_1-";
my $ofile2 = "rsync_2-";
my $error = "/tmp/changedrive";
my (@files, @dest1, @dest2, @diff, @isect, %count, @space1, @space2, @disk1, @disk2, @filesize, @fsize);
my ($files, $dest, $nfile, $diff, $isect, $item, $space1, $space2, $filesize, $fs, $fsize);
my ($info1, $info2, $disk1, $disk2, $part, $size, $used, $free1, $free2, $perc, $file);
my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime(time);
my $datestring = sprintf("%4d%02d%02d%02d%02d%02d",($year + 1900),($mon+1),$mday,$hour,$min,$sec);

#--------------------------#
# Main Routine #
#--------------------------#
### Get the source drive (server) contents ###
@files = `ls $bdir$sdir`;
#@files = sort(@files); ### Only remove comment if testing to see order ###
chomp @files;

### Get the destination source #1 contents ###
@dest1 = `ls $ddir1`;
#@dest1 = sort(@dest1); ### Only remove comment if testing to see order ###
chomp @dest1;

### Get the destination source #2 contents ###
@dest2 = `ls $ddir2`;
#@dest2 = sort(@dest2); ### Only remove comment if testing to see order ###
chomp @dest2;


@isect = ( ); ### Source files already present on @dest1 or @dest2 ###
@diff = ( ); ### Source files not yet on either destination ###
%count = ( );

### Count how often each name appears on the destination drives ###
foreach $item (@dest1, @dest2) { $count{$item}++; }

### Only names that exist in the source are candidates for copying ###
foreach $item (@files)
{
    if ($count{$item})
    {
        push @isect, $item;
    }
    else
    {
        push @diff, $item;
    }
}

### Sort once, after the loop, to rsync files in order (i.e. file1, file2 etc) ###
@diff = sort(@diff);

### uncomment only if debugging output of arrays ###
#print "\ndest1 Array = @dest1\n";
#print "\ndest2 Array = @dest2\n";
#print "\nfiles Array = @files\n";
#print "\nIntersect Array = @isect\n";
#print "\nDiff Array = @diff\n\n";

### If @diff has files then proceed ###
if (@diff)
{
    foreach $diff (@diff)
    {
        ### Get the free space of the destination sources ###
        ### (-P keeps each filesystem on one line of output) ###
        @space1 = `df -Pk "$ddir1"`;
        @space2 = `df -Pk "$ddir2"`;

        ### Get the second line of the df -Pk output ###
        $info1 = $space1[1];
        $info2 = $space2[1];

        ### Split the df -Pk output by space ###
        @disk1 = split(' ',$info1);
        @disk2 = split(' ',$info2);

        ### Get the free space field (KB) ###
        $free1 = $disk1[3];
        $free2 = $disk2[3];

        ### Get the size in KB of the item in the @diff array ###
        ### (du -sk also totals the contents of a directory) ###
        ($fs) = split(' ', `du -sk "$bdir$sdir/$diff"`);
        next unless defined $fs;

        ### Example of outputs from df -Pk and du -sk ###
        #disk space left = 57632320 = (55 GB)
        #file size in kb = 00056332 = (56 MB)

        ### Check to see if free space on destination #1 will allow the next item ###
        # 57632320 >= 00056332
        if ($free1 >= $fs)
        {
            print "Free disk space (KB): $free1\n";

            ### rsync the item to destination #1; system() lets the --progress output through ###
            #system(qq{$rsync "$bdir$sdir/$diff" "$ddir1" >> "$logdir$ofile1$datestring.txt"}); ### Use this line if you want to log it
            system(qq{$rsync "$bdir$sdir/$diff" "$ddir1"}); ### Comment this line out if you use logging
            print "$rsync $bdir$sdir/$diff $ddir1\n";
        }
        ### Otherwise roll over to destination #2 if it has room ###
        elsif ($free2 >= $fs)
        {
            print "Free disk space (KB): $free2\n";

            ### rsync the item to destination #2 ###
            #system(qq{$rsync "$bdir$sdir/$diff" "$ddir2" >> "$logdir$ofile2$datestring.txt"}); ### Use this line if you want to log it
            system(qq{$rsync "$bdir$sdir/$diff" "$ddir2"}); ### Comment this line out if you use logging
            print "$rsync $bdir$sdir/$diff $ddir2\n";
        }
        else
        {
            print "Not enough free space on either drive for $diff\n";
        }
    }
}
else
{
    ### Print to screen that there is nothing left to rsync ###
    print "\nNothing to rsync\n\n";
}

exit 0;
#--------------------------#
# End of Script #
#--------------------------#

A.Thyssen 01-29-2013 06:45 PM

Hmmmm, you do know that rsync can do the comparison itself, using file sizes, times, and block-level checksums (e.g. only the end of a log file is transferred when it has been appended to)?

Comparing files outside rsync would basically involve the equivalent of copying the files anyway.

d072330 01-30-2013 11:05 AM

No, I did not know this; good to know. The biggest issue I had was getting rsync to copy files from disk to USB and then, when USB #1 fills up, roll over to the second USB drive, and so on. If rsync will do this as well, please by all means post the command line arguments LOL.

A.Thyssen 01-30-2013 07:01 PM

It is an integral part of rsync to transfer only the changes. It was specifically designed with slow modems in mind. This is what makes it different from a normal 'file copy' such as scp, cp, tar, cpio, and so on.

Rsync only replaces a file on the destination (breaking any hardlinked copies) if the file's data changes, which is why you can create large numbers of 'snapshots' (even once an hour) using very little disk space.

Such rsync backups are not compressed, which allows each snapshot to look almost exactly like a simple full working copy of the directories that were backed up. That is, it is easy to search and access any file in any snapshot. You do not have to search multiple incremental compressed backup files just to recover a specific bit of data, perhaps without knowing the exact filename that data is in. Just search for it directly as you normally would, across all the snapshots. It is the hard linking of unchanged files that gives the rsync multi-snapshot backup method such good compression.

However, hardlinks only work within a single mounted filesystem, so each USB drive would have to have at least one full copy of the files being backed up. Also, hardlinked snapshotting requires... hard links, which require a UNIX style filesystem. USB sticks typically use only a low level VFAT filesystem (no hardlinks, and DOS file attributes) for maximum compatibility.

As such, USB sticks may need a different filesystem for this to work well. Larger USB drives with, say, an EXT4 filesystem tend to work better. They allow more hardlinked snapshots from the initial full copy (or last snapshot, depending on how you look at it), and thus higher disk space savings (hardlink compression) per snapshot.
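
A minimal sketch of the hardlinked snapshots described above, assuming an rsync that supports --link-dest and a backup filesystem that supports hard links. The function name snapshot and the 'latest' symlink are just conventions chosen for the example. BACKUP_ROOT should be an absolute path, since a relative --link-dest is taken relative to the destination directory:

```shell
#!/bin/sh
# snapshot SRC BACKUP_ROOT NAME
# Hypothetical helper: copy SRC into BACKUP_ROOT/NAME, hard linking any file
# that is unchanged since the previous snapshot (BACKUP_ROOT/latest).
snapshot() {
    src=$1 root=$2 name=$3
    mkdir -p "$root" || return 1
    if [ -d "$root/latest" ]; then
        # Unchanged files become hard links into the previous snapshot
        rsync -a --delete --link-dest="$root/latest" \
            "$src"/ "$root/$name"/ || return 1
    else
        # First run: nothing to link against, so make a full copy
        rsync -a --delete "$src"/ "$root/$name"/ || return 1
    fi
    # Point "latest" at the snapshot just made
    rm -f "$root/latest"
    ln -s "$name" "$root/latest"
}
```

Calling it once an hour with a timestamp as NAME, for example snapshot /home/user /mnt/backup "$(date +%Y%m%d%H)", gives a series of directories that each look like a full copy but only cost the space of the files that changed.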

A.Thyssen 01-30-2013 07:09 PM

ASIDE: The use of a cloud based filesystem (like dropbox) also precludes the use of hardlinks. As such, snapshotting to such a filesystem does not compress well, as you do not get hardlink sharing of files across individual snapshots.

However, making snapshot backups on a local machine of a (possibly encrypted) cloud based 'working' filesystem that can be shared across devices should work very well.

That one local machine keeps 'snapshot backups' (perhaps working automatically in the background), while the cloud allows access to the actual working directory from multiple locations.

If something happens to the cloud, or your working directory gets corrupted for some reason, you have your highly-hardlinked snapshots to recover from. It will then be straightforward to copy the last good snapshot to a new replacement cloud provider.


The last two posts have been included in my general notes (plain text file) on Rsync Backups and Snapshoting.
http://www.ict.griffith.edu.au/antho...c_backup.hints


All times are GMT -5.