gzip directories: 1GB split, with CPU limit? (partial solution figured out)
Hello,
I'm attempting to back up and compress a website on a web server in preparation for sending it to Amazon Simple Storage Service (S3). I've run into two challenges: files need to be smaller than 5GB before going to S3 (hence the split), and when I run the backup it takes up a LOT of CPU, which I want to cap.
Here's what I've got so far.
Original script snippet (CPU sucker, creates a single 10GB file):
Code:
DirsToBackup[0]='/var/www/vhosts/thewebsite.com'
TmpBackupDir='/home/user/backup'
TodayDate=`date --date="today" +%d-%m-%y`
Today_TmpBackupDir=$TmpBackupDir'/'$TodayDate
## Make the backup directory (Also make it writable)
echo ''
echo 'Making Directory: '$Today_TmpBackupDir
mkdir $Today_TmpBackupDir
chmod 0777 $Today_TmpBackupDir
## GZip the directories and put them into the backups folder
echo ''
for i in "${DirsToBackup[@]}"
do
filename='dir-'`echo $i | tr '/' '_'`'.tar.gz'
echo 'Backing up '$i' to '$Today_TmpBackupDir'/'$filename
tar -czpPf $Today_TmpBackupDir'/'$filename $i
done
I'm told this version (untested) may cap the tar stream at 5MB per second, which would also limit CPU usage:
Code:
DirsToBackup[0]='/var/www/vhosts/thewebsite.com'
TmpBackupDir='/home/user/backup'
TodayDate=`date --date="today" +%d-%m-%y`
Today_TmpBackupDir=$TmpBackupDir'/'$TodayDate
## Make the backup directory (Also make it writable)
echo ''
echo 'Making Directory: '$Today_TmpBackupDir
mkdir $Today_TmpBackupDir
chmod 0777 $Today_TmpBackupDir
## GZip the directories and put them into the backups folder
echo ''
for i in "${DirsToBackup[@]}"
do
filename='dir-'`echo $i | tr '/' '_'`'.tar.gz'
echo 'Backing up '$i' to '$Today_TmpBackupDir'/'$filename
tar -czpPf $Today_TmpBackupDir'/'$filename | pv -L 5m >$i
done
Now, how would I combine the following example with the code above?
# create archives
$ tar cz my_large_file_1 my_large_file_2 | split -b 1024MiB - myfiles_split.tgz_
# uncompress another time
$ cat myfiles_split.tgz_* | tar xz
Would it look something like this?
Code:
DirsToBackup[0]='/var/www/vhosts/thewebsite.com'
TmpBackupDir='/home/user/backup'
TodayDate=`date --date="today" +%d-%m-%y`
Today_TmpBackupDir=$TmpBackupDir'/'$TodayDate
## Make the backup directory (Also make it writable)
echo ''
echo 'Making Directory: '$Today_TmpBackupDir
mkdir $Today_TmpBackupDir
chmod 0777 $Today_TmpBackupDir
## GZip the directories and put them into the backups folder
echo ''
for i in "${DirsToBackup[@]}"
do
filename='dir-'`echo $i | tr '/' '_'`'.tar.gz'
echo 'Backing up '$i' to '$Today_TmpBackupDir'/'$filename
tar -czpPf $Today_TmpBackupDir'/'$filename | pv -L 5m > | split -b 1024MiB - $i_split.tgz_
done
Thanks for taking a look! My head is starting to spin.
From a syntax point of view, the ">" in the pv command is wrong. Just remove it.
A bigger problem though: this code creates the tar file named $Today...(etc) and then pipes tar's stdout — which carries only tar's messages, not the archive — into pv and then split. I don't think you want to split tar's messages! Instead, write the archive to stdout by replacing the $Today... filename with a dash. This should at least be a big step forward, or perhaps the solution.
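To illustrate the dash trick (untested on your data; the sketch below builds a throwaway directory so the whole pipeline is visible end to end — pv -L 5m would slot in between tar and split):

```shell
# Sketch: with '-f -', tar writes the archive itself to stdout, so the
# pipe carries archive bytes and split produces usable parts.
set -e
work=$(mktemp -d)
mkdir -p "$work/site"
echo hello > "$work/site/index.html"

# Archive to stdout, split into 1 MB parts (1 GB in the real script).
tar -czf - -C "$work" site | split -b 1m - "$work/site.tgz_"

# Round trip: the concatenated parts are a valid archive again.
listing=$(cat "$work"/site.tgz_* | tar -tzf -)
echo "$listing"
```

The same shape works with any part size `split -b` accepts.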
According to this blog, it looks like I may need to gzip everything first, before splitting it up. If this is true and combined with your great tips, maybe this would look a touch better (untested still, will play tomorrow):
Code:
...
for i in "${DirsToBackup[@]}"
do
filename='dir-'`echo $i | tr '/' '_'`'.tar.gz'
echo 'Backing up '$i' to '$Today_TmpBackupDir'/'$filename
tar -czpPf $Today_TmpBackupDir'/'$filename | pv -L 5m $i
split -b 1024m $Today_TmpBackupDir'/'$filename part-
done
...
Some comments if I may. tar with compression does hog CPU (it's mostly the gzip stage). You could use 'nice' and 'ionice' (if available), but then the job just takes longer; it's a trade-off. Also, compression is not a requirement: you can split any regular file. There are other compressors too, like xz, 7zip and bzip2. Compression should be chosen according to file contents: it makes little sense to compress files that are in already-compressed formats like MP3 or SWF. Finally, I'd hash both the original tarball and the split parts: if an archive doesn't work after transport, you only have to re-transfer the one part whose hash doesn't match. HTH
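A minimal sketch of the hashing idea, assuming GNU coreutils' sha256sum (any of the md5sum/sha*sum tools works the same way; the tiny payload and 512-byte split size are just for demonstration):

```shell
# Sketch: hash the whole archive plus each split part, so a corrupt
# transfer can be narrowed down to a single part and re-sent.
set -e
work=$(mktemp -d)
printf 'some payload\n' > "$work/file"
tar -czf "$work/archive.tgz" -C "$work" file
split -b 512 "$work/archive.tgz" "$work/archive.tgz_part_"

# One checksum file covering the archive and all of its parts.
( cd "$work" && sha256sum archive.tgz archive.tgz_part_* > SHA256SUMS )

# After transport: '-c' re-verifies every listed file and reports
# 'OK' or 'FAILED' per file, so only mismatched parts need resending.
check=$(cd "$work" && sha256sum -c SHA256SUMS)
echo "$check"
```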
For me at this time, compression type is not as important as the number of files being sent to Amazon S3 (for recovery time and per-request costs). I have lots of text files and thousands of pre-compressed images.
The length of time required to wrangle the files is not an issue for me, as it will all happen while I sleep and hopefully stay out of the way. I guess that could change as web projects grow.
I appreciate your comment about tar being a CPU hog. I'll definitely look into installing ionice (thanks for sharing!!) and/or making use of nice. Do you know whether xz, 7zip or bzip2 are any lighter on the CPU?
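If I'm reading the man pages right, wrapping the tar call would look roughly like this (sketch only, on throwaway data; ionice is part of util-linux, and the guard below skips it where it isn't installed):

```shell
# Sketch: lower the backup's scheduling priority instead of (or as well
# as) rate-limiting the pipe. 'nice -n 19' is the lowest CPU priority;
# 'ionice -c 3' puts IO in the idle class, so the disk is used only
# when nothing else wants it.
set -e
work=$(mktemp -d)
mkdir -p "$work/site"
echo data > "$work/site/a.txt"

IONICE=''
command -v ionice >/dev/null && IONICE='ionice -c 3'

nice -n 19 $IONICE tar -czf "$work/site.tgz" -C "$work" site
ls -l "$work/site.tgz"
```

The job runs as long as (or longer than) before, but the kernel keeps it out of the way of everything else.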
Let me point out that this code will create the split archive as you intend, but the pv command won't limit IO. You are piping tar's stdout into pv, not the archive. In other words, you can leave out the part after the pipe sign.
Note I didn't check this. The main point: in the tar command string, use the dash as the archive name. This writes the archive to stdout, so that it can be piped into pv and ultimately split.
If $filename is small, though, I doubt there will be much IO to limit anyway.
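Pulling the thread's pieces together, I think the loop ends up roughly like this (untested sketch: throwaway demo paths stand in for the real web root and backup directory, the real script loops over a bash array while one directory suffices here, and pv falls back to cat when it isn't installed):

```shell
# Sketch of the combined fix: '-f -' sends the archive to stdout,
# pv (if present) caps the stream at 5 MB/s, split writes the parts.
set -e
demo=$(mktemp -d)                 # stands in for the real filesystem
mkdir -p "$demo/vhost"            # stands in for /var/www/vhosts/thewebsite.com
echo hi > "$demo/vhost/index.html"

DirsToBackup="$demo/vhost"
Today_TmpBackupDir="$demo/backup/$(date +%d-%m-%y)"
mkdir -p "$Today_TmpBackupDir"

# Rate-limit helper: use pv when available, otherwise pass through.
throttle() { if command -v pv >/dev/null; then pv -L 5m; else cat; fi; }

for i in $DirsToBackup; do
    filename="dir-$(echo "$i" | tr '/' '_').tar.gz"
    echo "Backing up $i to $Today_TmpBackupDir/$filename"
    # Archive -> throttle -> 1 GB parts (1024m, as in the thread).
    tar -czpPf - "$i" | throttle | split -b 1024m - "$Today_TmpBackupDir/${filename}_"
done

# Restore check: parts concatenate back into a valid archive.
cat "$Today_TmpBackupDir/${filename}_"* | tar -tzf - >/dev/null && echo OK
```

Each part then goes to S3 individually, and `cat parts* | tar -xz` restores on the other side.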