Old 12-09-2013, 06:11 PM   #1
Ultrus
Member
 
Registered: Jan 2006
Posts: 50

Rep: Reputation: 15
gzip directories: 1GB split, with CPU limit? (partial solution figured out)


Hello,
I'm attempting to back up and compress a website on a web server in preparation for sending it to Amazon Simple Storage Service (S3). I've run into two challenges: files need to be smaller than 5GB before sending to S3 (hence the split), and when I run the backup it takes up a LOT of CPU power, which I want to cap.

Here's what I've got so far.

Original script snippet (CPU sucker, 10GB file being created):

Code:
DirsToBackup[0]='/var/www/vhosts/thewebsite.com'

TmpBackupDir='/home/user/backup'

TodayDate=`date --date="today" +%d-%m-%y`

Today_TmpBackupDir=$TmpBackupDir'/'$TodayDate

## Make the backup directory (Also make it writable)
echo ''
echo 'Making Directory: '$Today_TmpBackupDir
mkdir $Today_TmpBackupDir
chmod 0777 $Today_TmpBackupDir

## GZip the directories and put them into the backups folder
echo ''
for i in "${DirsToBackup[@]}"
do
	filename='dir-'`echo $i | tr '/' '_'`'.tar.gz'
	echo 'Backing up '$i' to '$Today_TmpBackupDir'/'$filename
	tar -czpPf $Today_TmpBackupDir'/'$filename $i
	
done
I'm told this version (untested) may cap the tar process at 5 MB per second, also limiting CPU usage:

Code:
DirsToBackup[0]='/var/www/vhosts/thewebsite.com'

TmpBackupDir='/home/user/backup'

TodayDate=`date --date="today" +%d-%m-%y`

Today_TmpBackupDir=$TmpBackupDir'/'$TodayDate

## Make the backup directory (Also make it writable)
echo ''
echo 'Making Directory: '$Today_TmpBackupDir
mkdir $Today_TmpBackupDir
chmod 0777 $Today_TmpBackupDir

## GZip the directories and put them into the backups folder
echo ''
for i in "${DirsToBackup[@]}"
do
	filename='dir-'`echo $i | tr '/' '_'`'.tar.gz'
	echo 'Backing up '$i' to '$Today_TmpBackupDir'/'$filename
	tar -czpPf $Today_TmpBackupDir'/'$filename | pv -L 5m >$i
	
done
Now, how would I combine the following example with the code above?

source: http://stackoverflow.com/questions/1...z-zip-or-bzip2

Code:
# create archives
$ tar cz my_large_file_1 my_large_file_2 | split -b 1024MiB - myfiles_split.tgz_
# uncompress another time
$ cat myfiles_split.tgz_* | tar xz
Would it look something like this?

Code:
DirsToBackup[0]='/var/www/vhosts/thewebsite.com'

TmpBackupDir='/home/user/backup'

TodayDate=`date --date="today" +%d-%m-%y`

Today_TmpBackupDir=$TmpBackupDir'/'$TodayDate

## Make the backup directory (Also make it writable)
echo ''
echo 'Making Directory: '$Today_TmpBackupDir
mkdir $Today_TmpBackupDir
chmod 0777 $Today_TmpBackupDir

## GZip the directories and put them into the backups folder
echo ''
for i in "${DirsToBackup[@]}"
do
	filename='dir-'`echo $i | tr '/' '_'`'.tar.gz'
	echo 'Backing up '$i' to '$Today_TmpBackupDir'/'$filename
	tar -czpPf $Today_TmpBackupDir'/'$filename | pv -L 5m > | split -b 1024MiB - $i_split.tgz_
	
done

Thanks for taking a look! My head is starting to spin.

Best regards,
 
Old 12-09-2013, 06:36 PM   #2
berndbausch
LQ Addict
 
Registered: Nov 2013
Location: Tokyo
Distribution: Mostly Ubuntu and Centos
Posts: 6,316

Rep: Reputation: 2002
Code:
tar -czpPf $Today_TmpBackupDir'/'$filename | pv -L 5m > | split -b 1024MiB - $i_split.tgz_
From a syntax point of view, the ">" in the pv command is wrong. Just remove it.
A bigger problem, though: this code creates the tar archive in the file named $Today...(etc) and pipes tar's stdout into pv and then split. I don't think you want to split tar's messages! Instead, write the tar archive to stdout by replacing the $Today... filename with a dash. That should at least be a big step forward, or perhaps the solution.
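Untested, but reusing your variable names it would look roughly like this:

Code:
tar -czpPf - $i | pv -L 5m | split -b 1024m - $Today_TmpBackupDir'/'$filename'_'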

Thanks for letting me know about the pv command!
 
Old 12-09-2013, 08:44 PM   #3
Ultrus
Member
 
Registered: Jan 2006
Posts: 50

Original Poster
Rep: Reputation: 15
Thanks for the feedback berndbausch.

According to this blog, it looks like I may need to gzip everything first, before splitting it up. If that's true, then combined with your great tips maybe this would look a touch better (still untested; I will play with it tomorrow):

Code:
...

for i in "${DirsToBackup[@]}"
do
	filename='dir-'`echo $i | tr '/' '_'`'.tar.gz'
	echo 'Backing up '$i' to '$Today_TmpBackupDir'/'$filename
	tar -czpPf $Today_TmpBackupDir'/'$filename | pv -L 5m $i
	split -b 1024m $Today_TmpBackupDir'/'$filename part-
	
done

...
 
Old 12-10-2013, 01:51 AM   #4
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,415
Blog Entries: 55

Rep: Reputation: 3600
Some comments, if I may. tar (with compression) does hog CPU. You could use 'nice' and 'ionice' (if available), but then the job just takes longer; it's a trade-off. Also, compressing is not a requirement: you can split any regular file. There are also other compressors, like xz, 7zip and bzip2. Compression should be chosen according to file contents: it doesn't make sense to compress files that are already in compressed formats like MP3 or SWF. Finally, I'd hash both the original tarball and the split parts: if an archive doesn't work after transport, you only have to re-transfer the part whose hash doesn't match. HTH
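For example, reusing your variable names (untested; ionice assumes an I/O scheduler that supports the idle class, e.g. CFQ):

Code:
## run the archiving at lowest CPU priority and idle I/O priority
nice -n 19 ionice -c 3 tar -czpPf $Today_TmpBackupDir'/'$filename $i
## split the finished archive into 1GB parts
split -b 1024m $Today_TmpBackupDir'/'$filename $Today_TmpBackupDir'/'$filename'_part-'
## hash the tarball and the parts so a bad transfer only means re-sending one part
md5sum $Today_TmpBackupDir'/'$filename $Today_TmpBackupDir'/'$filename'_part-'* > $Today_TmpBackupDir'/'$filename'.md5'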
 
Old 12-10-2013, 08:37 AM   #5
Ultrus
Member
 
Registered: Jan 2006
Posts: 50

Original Poster
Rep: Reputation: 15
Hello unSpawn,
Thanks for your feedback!

For me at this time, compression type is not as important as the number of files being sent to Amazon S3 (for recovery time and # request costs). I have lots of text files, thousands of pre-compressed images.

The length of time required to wrangle the files is not an issue for me, as it will all happen while I sleep and hopefully stay out of the way. I guess that could change as web projects grow.

I appreciate your comment about tar being a CPU hog. I'll definitely look into installing ionice (thanks for sharing!!) and/or making use of nice. Are you aware of xz, 7zip or bzip2 not being a CPU hog?

Catching up on hash now...

Thanks!
 
Old 12-10-2013, 01:00 PM   #6
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,415
Blog Entries: 55

Rep: Reputation: 3600
Quote:
Originally Posted by Ultrus View Post
Are you aware of xz, 7zip or bzip2 not being a CPU hog?
No.


Quote:
Originally Posted by Ultrus View Post
Catching up on hash now...
See 'man md5sum' or just
Code:
md5sum /path/file
# or
openssl md5 /path/file
# as in
find /path -type f -print0 | xargs -0 -iX md5sum 'X' > /path/hashes.md5
# or
md5deep -r /path > /path/hashes.md5
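# to verify after transport (assuming the files are still at the same paths):
md5sum -c /path/hashes.md5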
 
Old 12-10-2013, 05:31 PM   #7
berndbausch
LQ Addict
 
Registered: Nov 2013
Location: Tokyo
Distribution: Mostly Ubuntu and Centos
Posts: 6,316

Rep: Reputation: 2002
Quote:
Originally Posted by Ultrus View Post
Code:
	tar -czpPf $Today_TmpBackupDir'/'$filename | pv -L 5m $i
	split -b 1024m $Today_TmpBackupDir'/'$filename part-
Let me point out that this code will create the split archive as you intend, but the pv command won't limit I/O: you are piping tar's stdout (its messages) into pv, not the archive itself. In other words, you can leave out everything after the pipe sign.
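That is (untested), the loop body would reduce to:

Code:
tar -czpPf $Today_TmpBackupDir'/'$filename $i
split -b 1024m $Today_TmpBackupDir'/'$filename part-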
 
Old 12-11-2013, 10:04 AM   #8
Ultrus
Member
 
Registered: Jan 2006
Posts: 50

Original Poster
Rep: Reputation: 15
Hmmmm. Is there a way to keep the IO limiting in there somehow? I'm looking into ionice now...
 
Old 12-12-2013, 12:51 AM   #9
berndbausch
LQ Addict
 
Registered: Nov 2013
Location: Tokyo
Distribution: Mostly Ubuntu and Centos
Posts: 6,316

Rep: Reputation: 2002
Quote:
Originally Posted by Ultrus View Post
Hmmmm. Is there a way to keep the IO limiting in there somehow? I'm looking into ionice now...
Something like
Code:
tar -czpPf - $i | pv -L 5m | split -b 1024m - $Today_TmpBackupDir'/'$filename'_part-'
Note that I didn't check this. The main point: use a dash as the archive name in the tar command. That writes the archive to stdout, so it can be piped into pv and ultimately into split.
If $filename is small, though, I doubt there will be much I/O limiting.
 
  

