Old 11-18-2009, 12:37 PM   #1
garydale
Member
 
Registered: Feb 2007
Posts: 142

Rep: Reputation: 23
does tar or bzip2 squash duplicate or near-duplicate files?


I've been playing around with my backup script. It used to compress /home into a single large tar.bz2 file about 4G in size (tar -cjf /tmp/backups/home.tar.bz2 /home/*).

I thought I'd try backing up individual directories instead. In the first cut, there are only two significant directories - one for shared folders and the other for samba profiles, etc. The Linux user homes are very small, since I'm the only one who uses a Linux shell. Most users don't have a Linux home directory.

The code for the archiving is:
cd /home
for dir in *; do
    if [ -d "$dir" ]; then
        # quote $dir so directory names containing spaces don't break the loop
        tar -cjf "/tmp/backups/$dir.tar.bz2" "$dir"
    fi
done

On the test run, the two big directories compressed to more than 5G - about 30% larger than just having one archive and too large to fit on a DVD-RAM disk.

My question is: why? My first guess is that the users work on files on their local drives and then copy the completed file to the network share. If bzip2 gets all the files together, it might be able to squash the near-duplicate drafts down to almost nothing, but when I split the share archive off from the profile archive, it can't do that.

This would be truly impressive since it would seem to require bzip2 to track patterns through the entire archive.

Is that what is happening? Or is there some other explanation for the increased size?
 
Old 11-18-2009, 02:13 PM   #2
rweaver
Senior Member
 
Registered: Dec 2008
Location: Louisville, OH
Distribution: Debian, CentOS, Slackware, RHEL, Gentoo
Posts: 1,833

Rep: Reputation: 167
Most compression works by eliminating repeating patterns throughout the file. Compressing more data together usually gives a better ratio, because the compressor finds more shared patterns, and the resulting file carries less redundancy than if you had split the same data across multiple archives.
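
A quick way to see this in action (a rough sketch assuming GNU tar, bzip2, and coreutils; the file names are just examples). One caveat: bzip2 compresses in independent blocks of at most 900 KB, so it can only squash duplicates that land in the same block; it does not track patterns across the whole archive.

mkdir -p /tmp/dup-test && cd /tmp/dup-test

# two near-identical 100 KB files, well under bzip2's 900 KB block size
head -c 100000 /dev/urandom > draft.doc
cp draft.doc final.doc
echo "small edit" >> final.doc

# solid archive: both files share one compression stream
tar -cjf together.tar.bz2 draft.doc final.doc

# split archives: each file is compressed in isolation
tar -cjf draft-only.tar.bz2 draft.doc
tar -cjf final-only.tar.bz2 final.doc

# together.tar.bz2 should come out roughly half the combined size of the split pair
ls -l *.tar.bz2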
 
Old 11-18-2009, 02:27 PM   #3
linus72
LQ Guru
 
Registered: Jan 2009
Location: Gordonsville-AKA Mayberry-Virginia
Distribution: Slack14.2/Many
Posts: 5,573

Rep: Reputation: 470
Hello garydale

Was wondering whether you had tried squashfs, squashfs-lzma, or dir2lzm?

I usually use squashfs-lzma, but on big files it takes a lot longer to compress than squashfs/dir2lzm.
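
For reference, the basic usage is simple (a minimal sketch, assuming mksquashfs from squashfs-tools or the squashfs-lzma fork is on the PATH; the paths are just examples):

# pack the whole tree into one compressed, read-only image
mksquashfs /home /tmp/backups/home.sqsh

# the image can later be inspected by loop-mounting it read-only
mkdir -p /mnt/backup
mount -o loop -t squashfs /tmp/backups/home.sqsh /mnt/backup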
 
Old 11-18-2009, 05:27 PM   #4
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751
Both bzip2 (http://linux.die.net/man/1/bzip2) and gzip (http://linux.die.net/man/1/gzip) have options (values 1-9) to specify the level of compression. The higher the level, the longer compression takes, but the smaller the output. Your choice.
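
For example (a sketch; note the two tools have different defaults, gzip starts at -6 while bzip2 already defaults to -9):

# gzip: push from the default -6 up to -9 for a smaller archive
tar -cf - /home | gzip -9 > /tmp/backups/home.tar.gz

# bzip2: already -9 by default; drop to -1 to trade ratio for speed
tar -cf - /home | bzip2 -1 > /tmp/backups/home.tar.bz2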
 
Old 11-19-2009, 08:11 AM   #5
garydale
Member
 
Registered: Feb 2007
Posts: 142

Original Poster
Rep: Reputation: 23
It turns out that someone had actually increased the amount of space they were using, rather dramatically. That's what caused the increased size; it was just a weird coincidence that it happened at the same time I changed my script.

This leaves me with the problem of trying to fit the extra data onto the DVD-RAM disk. I'm probably going to have to create an archive set and just back up the newer files from now on (see the sketch below).
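
One way to do that with GNU tar is --listed-incremental (a sketch; the snapshot-file path is just an example):

# first run is a full dump; tar records each file's state in home.snar
tar --listed-incremental=/var/backups/home.snar -cjf /tmp/backups/home-full.tar.bz2 /home

# later runs against the same snapshot file archive only what changed
tar --listed-incremental=/var/backups/home.snar -cjf /tmp/backups/home-incr.tar.bz2 /home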

Thanks.
 
Old 11-19-2009, 03:06 PM   #6
garydale
Member
 
Registered: Feb 2007
Posts: 142

Original Poster
Rep: Reputation: 23
Found an easier solution. It turns out lzma compresses a lot better than bzip2. Switching to it saved about 20%, which gives me a comfortable margin for the next little while. The only changes I needed to make to my script were to change the "j" option to "a" (--auto-compress) on the tar command line and to rename the archives to end in .lzma; the revised loop is below.
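
Revised loop (assuming GNU tar 1.20 or later, which understands --auto-compress):

cd /home
for dir in *; do
    if [ -d "$dir" ]; then
        # -a picks the compressor from the archive suffix, here .lzma
        tar -caf "/tmp/backups/$dir.tar.lzma" "$dir"
    fi
done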

From what I've read, lzma is somewhat slower at compressing, but the better compression is worth it in my case. It's apparently also faster at decompression, which should save me time when I need to recover files.
 
Old 11-19-2009, 04:43 PM   #7
linus72
LQ Guru
 
Registered: Jan 2009
Location: Gordonsville-AKA Mayberry-Virginia
Distribution: Slack14.2/Many
Posts: 5,573

Rep: Reputation: 470
dir2lzm is about the same as squashfs-lzma, I think.

what version of squashfs-lzma are you using?

You can get the latest squashfs-lzma and dir2lzm from any Slax iso/tar.gz, in the slax/tools folder; that is squashfs-lzma 4.0.

check this out too
http://www.squashfs-lzma.org/
 
  



Tags
bzip2, tar


