does tar or bzip2 squash duplicate or near-duplicate files?
I've been playing around with my backup script. It used to compress /home into a single large tar.bz2 file about 4G in size (tar -cjf /tmp/backups/home.tar.bz2 /home/* ).
I thought I'd try backing up individual directories instead. In the first cut, there are only two significant directories - one for shared folders and the other for samba profiles, etc. The Linux user homes are very small, since I'm the only one who uses a Linux shell. Most users don't have a Linux home directory.
The code for the archiving is:
cd /home
for dir in *; do
    if [ -d "$dir" ]; then
        tar -cjf "/tmp/backups/$dir.tar.bz2" "$dir"
    fi
done
On the test run, the two big directories compressed to more than 5G - about 30% larger than just having one archive and too large to fit on a DVD-RAM disk.
My question is why? My first guess is that the users work on files on their local drives and then copy the completed file to the network share. If bzip2 sees all the files together, it might be able to squash the near-duplicate drafts down to almost nothing, but when I split the share archive off from the profile archive, it can't do that.
This would be truly impressive since it would seem to require bzip2 to track patterns through the entire archive.
Is that what is happening? Or is there some other explanation for the increased size?
Most compression works by eliminating repeating patterns throughout the file in a consistent manner. Compressing more data together usually gives a better ratio, because the compressor finds more shared patterns and the result carries less redundancy than several separate archives would. (Note, though, that bzip2 only looks for patterns within one compression block of at most 900 KB at a time, so it can't deduplicate across a whole multi-gigabyte archive.)
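A hypothetical demo of the effect (not from the thread): two identical files compressed into one bzip2 archive take noticeably less space than the same files compressed separately, as long as both copies fit inside bzip2's 900 KB block.

```shell
# Hypothetical demo: duplicate data squashes well when archived together.
# Files are kept well under bzip2's 900 KB block size so the duplicate
# is actually visible to the compressor.
workdir=$(mktemp -d)
head -c 200000 /dev/urandom > "$workdir/draft"
cp "$workdir/draft" "$workdir/final"          # exact duplicate

tar -cjf "$workdir/together.tar.bz2" -C "$workdir" draft final
tar -cjf "$workdir/draft.tar.bz2"    -C "$workdir" draft
tar -cjf "$workdir/final.tar.bz2"    -C "$workdir" final

# together.tar.bz2 ends up much smaller than the two split archives combined
ls -l "$workdir"/*.tar.bz2
```

With multi-megabyte user files, the duplicate copies no longer share a block, which is why splitting (or not splitting) a big archive matters less than one might expect.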
It turns out that someone had actually increased the amount of space they were using rather dramatically. That's what led to the increased size - it was just a weird coincidence that it happened the same time I changed my script.
This leaves me with the problem of trying to fit the extra data into the DVD-RAM disk. I'm probably going to have to create an archive set and just back up the newer files from now on.
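GNU tar can build that kind of archive set directly with --listed-incremental. A hypothetical sketch, using a throwaway directory instead of /home (the share names and snapshot file name are made up):

```shell
# Hypothetical sketch of an incremental archive set with GNU tar.
# The snapshot file records what was dumped, so later runs archive
# only files that changed since the previous run.
workdir=$(mktemp -d)
mkdir "$workdir/share"
echo "first draft" > "$workdir/share/report.txt"
sleep 1

# Level-0 (full) backup; this also creates the snapshot file
tar -cjf "$workdir/share-full.tar.bz2" \
    --listed-incremental="$workdir/share.snar" -C "$workdir" share
sleep 1

# A new file appears; the next run picks up only the change
echo "meeting notes" > "$workdir/share/notes.txt"
tar -cjf "$workdir/share-incr.tar.bz2" \
    --listed-incremental="$workdir/share.snar" -C "$workdir" share

# The incremental archive carries notes.txt but not the unchanged report.txt
tar -tf "$workdir/share-incr.tar.bz2"
```

Restoring means extracting the full archive first, then each incremental in order with the same option.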
Found an easier solution. It turns out lzma is a lot better than bzip2 for compressing files. Switching to it saved about 20%, which gives me a comfortable margin for the next little while. The only changes I needed to make to my script were to change the "j" option to "a" (automatic) on the tar command line and rename the archive to end in .lzma.
According to what I've read, lzma is somewhat slower at compressing, but the better ratio is worth it in my case. It's apparently also faster at decompressing, which should save me time when I need to recover files.
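For reference, GNU tar's -a (--auto-compress) switch picks the compressor from the archive name's suffix. A hypothetical self-contained demo (it uses .gz here only because the standalone lzma tool isn't installed everywhere; a .lzma suffix selects lzma by the same mechanism):

```shell
# Hypothetical demo of tar's -a / --auto-compress option:
# the compressor is chosen from the archive file's suffix.
workdir=$(mktemp -d)
mkdir "$workdir/docs"
echo "hello hello hello" > "$workdir/docs/note.txt"

# .gz suffix -> gzip is chosen automatically (no -z or -j flag needed)
tar -caf "$workdir/docs.tar.gz" -C "$workdir" docs

# gzip output starts with the magic bytes 1f 8b
head -c 2 "$workdir/docs.tar.gz" | od -An -tx1   # -> 1f 8b
```

So switching compressors really is just a matter of renaming the target archive, exactly as described above.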