Old 07-09-2012, 12:37 PM   #1
barnac1e
Member
 
Registered: Jan 2012
Location: Moorhead, Minnesota, USA (birthplace of Slackware, ironically)
Distribution: openSuSE 13.1 - KDE
Posts: 234
Blog Entries: 1

Rep: Reputation: 9
What's the point of bzip2, and other compression for that matter?


I just bzipped some files, right, expecting this super awesome compressed size, right? Well, how come the output files are slightly larger than the input files? And while I'm at it, bzip2 is not the only compressor to do that; I've seen rar and p7zip do it too. So why even call it compression? lol

I'd really love to know the advantage of using it to compress a large file when it does just the opposite.
 
Old 07-09-2012, 12:41 PM   #2
schneidz
Senior Member
 
Registered: May 2005
Location: boston, usa
Distribution: fc-15/ fc-20-live-usb/ aix
Posts: 3,917

Rep: Reputation: 600
huffman encoding assigns each symbol a variable-length code built from a binary tree, so that the code of one symbol is never the prefix of another (kinda' like morse code, where the common symbols get the short codes); lempel-ziv schemes take a different, dictionary-based approach and replace repeated strings with references to earlier occurrences. with a large, varied sequence of data the encoded output is usually smaller than the original message. with small samples of data it can happen that the header and lookup table plus the encoded data come out slightly larger than the original message.
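
For example, a quick way to see that fixed overhead on the command line (the file names here are just examples):

# a few bytes of input: the compressor's header and tables dominate, so the output grows
echo "hello" > tiny.txt
bzip2 -k tiny.txt            # -k keeps the original around
ls -l tiny.txt tiny.txt.bz2

# ~10 MB of highly repetitive text: it shrinks to a small fraction of its size
yes "the quick brown fox jumps over the lazy dog" | head -c 10M > big.txt
bzip2 -k big.txt
ls -l big.txt big.txt.bz2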
 
1 member found this post helpful.
Old 07-09-2012, 01:08 PM   #3
TobiSGD
Moderator
 
Registered: Dec 2009
Location: Hanover, Germany
Distribution: Main: Gentoo Others: What fits the task
Posts: 15,530
Blog Entries: 2

Rep: Reputation: 4024
http://www.linuxquestions.org/questi...8/#post4675213
 
Old 07-09-2012, 07:30 PM   #4
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728
Independent of the above, how much any compression algorithm can achieve depends on the entropy** of the original file. One example of a high-entropy file is a JPEG that was generated at a relatively low quality setting.

I haven't run a detailed test, but I'd guess that 10MB of recipes would compress better than the same total data volume in just about any photo format, but especially jpeg. Try it....

** "entropy" in this context is just a measure of randomness
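
A rough way to try it (the file names are just placeholders):

# plain text vs. an already-compressed photo; -k keeps the originals
bzip2 -k recipes.txt photo.jpg
ls -l recipes.txt recipes.txt.bz2 photo.jpg photo.jpg.bz2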
 
Old 07-09-2012, 07:36 PM   #5
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,261

Rep: Reputation: 2028
In general, ASCII text compresses a whole lot better than 'binary' type files.
In some cases (as you have discovered) you end up with a bigger file ... these are sometimes known as pathological cases.

In short, the less 'texty'/more 'binary' the file, the more likely this is to happen.
Note also that gzip and bzip2 (and probably others) have options to specify how hard to try to compress (a range of 1-9 for these two tools); better compression takes longer, of course.
http://linux.die.net/man/1/gzip
http://linux.die.net/man/1/bzip2
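
For example (data.tar is just a stand-in for whatever you are compressing):

# fastest vs. hardest gzip setting on the same input
gzip -1 -c data.tar > data.tar.fast.gz
gzip -9 -c data.tar > data.tar.best.gz

# bzip2: per its man page, -1..-9 selects the block size (100k..900k); -9 is the default
bzip2 -1 -c data.tar > data.tar.smallblock.bz2
bzip2 -9 -c data.tar > data.tar.bigblock.bz2
ls -l data.tar*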

HTH
 
Old 07-09-2012, 08:12 PM   #6
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728
I just ran a little test. First I generated a file from /dev/urandom. Assuming that this is perfectly random data, I'd expect a compression algorithm to do nothing.

Results:
-rw-r--r-- 1 mherring users 3818175 Jul 9 20:54 rand_bzip2
-rw-r--r-- 1 mherring users 3801088 Jul 9 20:51 randfile
-rw-r--r-- 1 mherring users 3801910 Jul 9 20:53 rand_gzip

"randfile" is the original---note that the bzip2 and gzip results are both larger---bzip2 being the worst.

I then "de-randomized" the file by putting it through hexdump, which adds offsets and spaces. The de-randomized file is hexrand (larger, as expected).
-rw-r--r-- 1 mherring users 18767881 Jul 9 21:04 hexrand
-rw-r--r-- 1 mherring users 6928688 Jul 9 21:05 hexrand_bz
-rw-r--r-- 1 mherring users 8778265 Jul 9 21:05 hexrand_gz

Note that bzip2 does a better job of compression than gzip.

Tentative conclusion: bzip2 is more efficient, but also has more overhead.
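
For anyone who wants to reproduce this, something along these lines should do it (the dd size and hexdump options here are approximations, not necessarily exactly what I ran):

# a few MB of raw random bytes
dd if=/dev/urandom of=randfile bs=1M count=4

# compress to separate outputs, leaving the original in place
bzip2 -c randfile > rand_bzip2
gzip  -c randfile > rand_gzip

# "de-randomize": hexdump turns the bytes into ASCII hex with offsets and spaces
hexdump -C randfile > hexrand
bzip2 -c hexrand > hexrand_bz
gzip  -c hexrand > hexrand_gz

ls -l randfile rand_bzip2 rand_gzip hexrand hexrand_bz hexrand_gz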
 
Old 07-09-2012, 08:17 PM   #7
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,261

Rep: Reputation: 2028
I believe that's the usual expectation; iirc bzip2 is newer.
However, I'd try with both set to max compression (level 9) just for comparison.
 
Old 07-09-2012, 08:20 PM   #8
jefro
Guru
 
Registered: Mar 2008
Posts: 11,334

Rep: Reputation: 1386
Zipped files have a basic way to checksum the data.
They can also store unique file-system attributes.
 
Old 07-09-2012, 08:20 PM   #9
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728
I was just using:
tar -czvf
and
tar -cjvf

Now to RTFM the actual commands....
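
For what it's worth, -z and -j just hand the archive stream to gzip and bzip2 at their defaults. Piping makes the compressor call explicit if you want to pick the level yourself (the directory name is only an example):

# equivalent to tar -cjvf, but with the bzip2 invocation spelled out
tar -cvf - somedir/ | bzip2 -9 > somedir.tar.bz2

# the gzip version at maximum compression
tar -cvf - somedir/ | gzip -9 > somedir.tar.gz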
 
Old 07-09-2012, 08:28 PM   #10
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728
Using the -9 option made essentially no difference---the "FM" says that 6 is the default---I wonder if that's true when calling it from within tar?
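
One way to check is to sidestep tar's defaults: if I'm reading the man pages right, gzip picks up default options from the GZIP environment variable and bzip2 from BZIP2, so something like this should force the level even when tar does the calling (archive and directory names are placeholders):

# force maximum compression for the compressor that tar spawns
GZIP=-9  tar -czvf archive.tar.gz  somedir/
BZIP2=-9 tar -cjvf archive.tar.bz2 somedir/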
 
1 member found this post helpful.
Old 07-09-2012, 09:00 PM   #11
barnac1e
Member
 
Registered: Jan 2012
Location: Moorhead, Minnesota, USA (birthplace of Slackware, ironically)
Distribution: openSuSE 13.1 - KDE
Posts: 234
Blog Entries: 1

Original Poster
Rep: Reputation: 9
Well, here is what I'd like to use compression software such as bzip2 for, if it worked: sometimes I move large files (totaling around 30 GB at the moment) from one partition to another for various reasons, for example because I want to change the filesystem the original (input) files are on to something else.

Now, it seems logical to me that if you are moving 30 GB of files around and compression software is there, why not use it? So that's what I tried out today with bzip2. I was hoping for a final output size of maybe 20-25 GB thanks to the compression, but on just the first few files I realized that wasn't going to happen, for whatever reason. I had read in the man page that a small file can occasionally come out slightly larger, but some of the individual files I was trying to compress were originally 2 GB, so that "slightly larger" caveat didn't apply to my situation. Now, I must admit that this was my very first compression attempt from the command line on Linux, whereas a year or more ago I would have used rar or even zip on Windows, but even then the input and output files were basically the same size. So why do people even bother? What else could compression apps be used for? I mean, in the case of a tar.xz file, sure, I have noticed the compressed size versus the unpacked size, so that seems right to me. But bzip2 is supposed to be "so great" somehow, and I guess I just don't see it.

So I mean, in everyone's opinion, is that even possible? (Compressing file x to a smaller file y so it can be moved here and there faster?)

Last edited by barnac1e; 07-09-2012 at 09:06 PM. Reason: added info
 
Old 07-09-2012, 09:01 PM   #12
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,261

Rep: Reputation: 2028
I've always assumed it is: a reasonable compromise between compression and speed. Each step up in compression (should) involve more processing time, but with this sort of stuff it depends on each data file individually; you can only generalize up to a point.

If you ran the full spectrum of tests (levels 1-9 in a loop) and timed each run as well as writing separate output files, you could table/graph the results.
It may flatten out after, e.g., level 5, but the curves would(?) look different for plain text vs 'binary' type stuff (e.g. a C executable).
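
A rough sketch of that loop (the input file name is just a placeholder, and it relies on GNU time for the per-run timing):

# try every level, record wall-clock time and output size
for n in 1 2 3 4 5 6 7 8 9; do
    /usr/bin/time -f "level $n: %e s" bzip2 -$n -c testfile > testfile.$n.bz2
    ls -l testfile.$n.bz2
done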
 
Old 07-09-2012, 09:08 PM   #13
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5, Centos 5.10
Posts: 16,261

Rep: Reputation: 2028
@OP: so far I don't think you've told us what kind of content you have. As per my post #5, (generally speaking) the more binary-like the data, the less compression you'll get.
That is in addition to the case you mentioned: very small files, where the encoding overhead can increase the final size.

Also keep in mind that compressing very large files uses a lot of temporary work space; you may run out of disk.
Sometimes, for very large stuff, it's quicker (end-to-end) to just copy it directly rather than spend the time compressing it.
Compression is good for network transfers, but you can sustain some pretty good throughput just copying between local disks.
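
For example, compressing on the fly while sending over the network, versus a plain local copy (the host name and paths are made up):

# network transfer: compress in the pipe, decompress on the far side
tar -cf - somedir/ | bzip2 | ssh otherhost 'bzip2 -d | tar -xf - -C /destination'

# local move between partitions: usually just copy, no compression step
cp -a somedir/ /mnt/otherpartition/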

Last edited by chrism01; 07-09-2012 at 11:42 PM.
 
1 member found this post helpful.
Old 07-09-2012, 09:16 PM   #14
TobiSGD
Moderator
 
Registered: Dec 2009
Location: Hanover, Germany
Distribution: Main: Gentoo Others: What fits the task
Posts: 15,530
Blog Entries: 2

Rep: Reputation: 4024
If your files don't get smaller with bzip2 or other compression tools, then most likely that is not the fault of the tool: the type of data you are compressing simply doesn't compress very well. This can happen, depending on the data. For example, movies, music and pictures can't really be compressed any further. There are usually two reasons for that: either the data is fairly random and therefore not compressible with the standard lossless algorithms, or it is already compressed (mpg, h264, mp3, jpg, ...).

That doesn't mean the algorithm is bad, it is just not a good fit for your specific case. It may give really good compression with different data.
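
A quick way to probe whether a big file is worth compressing at all is to squeeze just the first chunk of it (the 1 MB sample size and the file names are arbitrary):

# if the sample barely shrinks, the whole file probably won't either
head -c 1M somevideo.mkv | gzip | wc -c
head -c 1M recipes.txt   | gzip | wc -c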
 
2 members found this post helpful.
Old 07-09-2012, 11:44 PM   #15
barnac1e
Member
 
Registered: Jan 2012
Location: Moorhead, Minnesota, USA (birthplace of Slackware, ironically)
Distribution: openSuSE 13.1 - KDE
Posts: 234
Blog Entries: 1

Original Poster
Rep: Reputation: 9
Quote:
Originally Posted by TobiSGD View Post
If your files don't get smaller with bzip2 or other compression tools, then most likely that is not the fault of the tool: the type of data you are compressing simply doesn't compress very well. This can happen, depending on the data. For example, movies, music and pictures can't really be compressed any further. There are usually two reasons for that: either the data is fairly random and therefore not compressible with the standard lossless algorithms, or it is already compressed (mpg, h264, mp3, jpg, ...).

That doesn't mean the algorithm is bad, it is just not a good fit for your specific case. It may give really good compression with different data.
OMG! Tobi, you are a genius! That's exactly what they were: h264 video files, .jpg images, and some others! That explains it then! Man, I would never have put those two things together like you did. Thanks a lot!

So in conclusion, you really aren't going to get any faster results or smaller files, regardless of the compression software; considering the types of files I am moving, it's basically as good as it gets?

Brilliant, everyone (Tobi, chrism and pixellany)! Thanks for chipping in to make me realize all this! I totally blanked on the media files being compressed already.

Last edited by barnac1e; 07-09-2012 at 11:49 PM.
 
  

