Old 09-03-2009, 01:36 PM   #1
brianpbarnes
How to effectively compress a 'dd' system disk image formatted to ext3?


Hi,
The only time I reboot my Suse 11.1 server is when a kernel update requires it. When I do, I sometimes boot from the Ubuntu live CD because it has dd on it. I cd to /dev/disk/by-id, find the ID of a suitable data drive, add an fstab entry for it and mount it.

Dd is great in that it runs at very nearly the hardware's maximum speed; my Raptor 74s report 68 MB/s copying a device (my root drive) to a file on a much larger data drive.

Df on the slash drive reports 18 GB total and ~9 GB free, and dd records the entire 18 GB. When I compress it with:
bzip2 --stdout --compress --best --keep /vter/vulcan.sys.sdc3.20090902.dd > /dat/vulcan.sys.sdc3.20090902.dd.bz2
it only compresses down to 17.2 GB, and that is after hours of grinding, even though 9 GB of it is supposedly empty space and bzip2 is running with --best. System drives usually compress to ~50% of the used space; with only ~9 GB actually in use, a 2:1 squeeze should give me a good chance of fitting the entire image on a DVD.
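Side thought: the same dump could probably be piped straight through the compressor so the raw 18 GB image never has to land on the data drive first. A minimal sketch, with the device and file names as placeholders rather than my exact ones:

Code:
# read the root partition and compress on the fly (example device and paths)
dd if=/dev/sdc3 bs=4M | bzip2 --best > /dat/root-image.dd.bz2

# restore later by reversing the pipe
bunzip2 --stdout /dat/root-image.dd.bz2 | dd of=/dev/sdc3 bs=4M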

I can only conjecture that my contents are scattered all over the drive with the unused half hidden within the occupied space, defeating the compression process.

Is there a trick to 'defragmenting' an ext3 drive to consolidate its free space? Every ~27 reboots, Suse insists on running an fsck on a drive, and at the end it always reports 5%-7% fragmentation.

Wikipedia reports the existence of an 'e2defrag' utility which does not work on ext3, and states that there may be an online defragmenter for ext4. Neither Ubuntu 9.04 nor Suse 11.1 contains any executable with 'defrag' in its name. Most references I can find state clearly that ext file systems do not need defragmentation the way NTFS file systems do.
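For what it's worth, the fragmentation figure can apparently be checked without a dedicated defragger, using tools that ship with e2fsprogs. A rough sketch (the device name is only an example):

Code:
# read-only check of an UNMOUNTED ext3 partition; the summary line reports the
# "non-contiguous" percentage (the same 5%-7% figure fsck prints at boot)
e2fsck -fn /dev/sdc3

# per-file extent count on a mounted filesystem (also from e2fsprogs)
filefrag /var/log/messages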

I found one utility called ShAkE, but it has only 22 reviews and a 3/5 rating, and the newsgroup articles about it are all 4+ years old. Is there a better compression utility for system images? Mondoarchive is out of the question: it rarely produces anything other than a large error file and has never, ever restored an image on my 64-bit hardware.

I am running 7-Zip on the image now. It is 22% complete after 2 hours and has written 3.7 GB, so it looks like it will come out around 17 GB too, after 9 hours of grinding. Aye Carumba!
date; 7z a -tzip /dat/vulcan.sys.sdc3.20090902.dd.zip /vter/vulcan.sys.sdc3.20090902.dd -mx=9; date

Is there any explanation, other than severe fragmentation, for this near-total lack of compressibility of a Suse 11.1 system disk formatted to ext3?

Puzzled,

BrianP
 
Old 09-03-2009, 03:05 PM   #2
lakris
Quote:
Originally Posted by brianpbarnes
Is there any explanation, other than severe fragmentation, for this near-total lack of compressibility of a Suse 11.1 system disk formatted to ext3?

Puzzled,

BrianP
Hi, I'm not entirely sure why You shouldn't be able to compress it more; after all, it's just bits and bytes. I don't have any experience with defrag either, I have never come across a situation where I thought it would be needed.

But one thing I do know is that all Your unused space also contains bits and bytes; they're probably mostly deleted files, really. The inode (file system entry) has been removed, but the data remains on disk until it is overwritten by some new file. You could try this before dd'ing the disk:

Code:
# fill all free space on / with zeros (dd stops when the disk is full), then flush and delete the fill file
dd if=/dev/zero of=/crap bs=20971520
sync
rm /crap
It will create a file that uses up all of Your free disk space, writing zeros over every "free" block. Once it is done and deleted, all of the free space holds the same byte value and stands a much better chance of being compressed.
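A quick sanity check around it could look like this (the mount point is just an example):

Code:
df -h /     # before the fill: note how much space is "free"
# ... run the dd / sync / rm lines above ...
df -h /     # afterwards the free figure is back, but the blocks behind it now hold zeros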

Best regards,
Lakris

PS Awaiting statistics

 
Old 09-03-2009, 09:08 PM   #3
brianpbarnes
Free space is available, but not empty!

Lakris,
That explains it exactly: dd copied all of the junk sitting in what the file system lists as available space.

I am still surprised by 2 things, though.

1) The disk has never been more than roughly half used. It's a system disk: set it up once and forget it for three years, except for software updates. It's hard to believe the updates amounted to as much data as the original DVD install. I do a fresh format with every install, but that may not zero the reused space. Possible.

2) I always use max compression when making system images, and I almost always see ~2:1 compression with Norton Ghost, Acronis, Reflect, tar -cz, mondo, etc. I would have expected the system junk to be crunched by half, not by 10%.

I just wrote all but 1 GB with your dd /dev/zero trick, and df showed 0% free. It also wrote rather slowly at 52.4 MB/s (without reading anything), not the 68 MB/s reported by hdparm -t /dev/sdc. It must have been stuffing nulls into every last little hole. Evidence of fragmentation, or perhaps just writing to the slower, inner cylinders?

I still have a total backup and will just nuke the failed compression attempts. At $0.10/GB, it's a quarter's worth of disk space. I was just curious as to why it would not compress. I have a pipe dream of being able to fit my system images on one DVD. I will try compressing another image at the next kernel update, after using this cool procedure again.

Brilliant!!

Thanks 1E6,

BrianP

P.S. Final score between Bzip2 and 7Zip at max compression:
-rw-r--r-- 1 root root 18428172417 2009-09-02 21:38 vulcan.sys.sdc3.20090902.dd.bz2
-rw-r--r-- 1 root root 18336066846 2009-09-03 20:37 vulcan.sys.sdc3.20090902.dd.zip
7Zip wins in size by 0.5%!
About 9 hours for 7Zip, ?? (but similar) for Bzip2.
 
Old 09-04-2009, 07:36 PM   #4
brianpbarnes
Difficulty compressing raw DD data.

Hi,
>> I'm not entirely sure why You shouldn't be able to compress it more <<

Perhaps it is the cost of looking for patterns within 18 GB vs. looking for patterns in a single .c or .h file?

With any zip program, there is a limit to how many detected patterns can be stored in the available memory. Even during gargantuan compression jobs, the memory used is not linearly proportional to the object size; if it were, real and virtual memory size would limit how large an object a machine could compress. I think that, given enough time, any machine can [try to] compress anything.

Compressing individual files rather than huge, raw devices, you look for patterns in vastly less data. With a finite pattern hash and a small data set, you get a better hit rate and, therefore, higher compression.

Compressing a kernel C file, you find scads of identical chunks of text which can be reduced to a single symbol. Some HTML and XML files can be compressed 90%+. Looking for patterns in JPG files, on the other hand, frequently backfires and creates an archive larger than the original.
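A tiny experiment along those lines, with made-up file names and only 100 MB each so it finishes quickly:

Code:
# highly patterned data vs. patternless data
dd if=/dev/zero    of=/tmp/zeros.bin bs=1M count=100
dd if=/dev/urandom of=/tmp/junk.bin  bs=1M count=100
bzip2 -9 -k /tmp/zeros.bin /tmp/junk.bin
ls -l /tmp/zeros.bin.bz2 /tmp/junk.bin.bz2
# the zeros shrink to a few kB; the random junk stays essentially the same size (or grows a touch)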

The more symbols you have, the longer it takes to search for one, and the response time can grow geometrically. Hash tables, for instance, suffer when they frequently have to rehash because different keys collide on the same value. As the algorithm gets hairier, search times can increase radically.


So:
= In large objects, more symbols can be found than stored.
= Response times increase with an increasing number of discovered patterns.
= Pattern tables which overflow physical memory and spill over into virtual memory slow things down by orders of magnitude.
= Having garbage filling half of your data space can only confound compression.
= In file-by-file compression, whether or not to compress a file can largely be decided by its extension: .c yes, .jpg no.

>>=-> Very large, raw objects are harder to compress than smaller, structured ones.

Removing the noise of deleted data and replacing it with highly compressible NULLs should result in a 2:1 compression from 18 to 9 GB. That's my prediction. I will report back to close this thread after the next kernel update.

Fascinating,

BrianP
 
Old 09-04-2009, 08:03 PM   #5
Elv13
Just compressing the filesystem contents with tar -cjpf normally gives a better file size than using dd + bzip2. It does the same thing: it keeps permissions, symlinks and files. It can restore boot partitions without any trouble and scales well if you resize your partition (dd will not). It just works, even for the / partition.

tar -xpf will correctly extract the partition contents if you cd into the mount point first.
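Roughly like this, booted from a live CD with the root partition mounted at /mnt/root and the archive going to a data drive (the paths are only examples):

Code:
# back up the whole root filesystem, preserving permissions and symlinks
cd /mnt/root
tar -cjpf /dat/rootfs.tar.bz2 .

# restore onto a freshly formatted and mounted partition
cd /mnt/root
tar -xjpf /dat/rootfs.tar.bz2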
 
Old 09-05-2009, 10:22 AM   #6
brianpbarnes
Which is faster, tar or dd?

Elv,
The reason I use dd is that it is so darn fast and there is no monkeying around with links, time stamps, permissions, ownership, ... I just installed a new 1 TB drive, so disk space is a non-issue. I was puzzled about why the image was so incompressible purely out of scientific curiosity.

While the image is being made, my main server is off the air and I am waiting impatiently. I don't bother to compress it right then; I wait for a full boot (not a live CD boot) so I can compress in the background while all of my servers are running.

Dd gives you very nearly hardware speed, but tar, without compressing, would only have to move half as much data in my case (df shows ~50% free).

It would be an interesting test of raw write speed.
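The head-to-head I have in mind would look roughly like this (device and mount point are just placeholders):

Code:
# raw device copy
time dd if=/dev/sdc3 of=/dat/root.dd bs=4M

# file-level copy of the same data, no compression
time tar -cpf /dat/root.tar -C /mnt/root .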

I'll do both the next time I reboot. Thanks for the incantations.

BrianP
 
Old 09-05-2009, 10:09 PM   #7
Elv13
The problem with tar is threading: it runs in only one thread. If it were properly multithreaded, it would be as fast as dd. I tried that on a dual-Xeon (i7-based) machine: by splitting /usr into 3 files and everything else into one, I got performance similar to dd and much smaller files. And that without rebooting (I excluded the folder where the images are stored from the backup, for obvious reasons).
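Roughly what I mean, with the tree split into a few archives that compress in parallel (the directory split and paths are just an example):

Code:
# one bzip2 process per archive keeps several cores busy at once
tar -cjpf /dat/usr.tar.bz2 -C / usr &
tar -cjpf /dat/var.tar.bz2 -C / var &
# everything else, skipping the pieces archived above plus pseudo-filesystems and the archive directory itself
tar -cjpf /dat/rest.tar.bz2 -C / --exclude=./usr --exclude=./var --exclude=./dat --exclude=./proc --exclude=./sys . &
wait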
 
  

