How to effectively compress a 'dd' system disk image formatted to ext3?
Hi,
The only time I reboot my Suse 11.1 server is when a kernel update requires it. When I reboot, I sometimes boot from the Ubuntu 'live cd' because it has dd on it. I cd to /dev/disk/by_id, find the ID of a suitable data drive, make an fstab record for it and mount it.
dd is great in that it runs at very nearly the hardware's maximum speed: my Raptor 74s report 68 MB/s copying a device (my root drive) to a file on a much larger data drive.
df on the slash drive reports 18 GB total and ~9 GB free. dd records the entire 18 GB. When I compress it with:
bzip2 --stdout --compress --best --keep /vter/vulcan.sys.sdc3.20090902.dd > /dat/vulcan.sys.sdc3.20090902.dd.bz2
it only compresses down to 17.2 GB. This is after hours of grinding, even though 9 GB of it is empty space and bzip2 is compressing with --best. Usually, system drives compress to ~50% of the used space. At 9 GB, I should have a good chance of getting the entire image onto a DVD (ideally).
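Since the compressor is the bottleneck anyway, one option (a sketch, not tested on this exact setup; device and output paths are the ones from this thread) is to pipe dd straight into bzip2 and skip the intermediate 18 GB file:

```shell
# Read the raw device and compress in one pass, with no intermediate .dd file.
dd if=/dev/sdc3 bs=1M | bzip2 --best > /dat/vulcan.sys.sdc3.20090902.dd.bz2

# Restoring later just reverses the pipe:
# bzip2 -dc /dat/vulcan.sys.sdc3.20090902.dd.bz2 | dd of=/dev/sdc3 bs=1M
```

This halves the disk traffic, since the uncompressed image is never written out.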
I can only conjecture that my contents are scattered all over the drive with the unused half hidden within the occupied space, defeating the compression process.
Is there a trick to 'defragmenting' an ext3 drive to consolidate its free space? After every ~27 reboots, Suse insists on doing an fsck on a drive. At the end, it always reports 5%-7% fragmentation.
Wikipedia reports the existence of an 'e2defrag' utility which does not work on ext3, and states that there may be an online defrag for ext4. Neither Ubuntu 9.04 nor Suse 11.1 contains any executables containing the word 'defrag'. Most references I can find state clearly that ext file systems do not need defragmentation the way ntfs file systems do.
I found one utility called ShAkE, but it has only 22 reviews and gets a 3/5 rating. The newsgroup articles on it are all from 4+ years ago, too. Is there a better compression utility for system images? And mondoarchive is out of the question, because it rarely produces anything other than a large error file and has never, ever restored an image on my 64-bit hardware.
I am running 7Zip on the image now. It is 22% complete after 2 hours and has written 3.7 GB, so it looks like it will be about 17 GB too, after 9 hours of grinding. Aye Carumba!
date; 7z a -tzip /dat/vulcan.sys.sdc3.20090902.dd.zip /vter/vulcan.sys.sdc3.20090902.dd -mx=9; date
Is there any other explanation other than severe fragmentation which could explain this near total lack of compressibility for a Suse 11.1 system disk formatted to ext3?
Puzzled,
BrianP
Hi, I'm not entirely sure why You shouldn't be able to compress it more; after all, it's just bits and bytes. I don't have any experience with defrag either, as I have never come across a situation where I thought it would be needed.
But one thing I do know is that all Your unused space also contains bits and bytes; they're probably mostly deleted files, really. The inode (file system entry) has been removed, but the data still remains on disk until it is overwritten by some new file.
But You could try this before dd'ing the disk:
Code:
dd if=/dev/zero of=/crap bs=20971520
sync
rm /crap
It will create a file that uses up all Your disk, writing zeros over all the "free" space (the sync makes sure the zeros actually reach the disk before the file is deleted). Once it is done and deleted, all free space will contain the same byte value and will compress much better.
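Put together as one procedure, run from the live CD before imaging, it would look roughly like this (a sketch; the mount point /mnt and the device name are illustrative):

```shell
mount /dev/sdc3 /mnt
dd if=/dev/zero of=/mnt/zero.fill bs=1M   # runs until the disk is full (ENOSPC)
sync                                       # flush the zeros out to the disk
df -h /mnt                                 # should now show ~0% free
rm /mnt/zero.fill
umount /mnt
```

After the umount, dd'ing the device captures free space that is all NULs instead of deleted-file garbage.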
Lakris,
That explains it exactly. It copied all of the junk which the file system lists as available space.
I am still surprised by 2 things, though.
1) The disk has never had more than roughly half of it used. It's a system disk: set it up once and forget it for 3 years, except for software updates. It's hard to believe that I had as much volume of updates as the original DVD install. I do a fresh format with every install, but that may not zero the reused space. Possible.
2) I always use max compression when making system images and I almost always see a ~2:1 compression with N. Ghost, Acronis, Reflect, tar -cz, mondo, etc. I would have expected system junk to be crunched by half, not by 10%.
I just wrote all but 1 GB with your dd /dev/zero trick, and df showed 0% free. It also wrote rather slowly, at 52.4 MB/s (without reading anything), not the 68 reported by hdparm -t /dev/sdc. It must have been stuffing nulls into every last little hole. Evidence of fragmentation, or perhaps just writing to the slower, inner cylinders?
I still have a total backup and will just nuke the failed compression attempts. At $0.10/GB, it's a quarter's worth of disk space. I was just curious as to why it would not compress. I have a pipe dream of being able to compress my system images to fit on one DVD. I will try compressing another image at the next kernel update, after using this cool procedure again.
Brilliant!!
Thanks 1E6,
BrianP
P.S. Final score between Bzip2 and 7Zip at max compression:
-rw-r--r-- 1 root root 18428172417 2009-09-02 21:38 vulcan.sys.sdc3.20090902.dd.bz2
-rw-r--r-- 1 root root 18336066846 2009-09-03 20:37 vulcan.sys.sdc3.20090902.dd.zip
7Zip wins in size by 0.5%!
About 9 hours for 7Zip, ?? (but similar) for Bzip2.
Hi,
>> I'm not entirely sure why You shouldn't be able to compress it more <<
Perhaps it is the cost of looking for patterns within 18 GB vs. looking for patterns in a single .c or .h file?
With any zip program, there is a limited number of detected patterns that can be stored in the available memory. Even during gargantuan zips, the memory used is not linearly proportional to the object size. If it were, the real and virtual memory size would limit how large an object a machine could compress. I think that, given enough time, any machine could [try to] compress anything.
When you compress individual files rather than huge, raw devices, you are looking for patterns in vastly less data. With a finite pattern hash and a small data set, you get a better hit rate and, therefore, higher compression.
Compressing a kernel C file, you find scads of identical chunks of text which can be reduced to a single symbol. Some HTML and XML files can be compressed 90%+. Looking for patterns in JPG files, on the other hand, frequently backfires and creates an archive larger than the original.
The more symbols you have, the longer it takes to search for one, and the response time can grow geometrically. Hash tables, for instance, suffer when they frequently have to rehash due to value collisions from different keys. As the algorithm gets hairier, search times can increase radically.
So:
= In large objects, more symbols can be found than stored.
= Response times increase with an increasing number of discovered patterns.
= Pattern tables which overflow physical memory and spill over into virtual memory slow things down by orders of magnitude.
= Having garbage filling half of your data space can only confound compression.
= In file by file compression, determining whether or not to compress a file can largely be determined by file extension. .C yes, .JPG no.
>>=-> Very large, raw objects are harder to compress than smaller, structured ones.
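The garbage-vs-NULs point is easy to demonstrate on small samples: zeroed data collapses to almost nothing, while random bytes (a stand-in for deleted-file noise) barely shrink at all.

```shell
# 1 MiB of zeros: gzip -9 squeezes it down to roughly a kilobyte.
head -c 1048576 /dev/zero | gzip -9 | wc -c

# 1 MiB of random bytes: the "compressed" output stays around 1 MiB.
head -c 1048576 /dev/urandom | gzip -9 | wc -c
```

The same asymmetry holds for bzip2 and 7-Zip; no dictionary coder can shrink data with no repeating patterns.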
Removing the noise of deleted data and replacing it with highly compressible NULLs should result in a 2:1 compression from 18 to 9 GB. That's my prediction. I will report back to close this thread after the next kernel update.
Just compressing the disk with tar -cjpf normally gives a smaller file than using dd + bzip2. It does the same thing: it keeps permissions, symlinks and files. It can restore boot partitions without any trouble, and it scales well if you resize your partition (dd will not). It just works, even for the / partition.
tar -xpf will correctly unpack the partition contents if you cd into the mount point first.
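Spelled out, that round trip would look roughly like this (a sketch; the mount points and archive path are illustrative, not from the thread):

```shell
# Back up a mounted filesystem, preserving permissions (-p), bzip2-compressed (-j).
cd /mnt/old-root
tar -cjpf /dat/rootfs.tar.bz2 .

# Restore into a freshly formatted, mounted partition.
cd /mnt/new-root
tar -xjpf /dat/rootfs.tar.bz2
```

Because tar only sees live files, the archive is bounded by used space (~9 GB here), not by the 18 GB partition size.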
Elv,
The reason I use dd is that it is so darn fast and there is no monkeying around with links, time stamps, permissions, ownership, ... I just installed a new, 1TB drive so the disk space is a non-issue. I was puzzled as to why the image was so incompressible out of purely scientific curiosity.
My main server is off the air and I am waiting impatiently while this happens. I don't bother to compress it here; I wait for a full boot (not live cd boot) so I can do it in the background while all of my servers are running.
dd gives you very nearly hardware speed, but tar, without compressing, would only move half as much data in my case (df shows 50% free).
It would be an interesting test of raw, write speed.
I'll do both the next time I reboot. Thanks for the incantations.
The problem with tar is threading: it runs in only one thread. If it were properly multithreaded, it would be as fast as dd. I tried that on a dual-Xeon (i7-based) and, by splitting /usr into 3 files and everything else into one, I got performance similar to dd and much smaller files. That was without rebooting (I excluded the folder where the images are stored from the backup, for obvious reasons).
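Another way around the single thread, assuming a parallel compressor such as pigz is installed (a sketch, not from the thread): let tar stream an uncompressed archive and hand only the compression step to all the cores.

```shell
# tar streams the archive; pigz runs the gzip compression on every core.
tar -cpf - /mnt/old-root | pigz -9 > /dat/rootfs.tar.gz

# Restore:
# pigz -dc /dat/rootfs.tar.gz | tar -xpf - -C /mnt/new-root
```

Since the compressor, not tar, is usually the bottleneck, this gets most of the multithreading benefit without splitting the tree into multiple archives.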