question about checksums and data compression

darinbolson · 10-14-2006, 10:28 PM

Just a thought. If you can take an image file, let's say it's an iso you want to burn on a dvd, and run a checksum on that file, and arrive at one very specific number, why can't the same be done in reverse?
Couldn't you create a program that would take an md5sum text file and build your iso out of that? This would be fantastic data compression! Imagine downloading the latest version of your favorite distro in under a second!

btmiller · 10-15-2006, 12:35 AM

Unfortunately, life doesn't work that way. An MD5 hash is 128 bits, much smaller than most files on a computer. This means that multiple files must share the same MD5 sum. In other words, these hash functions are susceptible to collisions (two files having the same checksum). I believe there are theoretically an infinite number of files with the same MD5 hash (I am not a mathematician though). How is your "reconstructor" program supposed to know which of these infinitely many files to reconstruct?
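To make the collision point concrete, here is a minimal Python sketch. It does not find a real MD5 collision; it assumes a toy hash made from only the first 2 bytes of the MD5 digest (the names short_md5 and find_collision are made up for this illustration). With only 65,536 possible toy hashes, two different inputs are guaranteed to share one within 65,537 tries, and the program cannot tell from the hash alone which input was the "real" one.

```python
import hashlib
from itertools import count


def short_md5(data: bytes, nbytes: int = 2) -> bytes:
    """Toy hash: keep only the first nbytes of the real MD5 digest."""
    return hashlib.md5(data).digest()[:nbytes]


def find_collision(nbytes: int = 2):
    """Hash b"0", b"1", b"2", ... until two inputs share a toy hash."""
    seen = {}  # toy hash -> first input that produced it
    for i in count():
        data = str(i).encode()
        h = short_md5(data, nbytes)
        if h in seen:
            # Two different inputs, one hash: nothing in the hash
            # says which of them it was computed from.
            return seen[h], data, h
        seen[h] = data


a, b, h = find_collision()
print(f"{a!r} and {b!r} both hash to {h.hex()}")
```

The full 128-bit hash has the same problem, just with far more possible values, so collisions are much harder to stumble on but still have to exist.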

Also, getting an MD5 hash from a file is kind of like turning a cow into a bunch of steaks, hamburgers, etc. It's essentially a one way process. You can't take a bunch of hamburger patties and steaks and reconstruct a cow, and the same applies for constructing a file from a hash function.

Actually, that's a bit of an oversimplification. You *can* construct a file with a particular MD5 hash, but AFAIK it requires brute forcing (generating random files and seeing if they match the hash). This takes a long time. For a 128-bit hash, there are 2^128 possible hashes, so on average it will take approximately 2^128 tries to hit one specific hash (2^64 is roughly the number of tries needed to find any two files that collide with each other, which is a different problem). Let's assume we know how big the file is supposed to be, so we only have to generate random files of the correct length. Suppose it takes 1/100th of a second to generate each possibility. If I did my math right, it would take about 10^29 years on average to find a file with the correct MD5 sum (even 2^64 tries would take about 5.8 billion years), and even then, as mentioned above, we don't even know if it will be the right file (because of hash collisions). Kind of makes moot the one second it took to download your favorite distro.
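The arithmetic is easy to check with a few lines of Python. This is only a sketch under the assumption stated above (1/100th of a second per try) plus an assumed year of 365.25 days; years_for is a made-up helper name. It evaluates two exponents: 2^64 tries, which is roughly the birthday bound for finding any two files that collide, and 2^128 tries, which is the expected cost of brute-forcing a match for one specific 128-bit hash.

```python
SECONDS_PER_TRY = 0.01                  # assumption: 1/100th of a second per candidate file
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: average year length


def years_for(tries: int) -> float:
    """Years needed to make the given number of tries, one after another."""
    return tries * SECONDS_PER_TRY / SECONDS_PER_YEAR


for exponent in (64, 128):
    print(f"2^{exponent} tries: about {years_for(2 ** exponent):.2e} years")
```

2^64 tries comes out to roughly 5.8 billion years, and 2^128 tries to roughly 10^29 years. Either way, nobody is waiting around for that download.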

Note: as mentioned above, it's been years since I sat in a graduate-level discrete math class, so the above "back of the envelope" calculation may be horribly wrong. If so, someone with more math smarts than I can hopefully correct it.

Edit to add: I recommend having a look at the MD5 article on Wikipedia for more than you probably ever wanted to know about the MD5 algorithm.

darinbolson · 10-15-2006, 11:37 PM

Ok, so using an md5sum would not be the way to go about it. I still think that there must be a way to rebuild the iso out of a very small file with two parts. One would be something similar to an md5sum, and the other is a list of operations done to it to arrive at that number. Our download time will now be over one second, but I can live with that.