Quiz: Why did gzip, bzip2 and lzma fail on this piece of data? What is it?
I had a data file (exactly 64 MiB) on my laptop. I tried to compress it with gzip, bzip2 and lzma (all at their highest compression level), but none of the programs was able to shrink the file. Instead, every one of them increased the size; LZMA created the largest "compressed" file.
But here is the output of "ls -l":
Code:
user@laptop:~> ls data* -l
-rw-r--r-- 1 user users 67108864 Mar 11 12:34 data
-rw-r--r-- 1 user users 67405135 Jul 19 15:10 data.bz2
-rw-r--r-- 1 user users 67119122 Jul 19 15:10 data.gz
-rw-r--r-- 1 user users 68020325 Jul 19 15:09 data.lzma
The file "data" is the original. The suffix tells you what program was used to compress "data". As you can see all "compressed" files are larger than the original.
The questions are now:
- Why did all compression programs fail on this strange data file?
- What type of data is contained in "data"?
The answer is not that hard to find, but also not unique ...
Let's see how long it takes until someone solves this problem ...
Added:
If someone wants to test how his/her favorite compression program performs on this data, just say so.
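For reference, the compressed files above were presumably produced with something along these lines (the exact invocations aren't shown in the thread, so the flags are an assumption):
Code:
# Highest compression level for each tool (flags assumed, not quoted from the post)
gzip  -9 -c data > data.gz
bzip2 -9 -k data
lzma  -9 -k data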
When these programs compress something, they build a "dictionary" in which the most frequent bit streams of length n are mapped to short codes, while statistically infrequent bit streams are mapped to longer ones.
My first guess is that your "data" is "statistically flat", so no further optimization is possible; the dictionary is nevertheless built and appended to the mapped output (so the substitution can be reverted during decompression), which increases the overall size of the "compressed" data.
The data is probably either a program/library (binary stuff) or encrypted stuff (also known to be statistically flat).
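A quick way to see the "statistically flat" effect for yourself (a sketch; /dev/zero stands in for highly redundant data and /dev/urandom for flat data):
Code:
# 1 MiB of highly redundant data compresses to almost nothing ...
dd if=/dev/zero of=zeros bs=1M count=1 2>/dev/null
gzip -9 -c zeros | wc -c

# ... while 1 MiB of "flat" random data comes out slightly LARGER than the input
dd if=/dev/urandom of=random bs=1M count=1 2>/dev/null
gzip -9 -c random | wc -c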
Yeah, but binary data normally contains a lot of structure (many zeros/NOPs for padding and recurring opcodes for the target machine), so you should be able to compress a program quite well.
Furthermore I can tell you that this data is not encrypted.
But with "statistically flat" you're on the right way ...
In fact, this data was just random data. I counted the byte-value frequencies in the file, and all values occurred at essentially the same rate. I do not know whether /dev/urandom generates a white-noise-like signal (I know that a white-noise signal has a uniform spectral density, but I cannot tell whether this applies to the byte stream coming from /dev/urandom).
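One way to do that count with standard tools (a sketch; not necessarily the exact pipeline used):
Code:
# Histogram of byte values 0-255 in "data" (each output line: count  value)
od -An -v -tu1 data | tr -s ' ' '\n' | grep -v '^$' | sort -n | uniq -c | sort -k2,2n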
Is there a way to tell whether this data file is compressible by some algorithm at all? (In the sense of Kolmogorov complexity; I also know that this complexity function cannot be computed except in finitely many cases, but is it possible to give an upper bound on the complexity?)
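One crude sanity check along those lines: compute the zeroth-order byte entropy. It is not the Kolmogorov complexity, but it tells you the best any byte-wise (order-0) coder could do, and the size of any actually compressed version is itself an upper bound on the complexity (up to the size of the decompressor). A sketch, feeding the same histogram as above into awk:
Code:
# Zeroth-order entropy of "data" in bits per byte, plus the implied order-0 size floor
od -An -v -tu1 data | tr -s ' ' '\n' | grep -v '^$' | sort -n | uniq -c \
  | awk '{ n += $1; cnt[NR] = $1 }
         END { for (i in cnt) { p = cnt[i] / n; H -= p * log(p) / log(2) }
               printf "%.4f bits/byte  (order-0 floor: about %.0f bytes)\n", H, H * n / 8 }'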
Also can anybody explain how the random byte stream in /dev/random (not /dev/urandom) is generated in detail?
Although this http://www.schneier.com/paper-twoprime.pdf seems to describe a rather strong pseudorandom number generator, successful theoretical attacks on it have reportedly been found...
@Alexvader: Thanks for the links. There is no "Thanks" button there, so I cannot click on it ...
I looked at drivers/char/random.c in the kernel source, where the /dev/random and /dev/urandom devices are implemented, and this file contains enough information on how the random numbers are generated: they are not pseudo-random numbers, but "true" random numbers, generated from non-deterministic sources (inter-keystroke and inter-interrupt timings) combined with some CRC-like mixing and SHA hashing. Thus I guess that these are much better "random numbers" than a PRNG can generate.
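If anyone wants to poke at this themselves: the kernel exposes its entropy-pool estimate under /proc, and (on kernels of that era) /dev/random blocks when that estimate runs out while /dev/urandom never does. A small sketch:
Code:
# Kernel's current estimate of entropy available in the pool (in bits)
cat /proc/sys/kernel/random/entropy_avail

# /dev/urandom returns immediately; /dev/random may stall until enough
# entropy has been gathered (wiggle the mouse / type something to feed it)
dd if=/dev/urandom of=/dev/null bs=64 count=1
dd if=/dev/random  of=/dev/null bs=64 count=1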