Old 07-19-2010, 03:16 PM   #1
irmin
Member
 
Registered: Jan 2010
Location: the universe
Distribution: Slackware (modified), Slackware64 (modified), openSuSE (modified)
Posts: 342

Rep: Reputation: 62
Quiz: Why did gzip, bzip2 and lzma fail on this piece of data? What is it?


I had a data file (exactly 64 MiB) on my laptop. I tried to compress it with gzip, bzip2, and lzma (all at their highest compression settings), but none of them was able to shrink the file. Instead, each one increased its size, and LZMA produced the largest "compressed" file.

Here is the output of "ls -l":
Code:
user@laptop:~> ls data* -l
-rw-r--r-- 1 user users 67108864 Mar 11 12:34 data
-rw-r--r-- 1 user users 67405135 Jul 19 15:10 data.bz2
-rw-r--r-- 1 user users 67119122 Jul 19 15:10 data.gz
-rw-r--r-- 1 user users 68020325 Jul 19 15:09 data.lzma
The file "data" is the original. The suffix tells you what program was used to compress "data". As you can see all "compressed" files are larger than the original.

The questions are now:
- Why did all compression programs fail on this strange data file?
- What type of data is contained in "data"?

The answer is not that hard to find, but it is not unique, either ...
Let's see how long it takes until someone solves this problem ...

Added:
If anyone wants to test how their favorite compression program performs on this data, just say so.

Last edited by irmin; 07-19-2010 at 03:21 PM.
 
Old 07-19-2010, 03:24 PM   #2
Alexvader
Member
 
Registered: Oct 2009
Location: Japan
Distribution: Arch, Debian, Slackware
Posts: 994

Rep: Reputation: 94
Hi...

When these apps try to compress something, they build a "dictionary" in which the most frequent bit strings of length n are mapped to codes with a tiny representation, while statistically infrequent bit strings are conversely mapped to big representations.

My first guess is that your "data" is "statistically flat", so that no such substitution can save anything; yet the dictionary is still built and appended to the mapped output (so the substitution can be reversed during decompression), which increases the overall size of the "compressed" data.

The data is probably either a program/library (binary stuff) or encrypted (encrypted data is also known to be statistically flat).
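You can see the overhead directly with a quick experiment (a sketch; the 1 MiB size and gzip -9 are arbitrary choices, not taken from this thread):
Code:
# A highly regular stream compresses to almost nothing:
head -c 1048576 /dev/zero | gzip -9 | wc -c

# A statistically flat stream actually grows, because gzip still has to
# emit its header and block bookkeeping on top of the incompressible data:
head -c 1048576 /dev/urandom | gzip -9 | wc -c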

Last edited by Alexvader; 07-19-2010 at 03:26 PM.
 
Old 07-19-2010, 03:33 PM   #3
irmin
Member
 
Registered: Jan 2010
Location: the universe
Distribution: Slackware (modified), Slackware64 (modified), openSuSE (modified)
Posts: 342

Original Poster
Rep: Reputation: 62
Yeah, but binary data normally contains a lot of structure (many zeros/NOPs for padding, and regular opcodes for the target machine), so you should be able to compress a program quite well.

Furthermore, I can tell you that this data is not encrypted.

But with "statistically flat" you are on the right track ...
 
Old 07-19-2010, 05:08 PM   #4
Alexvader
Member
 
Registered: Oct 2009
Location: Japan
Distribution: Arch, Debian, Slackware
Posts: 994

Rep: Reputation: 94
white noise...?

Gaussian noise is not flat...
 
Old 07-20-2010, 12:17 PM   #5
irmin
Member
 
Registered: Jan 2010
Location: the universe
Distribution: Slackware (modified), Slackware64 (modified), openSuSE (modified)
Posts: 342

Original Poster
Rep: Reputation: 62
Ok, I guess this quiz was too easy ....

In fact, the data was just random bytes. I counted the byte-value frequencies in the file, and all 256 values occurred at essentially the same rate. I do not know whether /dev/urandom generates a white-noise-like signal (I know that a white-noise signal has a uniform spectral density, but I cannot tell whether this applies to the byte stream coming out of /dev/urandom).
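For anyone who wants to reproduce the experiment, assuming the file was generated from /dev/urandom as described (exact sizes will vary slightly from run to run):
Code:
# 64 MiB of random bytes:
head -c 67108864 /dev/urandom > data

# Highest compression settings, keeping the original:
gzip  -9 -c data > data.gz
bzip2 -9 -c data > data.bz2
lzma  -9 -c data > data.lzma
ls -l data*

# Byte-value histogram: each of the 256 values should occur roughly
# 67108864 / 256 = 262144 times (slow on 64 MiB, but simple):
od -An -v -tu1 data | tr -s ' ' '\n' | grep -v '^$' | sort -n | uniq -c | head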

Is there a way to tell whether this data file is compressible by some algorithm at all? (In the sense of Kolmogorov complexity; I also know that this complexity function cannot be computed except in finitely many cases, but is it possible to give an upper bound on the complexity?)
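A crude empirical check (not a real Kolmogorov bound, since that is uncomputable) is the byte-level Shannon entropy: a value close to 8 bits/byte means no byte-oriented compressor can gain anything. A sketch:
Code:
# Estimate the Shannon entropy of the byte distribution in "data":
od -An -v -tu1 data | tr -s ' ' '\n' | grep -v '^$' |
awk '{ count[$1]++; n++ }
     END { for (b in count) { p = count[b] / n; H -= p * log(p) / log(2) }
           printf "%.4f bits/byte\n", H }'
Also worth noting: the size of any compressed version of the file (plus the size of the decompressor) is itself an upper bound on its Kolmogorov complexity.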

Also, can anybody explain in detail how the random byte stream of /dev/random (not /dev/urandom) is generated?
 
Old 07-20-2010, 12:31 PM   #6
Alexvader
Member
 
Registered: Oct 2009
Location: Japan
Distribution: Arch, Debian, Slackware
Posts: 994

Rep: Reputation: 94
/dev/urandom is a congruential-class random number generator, although the specifics of its implementation are unknown to me...

Maybe hack into the glibc source code...?

http://en.wikipedia.org/wiki/Linear_congruential_generator

This one is more "random":

http://en.wikipedia.org/wiki/Linear_feedback_shift_register

And although this http://www.schneier.com/paper-twoprime.pdf seems to be a rather strong pseudorandom number generator, successful theoretical attacks against it have been reported...
 
Old 07-20-2010, 12:59 PM   #7
smeezekitty
Senior Member
 
Registered: Sep 2009
Location: Washington U.S.
Distribution: M$ Windows / Debian / Ubuntu / DSL / many others
Posts: 2,339

Rep: Reputation: 231
100% random.
 
Old 07-20-2010, 02:06 PM   #8
irmin
Member
 
Registered: Jan 2010
Location: the universe
Distribution: Slackware (modified), Slackware64 (modified), openSuSE (modified)
Posts: 342

Original Poster
Rep: Reputation: 62
@Alexvader: Thanks for the links. There is no "Thanks" button there, so I cannot click on it ...

I looked at drivers/char/random.c in the kernel source, where the /dev/random and /dev/urandom devices are implemented, and that file contains plenty of information on how the random numbers are generated: they are not pseudo-random numbers but "true" random numbers, generated from non-deterministic sources (inter-keystroke and inter-interrupt timings) combined with some CRC-like mixing and SHA hashing. So I would guess these are much better "random numbers" than any PRNG can generate.
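As a small aside, you can watch the kernel's entropy accounting from userspace (the exact /proc layout can vary between kernel versions):
Code:
# Current estimate of the gathered entropy, in bits:
cat /proc/sys/kernel/random/entropy_avail

# Total size of the input pool (4096 bits on typical 2.6 kernels):
cat /proc/sys/kernel/random/poolsize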
 
  

