I've been wondering about this for a while now: does random data compress?

Some people seem to think it does, while others maintain that it doesn't. I intend to prove mathematically that only 1/255 of random files can be compressed.

Conditions:

Original file size: 32 bytes
Maximum size of "compressed" file: 31 bytes
Number of possible different original files: 1.158 * 10^77
Number of possible "compressed" files: let's calculate that

Uh oh, this is gonna be hard to do without summation symbols... well, ASCII art never hurt anyone.

 31
 ___
 \
 /__  1.158 * 10^77 * (1/256)^k  =  4.540 * 10^74
 k=1

(1.158 * 10^77) / (4.540 * 10^74) ~ 1/255
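(Or, for anyone who'd rather let Python grind it out with exact integers -- just a sanity check of the arithmetic, not part of the proof:)

    # Count every possible "compressed" file of 1..31 bytes and compare
    # against the 256**32 possible 32-byte originals.
    originals = 256 ** 32                              # ~1.158 * 10^77
    compressed = sum(256 ** k for k in range(1, 32))   # ~4.540 * 10^74
    print(originals / compressed)   # ~255.0: about 1 in 255 originals can shrink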

Does this or does this not prove that only 1/255 of random data files can be compressed, and is my math correct?

I've been wondering about this for awhile now: does random data compress?
...
Does this or does this not prove that only 1/255 of random data files can be compressed, and is my math correct?

That sum is a lot more complicated than it has to be. It turns out to be equivalent to sum[k=1..31](256^k). Also, it looks like you've transposed the numerator and denominator in the last step, though your result is correct.

You have the ratio for a sequence of 32 bytes; the general solution for the ratio of shorter to full-length sequences is (1/255) - (1/((256^(n-1))*255)). For sequences of very short length, this ratio varies quite a bit, but the second term quickly approaches zero, so for any practical sequence of data (who would want to compress three bytes?), the ratio is 1/255 minus a very small constant.
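(If you want to verify that closed form, here's a quick exact-arithmetic check in Python against a brute-force count; the sample sizes are arbitrary:)

    from fractions import Fraction
    # Closed form vs. direct count of shorter-than-n-byte sequences.
    for n in (2, 3, 4, 8, 32):
        shorter = sum(256 ** k for k in range(1, n))   # all 1..n-1 byte files
        direct = Fraction(shorter, 256 ** n)
        closed = Fraction(1, 255) - Fraction(1, (256 ** (n - 1)) * 255)
        assert direct == closed
        print(n, float(direct))   # climbs toward 1/255 ~ 0.00392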

What has actually been shown is that there are, in general, approximately only 1/255 as many possible shorter sequences, which is more of an upper bound--a given compression algorithm will be able to compress randomly-generated sequences of bytes at most 1/255 of the time, but realistically this will happen even less often.

If you derive the ratio for bits instead of bytes, it is 1-(1/(2^(n-1))), which is much more agreeable. However, on a machine with 8-bit bytes, any 32-byte sequence (for example) that compresses to 31 bytes plus a few bits would have to be padded to 32 bytes anyway, effectively losing the compression--so the bit-based ratio doesn't contradict the above math.
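(The same check works for the bit version, for completeness:)

    # Fraction of n-bit strings that have at least one shorter bit string.
    for n in (2, 4, 8, 16):
        shorter = sum(2 ** k for k in range(1, n))     # all 1..n-1 bit strings
        print(n, shorter / 2 ** n)   # equals 1 - 1/2**(n-1), approaching 1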

I could post the derivation if anyone's interested.
--sg

I would have said what you just said if it hadn't been 1:45 AM. ;) Well, I'm gonna try making random data and seeing if WinZip will compress it. It will be interesting to see if WinZip can do the impossible... :)

Hrm.... truly random data is harder to find than I thought. <grumbles>

Have you tried an encryption routine? They're designed to produce pretty-darn-random numbers so that statistical analysis of the result won't yield clues to the input. There are some industrial-strength random number generators, or you can start with something like an SHA-1 one-way hash.

Here's some code to grab:

http://www.codeproject.com/cpp/csha1.asp

If you want some big texts to use as a seed, you could grab some text from Project Gutenberg, like 'Romeo and Juliet', then feed that text into the SHA-1 code and poof, you have a pretty darn random file!
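(If you'd rather not wire up the C++ class, here's the same idea as a rough Python sketch using the standard library's SHA-1 in counter mode -- the seed file name is made up, use whatever text you grabbed:)

    import hashlib
    # Expand a seed text into pseudo-random bytes by hashing seed + counter.
    seed = open("romeo_and_juliet.txt", "rb").read()   # hypothetical seed file
    out = bytearray()
    counter = 0
    while len(out) < 4096:                             # as many bytes as you want
        out += hashlib.sha1(seed + str(counter).encode()).digest()
        counter += 1
    open("random.bin", "wb").write(out[:4096])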

Thanks for the idea, Chainsaw. I encrypted a bitmap with PGP and deleted the header to leave a file of basically random data. WinZip couldn't compress it at all (it got bigger by a couple hundred bytes, actually). I guess random data doesn't compress... I was right. :)
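(For anyone who wants to reproduce this without PGP and WinZip, a rough equivalent in Python using zlib and the OS's random source:)

    import os, zlib
    # Try to deflate 32 KB of OS-supplied random bytes; expect no savings.
    data = os.urandom(32 * 1024)
    packed = zlib.compress(data, 9)                    # level 9 = max effort
    print(len(data), len(packed))   # packed typically comes out a bit *larger*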

Hrm.... truly random data is harder to find than I thought. <grumbles>

lol.
WinZip's doing a pretty good job, huh?

lol. Truly random data won't compress if every bit of it is different. The 'break even' case is when at least half of the data bits occur twice or more; then RLE or a look-up method would JUST about reduce the size.

:)
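(A quick sketch makes the point about RLE -- on random bytes nearly every run has length 1, so naive RLE roughly doubles the size instead of shrinking it:)

    import os
    # Naive byte-level RLE: emit (count, value) pairs, runs capped at 255.
    def rle(data):
        out = bytearray()
        i = 0
        while i < len(data):
            j = i
            while j < len(data) and data[j] == data[i] and j - i < 255:
                j += 1
            out += bytes([j - i, data[i]])
            i = j
        return out
    data = os.urandom(1024)
    print(len(data), len(rle(data)))   # roughly 1024 vs ~2040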

You could always write:

Summation (for k=1 to 31, 1.158 * 10^77 * (1/256)^k) = 4.540 * 10^74

I have used that notation for years. For a continued product, use:

ContProduct (for i=1 to m, i+1)

A limit, and a one-sided limit:

Limit(a goes to b, a^2)

Limit(a goes up to b, a^2)

Algebraic random numbers are not very random. I usually write random number generators by finding the natural antilog e^x of some value and clipping off the leading 10 digits.
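(One possible reading of that e^x trick, as a Python sketch -- the seed value, the precision, and how many digits to clip are all guesses on my part:)

    from decimal import Decimal, getcontext
    # Compute e^x to high precision, drop the leading 10 digits, keep the rest.
    getcontext().prec = 60                 # 60 significant digits (assumed)
    x = Decimal("1.2345")                  # the "some value" seed (assumed)
    digits = str(x.exp()).replace(".", "")
    print(digits[10:])                     # pseudo-random-looking digit string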
