954,198 Members — Technology Publication meets Social Media
Username:
Password:
Lost login information?
Have something to say? Contribute New Article Reply to this Article

Does random data compress?

I've been wondering about this for awhile now: does random data compress?

Some people seem to think it does, while others maintain that it doesn't. I mean to mathematically prove that only 1/255 of random files can compress.

Conditions:

Original file size: 32 bytes
Maximum size of "compressed" file: 31 bytes
Number of possible different original files: 1.158 * 10^77
Number of possilbe "compressed" files: let's calculate that

Uhoh, this is gonna be hard to do without summation sybols.... well, ASCII art never hurt anyone.

31_
\
/__ 1.158 * 10^77 * (1/256)^k = 4.540 * 10^74
k=1

(1.158 * 10^77) / (4.540 * 10^74) ~ 1/255

Does this or does this not prove that only 1/255 of random data files can be compressed, and is my math correct?

Toba
Junior Poster
192 posts since Jun 2004
Reputation Points: 115
Solved Threads: 5
 
I've been wondering about this for awhile now: does random data compress? ... Does this or does this not prove that only 1/255 of random data files can be compressed, and is my math correct?

That sum is a lot more complicated than it has to be. It turns out to be equivalent tosum[k=1..31](256^k). Also, it looks like you've transposed the numerator and denominator in the last step, though your result is correct.

You have the ratio for a sequence of 32 bytes; the general solution for the ratio of shorter to full-length sequences is (1/255) - (1/((256^(n-1))*255)). For sequences of very short length, this ratio varies quite a bit, but the second term quickly approaches zero, so for any practical sequence of data (who would want to compress three bytes?), the ratio is 1/255 minus a very small constant.

What has actually been shown is that there are, in general, approximately only 1/255 as many possible shorter sequences, which is more of an upper bound--a given compression algorithm will be able to compress randomly-generated sequences of bytes at most 1/255 of the time, but realistically this will happen even less often.

If you derive the ratio for bits instead of bytes, it is 1-(1/(2^(n-1))), which is much more agreeable. However, on a machine with 8-bit bytes, any 32-byte sequence (for example) that compresses to 31 bytes plus a few bits would have to be padded to 32 bits anyway, effectively losing the compression--so the bit-based ratio doesn't contradict the above math.

I could post the derivation if anyone's interested.
--sg

gusano79
Posting Pro
519 posts since May 2004
Reputation Points: 182
Solved Threads: 77
 

I would have said what you just said if it hadn't been 1:45 AM ;) . Well, I'm gonna try making random data and seeing if winzip will compress it. It will be interesting to see if winzip can do the impossible... :)

Toba
Junior Poster
192 posts since Jun 2004
Reputation Points: 115
Solved Threads: 5
 

Hrm.... truly random data is harder to find than I thought.

Toba
Junior Poster
192 posts since Jun 2004
Reputation Points: 115
Solved Threads: 5
 

Have you tried an encryption routine? They're designed to produce pretty-darn-random numbers so that statistical analisys of the result won't yield clues to the input. There are some industrial strength random number generators, or you can start with something like an SHA-1 one-way hash.

Here's some code to grab:

http://www.codeproject.com/cpp/csha1.asp

If you want some big texts to use as a seed, you could grab some text from project gutenberg (sp?) like 'Romeo and Juliet', then feed that text into the SHA-1 code and poof you have a pretty darn random file!

Chainsaw
Posting Pro in Training
436 posts since Jun 2004
Reputation Points: 36
Solved Threads: 11
 

Thanks for the idea Chainsaw. I encrypted a bitmap with PGP and deleted the header to leave a file of basically random data. Winzip couldn't compress it at all (It got bigger by a couple hundred bytes, actually). I guess random data doesn't compress... I was right. :)

Toba
Junior Poster
192 posts since Jun 2004
Reputation Points: 115
Solved Threads: 5
 
Hrm.... truly random data is harder to find than I thought.

lol.
Winzip doing a pretty good job, huh?

Nuez_Jr
Newbie Poster
18 posts since Oct 2004
Reputation Points: 10
Solved Threads: 1
 

lol Truly random data wont compress if every bit of it is different. The 'break even' case is if at least half of the data bits occurred twice or more. Then RLE or a Look Up method would JUST about reduce the size

:)

1o0oBhP
Posting Pro in Training
445 posts since Dec 2004
Reputation Points: 16
Solved Threads: 6
 

You could always write:

Summation (for k=1 to 31, 1.158 * 10^77 * (1/256)^k = 4.540 * 10^74)

I have used that notation for years. For a continued product, use:

ContProduct (for i=1 to m, i+1)

A limit, and a one-sided limit:

Limit(a goes to b, a^2)

Limit(a goes up to b, a^2)

Real-tiner
Posting Whiz in Training
207 posts since Dec 2004
Reputation Points: 11
Solved Threads: 8
 

Algebraic random numbers are not very random. I usually write random number generators by finding the natural antilog e^x of some value and clipping off the leading 10 digits.

Real-tiner
Posting Whiz in Training
207 posts since Dec 2004
Reputation Points: 11
Solved Threads: 8
 

This article has been dead for over three months

Post: Markdown Syntax: Formatting Help
You