Based on the idea that a zipped file is a new binary file, why can't I reduce a Zip's size by zipping it again and again – up to a very small resulting file?
Answer
Because compression works by finding repeating patterns and representing similar data more compactly.
For example, RLE (run-length encoding) is a simple compression method where the data is scanned and runs of identical values are collapsed, like so:
AAABCEEEJFFYYYYYYYYYYOOAAAAGGGGGAAA
becomes
3ABC3EJ2F10YOO4A5G3A
As you can see, by replacing repeated data with just the data and a count of how many times it occurs, this specific example shrinks from 35 bytes down to 20 bytes. That's not a huge reduction, but it's still about 43% smaller. Moreover, this is a small, contrived example; larger, real-life data can compress even better. (The OO was left alone because replacing it with 2O would not save anything.)
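To make the idea concrete, here is a minimal sketch of that kind of run-length encoder in Python. The function name rle_encode and the exact rule are my own; unlike the hand-worked example above, this simplified rule also turns OO into 2O, which happens to give the same 20-byte result.

```python
from itertools import groupby

def rle_encode(data: str) -> str:
    # Simplified run-length encoding: each run of repeated characters
    # becomes "<count><char>"; single characters are copied as-is.
    parts = []
    for char, run in groupby(data):
        count = len(list(run))
        parts.append(f"{count}{char}" if count > 1 else char)
    return "".join(parts)

original = "AAABCEEEJFFYYYYYYYYYYOOAAAAGGGGGAAA"
encoded = rle_encode(original)
print(encoded)                             # 3ABC3EJ2F10Y2O4A5G3A
print(len(original), "->", len(encoded))   # 35 -> 20
```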
Text files often compress really well because they tend to contain a lot of repetitive patterns. For example, the word the is very common in English, so a compressor could replace every instance of it with an identifier that takes only a single byte (or even less). It can also exploit parts of words that are similar, like cAKE, bAKE, shAKE, undertAKE, and so on.
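As a toy illustration of that idea (not a real compressor; the token byte and helper names here are made up for the example), you can replace a common word with a one-byte marker and reverse it losslessly:

```python
TOKEN = b"\x01"   # assumed not to occur anywhere in the input text
WORD = b"the"

def shrink(text: bytes) -> bytes:
    # Replace every occurrence of the common word with the short token.
    return text.replace(WORD, TOKEN)

def expand(packed: bytes) -> bytes:
    # Undo the substitution to get the original text back.
    return packed.replace(TOKEN, WORD)

sample = b"the cat chased the mouse under the table"
packed = shrink(sample)
assert expand(packed) == sample            # lossless round trip
print(len(sample), "->", len(packed))      # 40 -> 34
```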
So why can’t you compress a file that’s already compressed? Because when you did the initial compression, you removed the patterns.
Look at the compressed RLE example. How can you compress that further? There are no runs of identical data left to collapse. In fact, trying to compress a file that's already compressed often leaves you with a larger file. For example, if you forced the above example to be re-encoded, you might end up with something like this:
131A1B1C131E1J121F11101Y2O141A151G131A
Now, the compression data (the run-counts) are themselves being treated like data, so you end up with a larger file than you started with.
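The same effect is easy to see with a real compressor. Here is a quick sketch using Python's zlib module; the exact sizes will vary, but on repetitive input the second pass typically comes out a little larger than the first.

```python
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 1000

once = zlib.compress(data, 9)    # first pass: lots of patterns to exploit
twice = zlib.compress(once, 9)   # second pass: input already looks random

print(len(data))    # 45000
print(len(once))    # a few hundred bytes: huge reduction
print(len(twice))   # usually a bit larger than len(once)
```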
What you could try is using a different compression algorithm, since it's possible that the output of one algorithm happens to suit another one, but in practice that is pretty unlikely to help.
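You can test that claim yourself. Here is a rough sketch comparing Python's zlib, bz2, and lzma modules on data that has already been through one compressor; the sample data is illustrative, and on most inputs the second algorithm gains little or nothing.

```python
import bz2
import lzma
import zlib

payload = b"the quick brown fox jumps over the lazy dog " * 1000
already = zlib.compress(payload, 9)   # first-pass compression

print("input:", len(already))
print("zlib :", len(zlib.compress(already, 9)))
print("bz2  :", len(bz2.compress(already, 9)))
print("lzma :", len(lzma.compress(already)))
```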
Of course, this is all about lossless compression where the decompressed data must be exactly identical to the original data. With lossy compression, you can usually remove more data, but the quality goes down. Also, lossy compression usually uses some sort of pattern-based scheme (it doesn’t only discard data), so you will still eventually reach a point where there are simply no patterns to find.