
compression - Why does zipping a zipped file not reduce its size?


Based on the idea that a zipped file is a new binary file, why can't I reduce a Zip's size by zipping it again and again – up to a very small resulting file?



Answer




Based on the idea that a zipped file is a new binary file, why can't I reduce its size by zipping it again and again, down to a very small file?



Because compression works by finding patterns and encoding repeated or similar data more compactly.


For example, RLE (run-length encoding) is a simple compression method in which the data is scanned and runs of identical data are collapsed, like so:


AAABCEEEJFFYYYYYYYYYYOOAAAAGGGGGAAA

becomes

3ABC3EJ2F10YOO4A5G3A

As you can see, by replacing repeated data with just the data and a count of how many times it occurs, you can reduce this specific example from 35 bytes down to 20 bytes. That’s not a huge reduction, but it’s still about 43% smaller. Moreover, this is a small, contrived example; larger, real-life examples could compress even better. (The OO was left alone because replacing it with 2O would not save anything. The same is true of FF, which was written as 2F here with no gain or loss.)
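The scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not a production codec: the function name `rle_encode` and the `min_run` threshold are assumptions made for this example. It writes a count only for runs of three or more, so it emits FF where the hand-worked example writes 2F; the result is 20 bytes either way.

```python
from itertools import groupby


def rle_encode(text: str, min_run: int = 3) -> str:
    """Run-length encode text, writing a count only for runs of at least min_run."""
    out = []
    for char, group in groupby(text):
        n = len(list(group))
        # Short runs are copied as-is because "2O" is no shorter than "OO".
        out.append(f"{n}{char}" if n >= min_run else char * n)
    return "".join(out)


original = "AAABCEEEJFFYYYYYYYYYYOOAAAAGGGGGAAA"
encoded = rle_encode(original)
print(encoded)                      # 3ABC3EJFF10YOO4A5G3A
print(len(original), len(encoded))  # 35 20
```

Note that this toy format cannot be decoded unambiguously if the input itself contains digits; real RLE formats reserve a separate field or escape byte for the count.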


Text files often compress really well because they tend to contain a lot of patterns. For example, the word "the" is very common in English, so you could replace every instance of it with an identifier that is just a single byte (or even less). You can also compress further by exploiting parts of words that are shared, like cAKE, bAKE, shAKE, undertAKE, and so on.
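As a rough sketch of the substitution idea, the Python below swaps every occurrence of "the" for a single placeholder character and then restores it. The names `TOKEN`, `substitute`, and `restore` are hypothetical, and the choice of `"\x01"` as the placeholder is an assumption that only works if that character never appears in the input. A real compressor builds its dictionary automatically and stores it (or an equivalent) alongside the data.

```python
TOKEN = "\x01"  # hypothetical one-character stand-in for the word "the"


def substitute(text: str) -> str:
    """Replace every occurrence of 'the' with the one-character token."""
    if TOKEN in text:
        raise ValueError("placeholder already occurs in the input")
    return text.replace("the", TOKEN)


def restore(packed: str) -> str:
    """Undo substitute() by expanding the token back to 'the'."""
    return packed.replace(TOKEN, "the")


sample = "the cat chased the dog around the other side of the garden"
packed = substitute(sample)

# Each replaced occurrence saves two characters, including the one inside "other".
print(len(sample), len(packed))
print(restore(packed) == sample)
```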


So why can’t you compress a file that’s already compressed? Because when you did the initial compression, you removed the patterns.
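This is easy to check with DEFLATE, the method Zip archives commonly use, through Python's standard `zlib` module. The sketch below is only an illustration: the word list and the fixed random seed are assumptions chosen to produce repeatable, pattern-rich input. On input like this the first pass shrinks the data considerably, while the second pass typically comes out about the same size or slightly larger.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
words = ["the", "cake", "bake", "shake", "undertake", "and", "of", "a"]
data = " ".join(rng.choice(words) for _ in range(5000)).encode("ascii")

once = zlib.compress(data, 9)   # first pass: plenty of patterns to exploit
twice = zlib.compress(once, 9)  # second pass: the input is already compressed

print("original:", len(data))
print("compressed once:", len(once))
print("compressed twice:", len(twice))

# Both layers are lossless, so decompressing twice restores the original.
assert zlib.decompress(zlib.decompress(twice)) == data
```

Extremely repetitive input (for example, a long run of zero bytes) is an exception: its first-pass output can still be somewhat repetitive, so a second pass may shrink it a little further before the gains stop.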


Look at the compressed RLE example. How can you compress that further? There are no runs of identical data left to compress. In fact, when you try to compress a file that’s already compressed, you often end up with a larger file. For example, if you forced the above example to be re-encoded, you might end up with something like this:


131A1B1C131E1J121F11101Y2O141A151G131A

Now the compression data (the run counts) is itself being treated as data, so you end up with a larger file than you started with: 38 bytes instead of 20.
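The forced re-encoding can be reproduced with a short Python sketch. The function name `rle_encode_forced` is made up for this example; it writes every run as a count followed by the symbol, even when the run has length 1, which is what "forcing" the encoder amounts to here.

```python
from itertools import groupby


def rle_encode_forced(text: str) -> str:
    """Write every run as count + symbol, even runs of length 1."""
    return "".join(f"{len(list(group))}{char}" for char, group in groupby(text))


once = "3ABC3EJ2F10YOO4A5G3A"  # the already-compressed example
twice = rle_encode_forced(once)

print(twice)                  # 131A1B1C131E1J121F11101Y2O141A151G131A
print(len(once), len(twice))  # 20 38
```

The only run left in the compressed string is OO, so nearly every character picks up a count of 1 and the output almost doubles in size.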


What you could try is a different compression algorithm, because the output of one algorithm might happen to be well suited to another. In practice, however, that is usually unlikely.


Of course, this is all about lossless compression where the decompressed data must be exactly identical to the original data. With lossy compression, you can usually remove more data, but the quality goes down. Also, lossy compression usually uses some sort of pattern-based scheme (it doesn’t only discard data), so you will still eventually reach a point where there are simply no patterns to find.

