It isn’t the thing that the KL divergence is measuring; it’s an analogy for it. The KL divergence measures an amount of informational entropy, and strictly speaking, zipping a file has no effect in those terms.
However, we can take those examples more or less intact and restate them in informational-entropy terms; the third gets a little weird in the translation.
So, keeping the intuition for what the ZIP file does, here are the equivalent “examples”:
Example 1: KLE(Reference optimizer output stage, ineffective optimizer output) is 0; KLE(Reference final stage, ineffective optimizer final stage) is also 0. Not appropriate as a way of thinking about this, but helpful to frame the next two examples.
Example 2: KLE(Reference optimizer output stage, antieffective optimizer output) is -N; KLE(Reference final stage, antieffective optimizer final stage) is 0. Supposing an antieffective optimizer and an optimizer can both exist such that the later optimizer optimizes away the inefficiency introduced by the antieffective one, we may observe a violation of the conclusion that, for a chain of N steps, adding an optimizer cannot leave the KL distance at step N+1 greater than the KL distance at step N, relative to their reference cases.
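As a minimal sketch of that shape (made-up numbers; I’m scoring each stage by the entropy of a uniform distribution over its reachable states as a stand-in for the KLE notation, so the sign is flipped relative to the -N above, with positive meaning extra entropy versus the reference):

```python
import math

def bits(num_states: int) -> float:
    """Entropy, in bits, of a uniform distribution over num_states states."""
    return math.log2(num_states)

# Reference pipeline: 256 states reach the downstream optimizer, which halves them.
reference_mid = 256
reference_final = reference_mid // 2   # downstream optimizer: 1 bit of optimization

# Antieffective pipeline: the first "optimizer" doubles the state count (adds a bit
# of entropy), and the downstream optimizer happens to have enough slack to undo it.
anti_mid = 256 * 2
anti_final = anti_mid // 4             # downstream optimizer removes 2 bits here

print(bits(anti_mid) - bits(reference_mid))      # 1.0 bit worse at the output stage
print(bits(anti_final) - bits(reference_final))  # 0.0 at the final stage
```

The mid-stage measurement moves away from the reference while the final one doesn’t, because the downstream optimizer had enough slack to absorb the damage; that divergence-then-reconvergence is the shape Example 2 points at.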
Example 3: The central conceit to this example is harder to translate; the basic idea is that one optimization can make a future optimization more efficient. Critically for this, the future optimization can vary in effectiveness depending on the input parameters. Suppose, for a moment, a state-reduction algorithm on a set of states n, using something like the opening example:
For n < 64, each state n maps to the state ceiling(n / 4) [2 bits of optimization]
For n > 64, each state n maps to the state ceiling(n / 2) [1 bit of optimization]
Then, for a probability distribution such that the number of states is greater than 64 but less than 128, a precursor optimization which halves the number of states (1 bit of optimization) will create an additional bit of optimization at this second optimizer.
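A quick sketch of that threshold effect, under the reading that n is the number of states in the distribution (the rule can also be read per-state; this is the reading that matches the 64-to-128 condition), with 100 states as an arbitrary starting point:

```python
import math

def second_optimizer(n: int) -> int:
    """The threshold state-reduction rule above (n == 64 isn't specified; lumped into the upper branch)."""
    if n < 64:
        return math.ceil(n / 4)   # 2 bits of optimization
    return math.ceil(n / 2)       # 1 bit of optimization

def bits_removed(before: int, after: int) -> float:
    return math.log2(before) - math.log2(after)

n = 100  # somewhere between 64 and 128

# Second optimizer alone: only the weaker branch applies.
alone = second_optimizer(n)
print(bits_removed(n, alone))                          # ~1 bit

# With a precursor that halves the state count first (1 bit of optimization),
# the second optimizer lands in its stronger branch.
halved = math.ceil(n / 2)
print(bits_removed(halved, second_optimizer(halved)))  # ~2 bits at the second optimizer
print(bits_removed(n, second_optimizer(halved)))       # ~3 bits for the chain overall
```

The precursor’s single bit buys an extra bit at the second optimizer, which is the non-additivity the example is after.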
Or, for a very contrived example, the first “optimizer” does minimal optimization, and mostly just encodes a control sequence into the output which enables the second optimizer to actually optimize.
There’s something here about process-entropy versus data-entropy that I think merits examination, though.
[Also, an observation: Arithmetic is an optimizer. Consider the number of input states and the number of output states of addition.]
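To put a number on that: counting states for addition over two 4-bit operands (my own toy example, not anything from the parent):

```python
import math

# Pairs of 4-bit numbers: 16 * 16 = 256 input states.
inputs = {(a, b) for a in range(16) for b in range(16)}
# Their sums only range over 0..30: 31 output states.
outputs = {a + b for a, b in inputs}

print(len(inputs), len(outputs))              # 256 31
print(math.log2(len(inputs) / len(outputs)))  # ~3.05 bits of state reduction
```

Many distinct input states get funneled into far fewer output states, which is the sense in which the operation “optimizes”.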