I’m intrigued by these examples, but I’m not sure the intuition translates. It sounds like you are interpreting “difference of size of file in bits between reference and optimized versions” as the thing the KL divergence is measuring, but I don’t think that’s true. I’m assuming here that the reference is where the first step does nothing and outputs the input file unchanged (effectively just case 1). Let’s explicitly assume that the input file is a randomly chosen English word.
Suppose a fourth case where our “optimizer” outputs the file “0” regardless of input. The end result is a tiny zip file. Under the “reference” condition, the original file is zipped and is still a few bytes, so we have reduced the file size by a few bytes at most. However, the KL divergence is infinite! After all, “0” is not an English word, so its zip never appears in the output distribution of the reference, yet it occurs with probability 1 under our optimizer. So the KL divergence is not at all equal to the number of bits of file size reduced. Obviously this example is rather contrived, but it suffices to show that we can’t directly translate intuition about file sizes to intuition about bits-of-optimization as measured by KL divergence.
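To make the support-mismatch point concrete, here is a minimal sketch in Python (the word list and probabilities are made up for illustration, and the outcome labels just stand in for zipped files):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in bits over a shared discrete outcome space.

    Returns infinity when P puts mass on an outcome that Q assigns
    zero probability -- the situation described above.
    """
    total = 0.0
    for outcome, p_prob in p.items():
        if p_prob == 0:
            continue
        q_prob = q.get(outcome, 0.0)
        if q_prob == 0:
            return math.inf
        total += p_prob * math.log2(p_prob / q_prob)
    return total

# Made-up reference distribution: the zip of a randomly chosen English word.
reference = {"zip(cat)": 1/3, "zip(dog)": 1/3, "zip(axolotl)": 1/3}

# The contrived "optimizer" that always writes the file "0" before zipping:
constant_optimizer = {"zip(0)": 1.0}

print(kl_divergence(constant_optimizer, reference))  # inf
```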
Were you going for a different intuition with these examples?
It isn’t the thing that the KL divergence is measuring; it’s an analogy for it. The KL divergence is measuring an amount of informational entropy; strictly speaking, zipping a file has no effect in those terms.
However, we can carry those examples over more or less intact into informational-entropy terms; the third gets a little weird in the process.
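As a quick check on the claim that zipping changes nothing in those terms: applying the same lossless (hence invertible) recoding to both distributions leaves the KL divergence unchanged, since no probability mass moves. A minimal sketch, using zlib.compress as a stand-in for zip and two made-up distributions:

```python
import math
import zlib

def kl_divergence(p, q):
    """KL(P || Q) in bits for discrete distributions given as dicts over the same keys."""
    return sum(pp * math.log2(pp / q[x]) for x, pp in p.items() if pp > 0)

# Two made-up distributions over the same three files.
p = {b"cat": 0.5, b"dog": 0.3, b"axolotl": 0.2}
q = {b"cat": 0.2, b"dog": 0.3, b"axolotl": 0.5}

# "Zipping" relabels each outcome injectively but moves no probability mass.
p_zipped = {zlib.compress(x): prob for x, prob in p.items()}
q_zipped = {zlib.compress(x): prob for x, prob in q.items()}

print(kl_divergence(p, q))                # ~0.40 bits
print(kl_divergence(p_zipped, q_zipped))  # the same ~0.40 bits
```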
So, having an intuition for what the ZIP file does, the equivalent “examples”:
Example 1: KLE(Reference optimizer output stage, ineffective optimizer output) is 0; KLE(Reference final stage, ineffective optimizer final stage) is also 0. Not appropriate as a way of thinking about this, but helpful to frame the next two examples.
Example 2: KLE(Reference optimizer output stage, antieffective optimizer output stage) is -N; KLE(Reference final stage, antieffective optimizer final stage) is 0. Supposing an antieffective optimizer and a later optimizer can both exist such that the later optimizer optimizes away the inefficiency the antieffective optimizer introduced, we may observe a violation of the conclusion that, over N steps, the KL distance at step N+1 must be less than or equal to the KL distance at step N, relative to their reference cases.
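Reading the “-N” loosely as “N bits of entropy added at that stage”, here is a minimal sketch of the shape of that chain, with a random junk bit as the antieffective step and a later step that strips it (the word list and the single junk bit are made up for illustration):

```python
import math
import random
from collections import Counter

def entropy_bits(samples):
    """Empirical Shannon entropy of a list of outcomes, in bits."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

random.seed(0)
words = ["cat", "dog", "axolotl", "zebra"]
inputs = [random.choice(words) for _ in range(100_000)]     # ~2 bits of entropy

# Antieffective optimizer: append one uniformly random junk bit (+1 bit of entropy).
anti_mid = [(w, random.getrandbits(1)) for w in inputs]

# Later optimizer: strip the junk bit again.
anti_final = [w for w, _ in anti_mid]

print(entropy_bits(inputs))      # ~2.0  (reference, at both stages)
print(entropy_bits(anti_mid))    # ~3.0  (one bit of "anti-optimization")
print(entropy_bits(anti_final))  # ~2.0  (the later step undoes it; the final-stage
                                 #        distribution matches the reference exactly,
                                 #        so the final-stage KL is 0)
```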
Example 3: The central conceit of this example is harder to translate; the basic idea is that one optimization can make a future optimization more efficient. Critically for this, the future optimization can vary in effectiveness depending on its input parameters. Suppose, for a moment, a state-reduction algorithm over a set of states indexed by n, along the lines of the opening example:
For n ≤ 64, each state n maps to the state ceiling(n / 4) [2 bits of optimization]
For n > 64, each state n maps to the state ceiling(n / 2) [1 bit of optimization]
Then, for a probability distribution whose number of states is greater than 64 but less than 128, a precursor optimization which halves the number of states (1 bit of optimization on its own) will create an additional bit of optimization at this second optimizer; a quick sketch of the arithmetic follows below.
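A sketch of that arithmetic, assuming the states are relabeled 1..K and reading the branch condition as applying to the number of states left in the distribution:

```python
import math

def second_optimizer(num_states):
    """Second-stage reduction from the example: a quarter of the states survive
    when there are at most 64 of them, half otherwise (mirroring the
    n <= 64 / n > 64 split above, read as a condition on the state count)."""
    if num_states <= 64:
        return math.ceil(num_states / 4)   # 2 bits of optimization
    return math.ceil(num_states / 2)       # 1 bit of optimization

def precursor(num_states):
    """Precursor optimization: halve the number of states (1 bit on its own)."""
    return math.ceil(num_states / 2)

def bits(before, after):
    return math.log2(before / after)

start = 100                                  # between 64 and 128 states

# Without the precursor, the second optimizer only manages ~1 bit.
print(bits(start, second_optimizer(start)))  # 1.0

# With the precursor, its own 1 bit drops the count below 64,
# so the second optimizer now gets ~2 bits instead of ~1.
mid = precursor(start)
final = second_optimizer(mid)
print(bits(start, mid))                      # 1.0  from the precursor
print(bits(mid, final))                      # ~1.9 at the second optimizer
print(bits(start, final))                    # ~2.9 total
```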
Or, for a very contrived example, the first “optimizer” does minimal optimization, and mostly just encodes a control sequence into the output which enables the second optimizer to actually optimize.
There’s something here about process-entropy versus data-entropy which I think merits examination, however.
[Also, an observation: Arithmetic is an optimizer. Consider the number of input states and the number of output states of addition.]
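That bracketed observation can be made concrete by counting states; a toy count, assuming two uniform 8-bit inputs to the addition:

```python
import math

# Addition as an "optimizer": many distinct input states map to far fewer
# output states. Two uniform 8-bit inputs give 2^16 input pairs, but only
# 511 possible sums.
N = 256
input_states = N * N                                               # 65,536 pairs (a, b)
output_states = len({a + b for a in range(N) for b in range(N)})   # 511 distinct sums

print(output_states)                             # 511
print(math.log2(input_states / output_states))   # ~7.0 "bits of optimization"
```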