Stepping into a real-world example, consider a text file and three cases, each illustrating something different:
First case: Entirely ineffective ZIP compression, (some processes), effective ZIP compression. If we treat the ineffective ZIP compression as “the optimizer”, then it is clear that some compression happens later in the sequence of processes; the number of bits of optimization increases. However, the existence or non-existence of the first ineffective ZIP compression has no effect on the number of bits of optimization, so maybe this isn’t quite the right way to think about it.
Second case: Anti-effective ZIP compression, (some processes), effective ZIP compression. The anti-effective ZIP compression, instead of compressing the file, maybe just copies the file twenty times. Then the effective ZIP compression takes that, and possibly compresses it nearly as effectively as in the first case. The existence or non-existence of the anti-effective ZIP compression does matter in this case; in particular, if we compare the state of its output with the final state, the anti-effective ZIP compression creates a wider optimization discrepancy; it would appear to “optimize at a distance.”
Third case: Garbage data cleanup, (some processes), effective ZIP compression. Here we enter an interesting case, because if we treat the garbage data cleanup as our optimizer, it both optimizes the immediate output and, in the event that the garbage is significantly higher entropy than the rest of the data, might make the effective ZIP compression more effective.
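A minimal sketch of the first two cases, using Python’s zlib as a stand-in for ZIP and a short repeated sentence as the “text file” (the exact byte counts are illustrative, not load-bearing):

```python
import zlib

# A small stand-in for the text file.
text = b"the quick brown fox jumps over the lazy dog. " * 10

# Case 1: an entirely ineffective first step just passes the file through.
case1_intermediate = text

# Case 2: an anti-effective first step copies the file twenty times.
case2_intermediate = text * 20

# The later, effective ZIP compression.
case1_final = zlib.compress(case1_intermediate, 9)
case2_final = zlib.compress(case2_intermediate, 9)

print(len(case1_intermediate), len(case1_final))  # 460 -> a few dozen bytes
print(len(case2_intermediate), len(case2_final))  # 9200 -> only slightly larger: the copies are nearly free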
I think the third case really matters here, but in general, once you introduce a second optimizer, you can create emergent optimizations which don’t strictly belong to either optimizer. Additionally, once we admit the existence of emergent optimizations, it may well be possible for an optimizer to optimize “at a distance”, even without a specific second optimization taking place. Consider any given optimization as a sequence of operations, each of which is necessary; if we treat the first operation as “the optimizer”, then the optimization as a whole happens “at a distance”: without the first operation the optimization fails, yet the optimization only actually lands at the final step in the process (given that each step is necessary).
I’m intrigued by these examples, but I’m not sure they translate. It sounds like you are interpreting “difference in file size, in bits, between the reference and optimized versions” as the thing the KL divergence is measuring, but I don’t think that’s true. I’m assuming here that the reference is the case where the first step does nothing and outputs the input file unchanged (effectively just case 1). Let’s explicitly assume that the input file is a randomly chosen English word.
Suppose a fourth case where our “optimizer” outputs the file “0” regardless of input. The end result is a tiny zip file. Under the “reference” condition, the original file is zipped and is still a few bytes, so we have reduced the file size by a few bytes at most. However, the KL divergence is infinite! After all, “0” is not an English word, and so its zip never appears in the output distribution of the reference but occurs with probability 1 under our optimizer. So the KL divergence is not at all equal to the number of bits of file size reduced. Obviously this example is rather contrived, but it suffices to show that we can’t directly translate intuition about file sizes to intuition about bits-of-optimization as measured by KL divergence.
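To make the infinity concrete, here is a minimal sketch with a two-word toy “dictionary” standing in for the distribution over English words (the specific words and probabilities are made up for illustration):

```python
import math

def kl_bits(p, q):
    """KL(p || q) in bits, for distributions given as {outcome: probability} dicts."""
    total = 0.0
    for outcome, p_x in p.items():
        if p_x == 0:
            continue
        q_x = q.get(outcome, 0.0)
        if q_x == 0:
            return math.inf  # p puts mass on an outcome q never produces
        total += p_x * math.log2(p_x / q_x)
    return total

# Reference: zip of a randomly chosen word from a (tiny, toy) dictionary.
reference = {"zip(apple)": 0.5, "zip(zebra)": 0.5}

# Fourth-case "optimizer": always emits the file "0", then zips it.
always_zero = {"zip(0)": 1.0}

print(kl_bits(always_zero, reference))  # inf: "zip(0)" never occurs under the reference
```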
Were you going for a different intuition with these examples?
It isn’t the thing that the KL divergence is measuring; it is an analogy for it. The KL divergence is measuring the amount of informational entropy; strictly speaking, zipping a file has no effect in those terms.
However, we can take those examples more or less intact and place them in informational-entropy terms; the third gets a little weird in the doing, however.
So, having an intuition for what zipping a file does, here are the equivalent “examples”:
Example 1: KLE(Reference optimizer output stage, ineffective optimizer output) is 0; KLE(Reference final stage, ineffective optimizer final stage) is also 0. Not appropriate as a way of thinking about this, but helpful to frame the next two examples.
Example 2: KLE(Reference optimizer output stage, anti-effective optimizer output) is -N; KLE(Reference final stage, anti-effective optimizer final stage) is 0. Supposing an anti-effective optimizer and a later optimizer can both exist such that the later optimizer optimizes away the inefficiency introduced by the anti-effective optimizer, we may observe a violation of the conclusion that, for a chain of steps, the KL divergence after step k+1 must be less than or equal to the KL divergence after step k, relative to their reference cases; adding an optimizer is not supposed to be able to create that situation, but here it appears to.
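Reading the -N loosely as “the anti-effective stage carries N more bits of entropy than the reference at that point” (a literal KL divergence can’t go negative, so I’m treating it as an entropy difference relative to the reference), a minimal sketch with made-up numbers:

```python
import math

def entropy_bits(dist):
    """Shannon entropy, in bits, of a {outcome: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Input: 4 uniformly random bits (16 equally likely states).
inputs = {x: 1 / 16 for x in range(16)}

# Reference pipeline: the first step does nothing, so both the intermediate
# and the final stage look like the input.
reference_mid = inputs
reference_final = inputs

# Anti-effective pipeline: the first step tacks on 2 random bits (N = 2 bits of
# added entropy); the later optimizer strips them back off.
anti_mid = {(x, noise): 1 / 64 for x in range(16) for noise in range(4)}
anti_final = inputs

print(entropy_bits(reference_mid) - entropy_bits(anti_mid))      # -2.0: the "-N" at the intermediate stage
print(entropy_bits(reference_final) - entropy_bits(anti_final))  # 0.0: the discrepancy is gone by the final stage
```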
Example 3: The central conceit of this example is harder to translate; the basic idea is that one optimization can make a future optimization more efficient. Critically for this, the future optimization can vary in effectiveness depending on the input it receives. Suppose, for a moment, a state-reduction algorithm over a set of n states, along the lines of the opening example:
For n < 64, the algorithm maps the n states down to ceiling(n / 4) states [2 bits of optimization]
For n ≥ 64, it maps the n states down to ceiling(n / 2) states [1 bit of optimization]
Then, for a probability distribution such that the number of states is greater than 64 but less than 128, a precursor optimization which halves the number of states (1 bit of optimization) will create an additional bit of optimization at this second optimizer.
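Checking that arithmetic with a minimal sketch, under my reading that the threshold applies to the size of the state set, and assuming a uniform distribution over 100 states:

```python
import math

def second_optimizer(n_states):
    """The second optimizer from the example: a /4 reduction below the threshold, /2 at or above it."""
    if n_states < 64:
        return math.ceil(n_states / 4)  # ~2 bits of optimization
    return math.ceil(n_states / 2)      # ~1 bit of optimization

def bits_of_optimization(before, after):
    """Bits of optimization for a uniform distribution shrinking from `before` to `after` states."""
    return math.log2(before / after)

n = 100  # somewhere between 64 and 128

# Second optimizer alone: 100 -> 50 states.
alone = second_optimizer(n)
print(bits_of_optimization(n, alone))     # 1.0 bit

# With a 1-bit precursor that halves the state count first: 100 -> 50 -> 13 states.
halved = math.ceil(n / 2)
combined = second_optimizer(halved)
print(bits_of_optimization(n, combined))  # ~2.94 bits: the precursor's 1 bit plus ~2 bits at the second optimizer
```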
Or, for a very contrived example, the first “optimizer” does minimal optimization, and mostly just encodes a control sequence into the output which enables the second optimizer to actually optimize.
There’s something here about process-entropy versus data-entropy which I think merits examination, however.
[Also, an observation: Arithmetic is an optimizer. Consider the number of input states and the number of output states of addition.]
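On that aside, a toy count with two decimal digits, just to make the collapse of states visible:

```python
# Addition as an optimizer: many distinct input states map onto far fewer output states.
inputs = [(a, b) for a in range(10) for b in range(10)]
sums = {a + b for a, b in inputs}

print(len(inputs))  # 100 input states
print(len(sums))    # 19 output states (0 through 18)
```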