My problem is this: A is defined as the output of the optimizer, and M0 is defined as A, so P(A|ref) is central to the entire inequality. But what is the output of an optimizer when there is no optimizer? The given examples (Daniel’s and John’s) both gloss over the question of P(A|ref) and implicitly treat it as uniform over the choices the optimizer could have made. In the box-with-slots examples, what happens if there is no optimizer? I don’t know.
In the MMO example, what is the output without a player-optimizer? I don’t think it’s a randomly chosen string of 10,000 input bits. No MMO I’ve ever played chooses random actions if you walk away from it. Yet Daniel’s interpretation assumes that this is the distribution. Under any other reference, a player who chooses the least likely reference output can beat the bounds in Daniel’s answer. That is, his example makes it clear that the bits of optimization applied by the player do not correspond to bits of input unless the reference is a randomly chosen string of inputs. And in that case, the bound feels trivial and uninsightful. If every possible output I can choose has probability p of happening without me, then the output I do choose had probability p by definition. And the outcomes I selected will then always have had at least probability p of being reached without me (plus whatever probability comes from other outputs that lead to the same outcomes). No math needed to make me believe that!
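To make that concrete, here is a toy sketch that is not taken from the post or from Daniel’s answer. It assumes the quantity being bounded is the KL divergence between the optimized and reference distributions over A, shrinks the input to 3 bits, and uses a made-up `skewed_ref` to stand in for a game that mostly does one thing when left alone.

```python
import math

def kl_bits(p, q):
    """D_KL(p || q) in bits; infinite if p puts mass where q has none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total

n_bits = 3                 # toy stand-in for the 10,000 input bits
n_actions = 2 ** n_bits

# The player deterministically picks one input string (index 7).
opt = [0.0] * n_actions
opt[7] = 1.0

# Reference 1: inputs are uniformly random when nobody is playing.
uniform_ref = [1 / n_actions] * n_actions

# Reference 2 (hypothetical): an idle game almost always emits input 0.
skewed_ref = [0.93] + [0.01] * (n_actions - 1)

uniform_bits = kl_bits(opt, uniform_ref)
skewed_bits = kl_bits(opt, skewed_ref)

print(f"input bits:               {n_bits}")
print(f"bits vs uniform reference: {uniform_bits:.2f}")
print(f"bits vs skewed reference:  {skewed_bits:.2f}")
```

Under the uniform reference the bits of optimization equal the bits of input. Under the skewed reference, picking an unlikely input yields more bits of optimization than there are bits of input.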
None of this is an objection to the equation itself, which holds for any choice of P(A|ref). But I think it changes the interpretations given (such as Daniel’s), and without those interpretations I’m not sure that this builds intuition for anything in the way I think it’s trying to. Is “uniformly choose an output” really a useful reference? I don’t think it is useful for intuition. And for the references I would find useful (constant output), the bound becomes trivial (infinite KL divergence). So what is a useful choice?
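A small sketch of why a constant-output reference makes the bound vacuous, again assuming the bound is a KL divergence over A and that the optimizer picks its output deterministically. The helper names and the `eps` smoothing are hypothetical, introduced only to show the limit as the reference approaches a constant.

```python
import math

def bits_for_choice(ref, action):
    """D_KL(point mass on `action` || ref) in bits, i.e. log2(1 / ref[action])."""
    q = ref[action]
    return math.inf if q == 0 else math.log2(1 / q)

def smoothed_constant_ref(n_actions, default, eps):
    """Hypothetical reference: emit `default` with probability 1 - eps,
    otherwise one of the other outputs uniformly at random."""
    ref = [eps / (n_actions - 1)] * n_actions
    ref[default] = 1 - eps
    return ref

n_actions = 8
for eps in (0.1, 0.001, 0.0):
    ref = smoothed_constant_ref(n_actions, default=0, eps=eps)
    same = bits_for_choice(ref, 0)    # optimizer picks the default output
    other = bits_for_choice(ref, 7)   # optimizer picks anything else
    print(f"eps={eps}: default -> {same:.3f} bits, other -> {other:.3f} bits")
```

As `eps` goes to 0 the reference becomes a true constant: choosing the default costs 0 bits and choosing anything else costs infinitely many, so the right-hand side of the bound says nothing useful in either case.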
Good point!
It seems like it would be nice in Daniel’s example for P(A|ref) to be the action distribution of an “instinctual” or “non-optimising” player. I don’t know how to recover that. You could imagine something like an n-gram model of player inputs across the MMO.
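One way the n-gram idea might look, as a minimal sketch and not a claim about how anyone actually did this: fit a bigram model to logged inputs and use it as P(A|ref). The `logged` strings are made up for illustration, and `fit_bigram`, `surprisal_bits`, and the add-one smoothing are all assumptions. For a player who picks an input string deterministically, the bits of optimization on A are just the surprisal of that string under the reference.

```python
import math
from collections import Counter

def fit_bigram(sequences, alphabet, alpha=1.0):
    """Estimate P(next | prev) with add-alpha smoothing from logged inputs."""
    pair_counts = Counter()
    prev_counts = Counter()
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    k = len(alphabet)
    return {
        (prev, nxt): (pair_counts[(prev, nxt)] + alpha)
        / (prev_counts[prev] + alpha * k)
        for prev in alphabet
        for nxt in alphabet
    }

def surprisal_bits(seq, bigram, alphabet):
    """-log2 P(seq | ref), taking the first symbol to be uniform."""
    bits = math.log2(len(alphabet))
    for prev, nxt in zip(seq, seq[1:]):
        bits -= math.log2(bigram[(prev, nxt)])
    return bits

alphabet = "01"
# Made-up logs of "instinctual" play: mostly 0, with an occasional 1.
logged = ["0000000100000000", "0000100000000010"]
ref = fit_bigram(logged, alphabet)

habitual = "00000000"   # looks like typical play
unusual = "11111111"    # very unlike typical play

habitual_bits = surprisal_bits(habitual, ref, alphabet)
unusual_bits = surprisal_bits(unusual, ref, alphabet)
print(f"habitual input: {habitual_bits:.2f} bits of optimization")
print(f"unusual input:  {unusual_bits:.2f} bits of optimization")
```

Both strings are 8 bits of input, but against this reference the habitual one counts for fewer than 8 bits of optimization and the unusual one for more, which is the mismatch between bits of input and bits of optimization raised above.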