Well, first you need to make sure your training procedure isn’t introducing any incentives that would push you away from getting that sort of myopia. Myopic RL with an actually myopic training procedure like a policy gradient algorithm is a good start.
To be clear, I’m claiming that this is the part that breaks. One way to operationalize this: in the coin flip example above, does this training scheme converge to “M reports the truth” in the limit of infinite data, model capacity, exploration, etc.? I would guess that it doesn’t. (In comparison, I think you can prove that self-play converges to the Nash equilibrium for debate, since it is a zero-sum game. And since there are no cycles in the coin flip example, I’d expect you could prove that imitative iterated amplification converges to the truth as well.)
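As a toy illustration of the kind of convergence result I have in mind for zero-sum games, here is a minimal sketch. It is not debate and not the coin flip experiment: it assumes matching pennies as a stand-in game and fictitious play as one specific form of self-play, where what converges is each player's empirical action frequency rather than the last iterate. The name `fictitious_play` and the step count are my own choices for the sketch.

```python
def fictitious_play(payoff, steps=20000):
    """Fictitious play in a two-player zero-sum matrix game.

    payoff[i][j] is the row player's payoff; the column player gets the negation.
    Returns the empirical action frequencies of both players.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    row_counts[0] = col_counts[0] = 1  # arbitrary initial beliefs (an assumption)

    for _ in range(steps):
        # Value of each action against the opponent's empirical play so far.
        row_values = [
            sum(payoff[i][j] * col_counts[j] for j in range(n_cols))
            for i in range(n_rows)
        ]
        col_values = [
            sum(payoff[i][j] * row_counts[i] for i in range(n_rows))
            for j in range(n_cols)
        ]
        # Row player maximizes its payoff; column player minimizes it.
        row_action = max(range(n_rows), key=row_values.__getitem__)
        col_action = min(range(n_cols), key=col_values.__getitem__)
        row_counts[row_action] += 1
        col_counts[col_action] += 1

    row_total, col_total = sum(row_counts), sum(col_counts)
    return (
        [c / row_total for c in row_counts],
        [c / col_total for c in col_counts],
    )


# Matching pennies: the unique Nash equilibrium mixes 50/50 for both players.
matching_pennies = [[1.0, -1.0], [-1.0, 1.0]]
row_mix, col_mix = fictitious_play(matching_pennies)
print(row_mix, col_mix)  # both should be close to [0.5, 0.5]
```

The question for the proposed training scheme is whether any analogous guarantee exists, since the setup there is not obviously zero-sum.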
At some point I might write up some simple code to implement the coin flip experiment with your training scheme and see what happens.