Under the “reward as selection” framing, I find the behaviour much less confusing:
- We use reward to select for actions that led to the agent reaching the coin.
- This selects for models implementing the algorithm “move towards the coin”.
- However, it also selects for models implementing the algorithm “always move to the right”.

It should therefore not be surprising that you can end up with an agent that always moves to the right, not necessarily one that moves towards the coin.
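To make this concrete, here is a minimal toy sketch (a hypothetical 1-D corridor, not the actual CoinRun environment). During training the coin always sits at the right end, so both algorithms collect identical reward and a selection process that only sees reward cannot tell them apart; they only come apart when the coin's position is randomised at test time.

```python
import random

# Hypothetical 1-D stand-in for CoinRun (not the real environment): the agent
# starts in the middle of a corridor and gets reward 1 if it reaches the coin.
LENGTH = 11          # cells 0..10
START = LENGTH // 2  # agent starts in the middle

def run_episode(policy, coin_pos):
    """Roll out one episode; reward 1 iff the agent reaches the coin."""
    pos = START
    for _ in range(LENGTH):
        pos += policy(pos, coin_pos)
        pos = max(0, min(LENGTH - 1, pos))
        if pos == coin_pos:
            return 1
    return 0

# The two algorithms that selection cannot distinguish during training.
def move_towards_coin(pos, coin_pos):
    return 1 if coin_pos > pos else -1

def always_move_right(pos, coin_pos):
    return 1  # ignores the coin entirely

policies = {"towards_coin": move_towards_coin, "always_right": always_move_right}

# Training distribution: the coin is always at the right end of the corridor.
train = {name: sum(run_episode(p, coin_pos=LENGTH - 1) for _ in range(1000))
         for name, p in policies.items()}
print("train reward:", train)   # identical -> reward selects for both equally

# Test distribution: the coin may appear at either end.
random.seed(0)
coins = [random.choice([0, LENGTH - 1]) for _ in range(1000)]
test = {name: sum(run_episode(p, c) for c in coins)
        for name, p in policies.items()}
print("test reward:", test)     # the two behaviours now come apart
```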
I’ve also been reconsidering the CoinRun example from a causal perspective recently, and your articulation helped me crystallize my thoughts. Building on the points above, the core issue seems to be one of causal confusion: the true causal model M is “move right” → “get the coin” → “get reward”. But if the variable “did you get the coin” is effectively latent (because the selection process never discriminates on it), then M is indistinguishable from the model M’: “move right” → “get reward”. M’ is not the true causal model governing the system, yet it generates the same observational distribution.
In fact, the incorrect model M’ has a shorter description length, so there may even be a bias here against learning the true causal model. If so, we have a compelling explanation for the CoinRun phenomenon which does not require the existence of a mesa-optimizer, and which does indicate we should be more concerned about causal confusion.
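To illustrate the indistinguishability claim, here is a rough numpy sketch under an assumed toy data-generating process (the variable names R, C, Y and the probabilities below are my own, purely illustrative): on a training distribution where getting the coin is determined by moving right, the two-edge model M and the one-edge model M’ assign identical probabilities to every observational query over (moved right, reward), while M’ needs half as many parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy stand-in for the CoinRun training distribution (hypothetical): the coin
# always sits at the end of the level, so "got the coin" is determined by
# "moved right".
R = rng.random(n) < 0.7   # R: trajectory moved right
C = R.copy()              # C: got the coin -- effectively latent to selection
Y = C.copy()              # Y: reward; true mechanism is R -> C -> Y

def p(y, given):
    """Empirical P(y = 1 | given)."""
    return y[given].mean()

# Model M  (R -> C -> Y): parameters P(C|R) and P(Y|C), four numbers in total.
p_c1_r1, p_c1_r0 = p(C, R), p(C, ~R)
p_y1_c1, p_y1_c0 = p(Y, C), p(Y, ~C)

# Model M' (R -> Y): a single table P(Y|R), two numbers in total.
p_y1_r1, p_y1_r0 = p(Y, R), p(Y, ~R)

# Marginalising out the latent C, M's observational prediction P(Y=1 | R) is
#   sum_c P(c | R) * P(Y=1 | c)
pred_M_r1 = p_c1_r1 * p_y1_c1 + (1 - p_c1_r1) * p_y1_c0
pred_M_r0 = p_c1_r0 * p_y1_c1 + (1 - p_c1_r0) * p_y1_c0

print("P(Y=1 | R=1):  M =", pred_M_r1, "  M' =", p_y1_r1)
print("P(Y=1 | R=0):  M =", pred_M_r0, "  M' =", p_y1_r0)
# The two models agree on every observational query over (R, Y), while M'
# needs half as many parameters -- a crude description-length advantage for M'.
```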