Planned summary for the Alignment Newsletter (note that it’s written quite differently from the post, so I may have introduced errors; please check more carefully than usual):
Suppose we trained our agent to behave well on some set of training tasks. <@Mesa optimization@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@) suggests that we may still have a problem: the agent might perform poorly during deployment, because it ends up optimizing for some misaligned _mesa objective_ that only agrees with the base objective on the training distribution.
This post points out that this is not the only way systems can fail catastrophically during deployment: if the incentives were not designed appropriately, they may still select for agents that have learned heuristics that are not in our best interests, but nonetheless lead to acceptable performance during training. This can be true even if the agents are not explicitly “trying” to take advantage of the bad incentives, and thus can apply to agents that are not mesa optimizers.
I feel like the second paragraph doesn’t quite capture the main idea, especially the first sentence. It’s not just that mesa optimizers aren’t the only way that a system with good training performance can fail in deployment—that much is trivial. It’s that if the incentives reward misaligned mesa-optimizers, then they very likely also reward inner agents with essentially the same behavior as the misaligned mesa-optimizers but which are not “trying” to game the bad incentives.
The interesting takeaway is that the possibility of deceptive inner optimizers implies nearly-identical failures which don’t involve any inner optimizers. It’s not just “systems without inner optimizers can still be misaligned”, it’s “if we just get rid of the misaligned inner optimizers in a system which would otherwise have them, then that system can probably still stumble on parameters which result in essentially the same bad behavior”. Thus the idea that the “real problem” is the incentives, not the inner optimizers.
Changed second paragraph to:

This post suggests that in any training setup in which mesa optimizers would normally be incentivized, it is not sufficient to just prevent mesa optimization from happening. The fact that mesa optimizers could have arisen means that the incentives were bad. If you somehow removed mesa optimizers from the search space, there would still be a selection pressure for agents that, without any malicious intent, end up using heuristics that exploit the bad incentives. As a result, we should focus on fixing the incentives, rather than on excluding mesa optimizers from the search space.
How does that sound?
Ok, that works.