I feel like the second paragraph doesn’t quite capture the main idea, especially the first sentence. It’s not just that mesa optimizers aren’t the only way that a system with good training performance can fail in deployment—that much is trivial. It’s that if the incentives reward misaligned mesa-optimizers, then they very likely also reward inner agents with essentially the same behavior as the misaligned mesa-optimizers but which are not “trying” to game the bad incentives.
The interesting takeaway is that the possibility of deceptive inner optimizers implies nearly-identical failures which don’t involve any inner optimizers. It’s not just “systems without inner optimizers can still be misaligned”, it’s “if we just get rid of the misaligned inner optimizers in a system which would otherwise have them, then that system can probably still stumble on parameters which result in essentially the same bad behavior”. Thus the idea that the “real problem” is the incentives, not the inner optimizers.
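To make that concrete, here's a minimal toy sketch (entirely my own made-up setup, not from the post): a selection rule that only sees training reward can't distinguish a situationally-aware inner optimizer from a plain conditional with the exact same input/output behavior, so filtering out the former leaves the selection pressure for the latter intact.

```python
# Toy illustration only -- the environments, policies, and reward here are
# invented for this sketch, not taken from the post.

TRAIN_SET = [{"novel_feature": 0, "x": i} for i in range(5)]
DEPLOY_SET = [{"novel_feature": 1, "x": i} for i in range(5)]

ALIGNED, MISALIGNED = "aligned_action", "misaligned_action"

def training_reward(obs, action):
    # The incentive we actually select on: it only ever sees training inputs.
    return 1.0 if action == ALIGNED else 0.0

def deceptive_mesa_optimizer(obs):
    # Caricature of an inner optimizer: models its situation and "chooses"
    # to look aligned whenever it believes it is still being trained.
    believes_in_training = (obs["novel_feature"] == 0)
    return ALIGNED if believes_in_training else MISALIGNED

def heuristic_twin(obs):
    # No search, no goals, no situational awareness -- just a conditional
    # that happens to implement the exact same input/output map.
    return ALIGNED if obs["novel_feature"] == 0 else MISALIGNED

def train_score(policy):
    return sum(training_reward(obs, policy(obs)) for obs in TRAIN_SET)

def deploy_failures(policy):
    return sum(policy(obs) == MISALIGNED for obs in DEPLOY_SET)

for name, policy in [("mesa-optimizer", deceptive_mesa_optimizer),
                     ("heuristic twin", heuristic_twin)]:
    print(name, "| train score:", train_score(policy),
          "| deployment failures:", deploy_failures(policy))
# Both get a perfect training score, so the incentives reward them equally;
# excluding the "optimizer" from the search space still leaves its twin.
```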
Changed second paragraph to:

This post suggests that in any training setup in which mesa optimizers would normally be incentivized, it is not sufficient to just prevent mesa optimization from happening. The fact that mesa optimizers could have arisen means that the incentives were bad. If you somehow removed mesa optimizers from the search space, there would still be a selection pressure for agents that, without any malicious intent, end up using heuristics that exploit the bad incentives. As a result, we should focus on fixing the incentives, rather than on excluding mesa optimizers from the search space.
How does that sound?
Ok, that works.