It should be noted, however, that while inner alignment is a robustness problem, the occurrence of unintended mesa-optimization is not. If the base optimizer’s objective is not a perfect measure of the human’s goals, then preventing mesa-optimizers from arising at all might be the preferred outcome. In such a case, it might be desirable to create a system that is strongly optimized for the base objective within some limited domain without that system engaging in open-ended optimization in new environments.(11) One possible way to accomplish this might be to use strong optimization at the level of the base optimizer during training to prevent strong optimization at the level of the mesa-optimizer.(11)
I don’t really follow this paragraph, especially the bolded claim that the occurrence of unintended mesa-optimization is not a robustness problem.
Why would mesa-optimisation arising when not intended not be an issue for robustness? (The mesa-optimiser could generalise capably out of distribution but pursue the wrong goal.)
The rest of the post also doesn’t defend that claim; it feels more like defending a claim like:
The non-occurrence of mesa-optimisation is not a robustness problem.