If mesa-optimizers are likely to occur in future AI systems by default, and there turns out to be some way of preventing mesa-optimizers from arising, then instead of solving the inner alignment problem, it may be better to design systems to not produce a mesa-optimizer at all.
I’m not sure I understand this proposal. If you prevent mesa-optimizers from arising, won’t that drastically reduce the capability of the system you’re building (e.g., the resulting policy/model won’t be able to do any kind of sophisticated problem solving to handle problems that don’t appear in the training data)? Are you proposing to instead manually design an aligned optimizer that would be competitive with the mesa-optimizer that would otherwise have been created?
I think it’s still an open question to what extent not having any mesa-optimization would hurt capabilities, but my sense is indeed that mesa-optimization is likely inevitable if you want to build safe AGI which is competitive with a baseline unaligned approach. Thus, I tend towards thinking that the right strategy is to understand that you’re definitely going to produce a mesa-optimizer, and just have a really strong story for why it will be aligned.