To be clear, I am definitely not arguing for a pure mechanism-design approach to all of AI alignment. The argument in the OP is relevant to inner optimizers because we can’t just directly choose which goals to program into them. We can directly choose which goals to program into an outer optimizer, and I definitely think that’s the right way to go.