Thanks for the thoughtful review! I think this is overall a good read of what I was saying. I agree now that redundancy would not work.
One clarification:
> The mesa-objective that was aligned to our base objective in the original setting is no longer aligned in the new setting.
When I said that the ‘human-level’ AGI is assumed to be aligned, I meant that it has an aligned mesa-objective (corrigibly or internally aligned), not that it has an objective that was functionally aligned on the training distribution but may not remain aligned under distribution shift. I thought that internally/corrigibly aligned mesa-objectives are intent-aligned on all (plausible) distributions by definition.
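To make the distinction concrete, here’s a purely illustrative toy sketch (everything in it, the ‘objectives’, the distributions, and the numbers, is made up by me): an objective can agree with what we want on every training input and still come apart under distribution shift, and that kind of merely functional alignment is exactly what I’m not assuming.

```python
# Purely illustrative toy (both "objectives" and all numbers are invented):
# a proxy can agree with the true objective on every training input and
# still come apart off-distribution.
import numpy as np

rng = np.random.default_rng(0)

def true_objective(x):
    return x                       # what we actually want optimized

def proxy_objective(x):
    return np.where(x < 10, x, 0)  # agrees with true_objective only for x < 10

train_x = rng.uniform(0, 10, size=1000)     # "training distribution"
deploy_x = rng.uniform(10, 100, size=1000)  # "deployment distribution"

for name, xs in [("train", train_x), ("deploy", deploy_x)]:
    agreement = np.mean(np.isclose(true_objective(xs), proxy_objective(xs)))
    print(f"{name}: objectives agree on {agreement:.0%} of samples")
# train: objectives agree on 100% of samples
# deploy: objectives agree on 0% of samples
```

The aligned mesa-objective I have in mind is the analogue of true_objective itself, not the proxy that merely matches it where it was trained.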
> If you already have a mesa-objective fully aligned everywhere from the start, then you don’t really need to invoke the crystallization argument; the crystallization argument is basically about how misaligned objectives can get locked in.
I’m not sure I understand. We might not be on the same page.
Here’s the concern I’m addressing:
Let’s say we build a fully aligned human-level AGI, but we want to scale it up to superintelligence. This seems much harder to do safely than training the human-level AGI in the first place, since you need a training signal that’s better than human feedback/imitation.
Here’s the point I’m making about that concern:
It might actually be quite easy to scale an already-aligned AGI up to superintelligence, even if you don’t have a scalable outer-aligned training signal, because the AGI will be motivated to crystallize its aligned objective.
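To gesture at what ‘motivated to crystallize its aligned objective’ means here, a deliberately silly toy sketch (the Agent class, the objectives, and the numbers are all invented for illustration; it’s not a model of any real training setup): an agent that already has the objective we want will, when choosing among ways to become more capable, prefer the ones that preserve that objective, even when a bigger capability gain is on offer.

```python
# Toy sketch of the "crystallization" intuition (everything here is invented
# for illustration): an agent whose current objective is the one we want
# prefers upgrades that preserve that objective over bigger upgrades that
# would change it.

def aligned_objective(state):
    return -abs(state - 1.0)       # stand-in for the aligned mesa-objective

class Agent:
    def __init__(self, objective, capability=1.0):
        self.objective = objective
        self.capability = capability

    def accepts(self, upgrade):
        # Objective-preservation check: only take upgrades under which the
        # post-upgrade agent would still be pursuing the current objective.
        return upgrade["resulting_objective"] is self.objective

    def apply(self, upgrade):
        if self.accepts(upgrade):
            self.capability *= upgrade["capability_gain"]

agent = Agent(aligned_objective)

upgrades = [
    {"resulting_objective": aligned_objective, "capability_gain": 10},  # preserves the objective
    {"resulting_objective": lambda s: s, "capability_gain": 100},       # would drift the objective
]

for upgrade in upgrades:
    agent.apply(upgrade)

print(agent.capability)  # 10.0 -- only the objective-preserving upgrade was taken
```

Obviously a real training process isn’t an agent picking upgrades from a menu; the sketch is only meant to show why ‘already aligned’ plus ‘prefers to stay aligned’ could stand in for a scalable outer-aligned training signal during the scaling-up step.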