I think estimating the probability/plausibility of real-world inner alignment problems is a neglected issue.
However, I don’t find your analysis very compelling.
Number 1 seems to me to approach this from the wrong angle. This is a technical problem, not a social problem, and the social version of the problem seems to have very little in common with the technical version.
Number 2 assumes the AGI is aligned. But inner alignment is a barrier to that. You cannot work from the assumption that we have a powerful AGI on our side when solving alignment problems, unless you’ve somehow separately ensured that that will be the case.
Number 3 isn’t true of current deep learning architectures, e.g., GPT; it seems you’d have to deliberately design a system where that’s the case, and it’s not yet obvious whether that’s a promising route.
Number 4: an AGI could make it to superintelligence precisely by an inner alignment failure. The inner optimizer could solve the inner alignment problem for itself while refusing to be outer-aligned.