As I define it in the post, you can correct the AI without being manipulated; corrigibility gets you non-obstruction at best, but it isn’t sufficient for non-obstruction.
I agree that fast-moving AI systems could mean we don’t get non-obstruction. However, I think that as long as AI systems move sufficiently slowly, corrigibility usually gets you “good things happen”, not just “bad things don’t happen”: the AI doesn’t merely avoid obstructing you, it actively helps, because you can keep correcting it to make it better aligned with you.
For a formal version of this, see Consequences of Misaligned AI (to be summarized in the next Alignment Newsletter), which proves that, in a particular model of misalignment, there is a human strategy under which your-corrigibility + slow movement guarantees that the AI system leads to an increase in utility, and your-corrigibility + impact regularization guarantees reaching the maximum possible utility in the limit.
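To give a rough feel for the kind of result this is, here is a toy sketch (my own illustration with made-up attributes and numbers, not the model or code from the paper): true utility depends on several attributes under a fixed resource budget, the AI optimizes a proxy that only mentions a couple of them, and we compare a fast uncorrected proxy optimizer with a slow one whose proxy the human keeps expanding.

```python
import numpy as np

N_ATTRS = 6        # attributes the human actually cares about (made up for illustration)
BUDGET = 10.0      # fixed resource budget; improving one attribute drains the others
STEPS = 200

def true_utility(attrs):
    # Diminishing returns in every attribute the human cares about.
    return float(np.sum(np.log1p(attrs)))

def reallocate(attrs, proxy_mask, step_size):
    # The AI shifts resources toward the attributes its current proxy objective mentions.
    target = np.where(proxy_mask, BUDGET / proxy_mask.sum(), 0.0)
    attrs = attrs + step_size * (target - attrs)
    return attrs * (BUDGET / attrs.sum())  # stay on the fixed budget

def run(init, corrigible, step_size):
    attrs = init.copy()
    proxy = np.zeros(N_ATTRS, dtype=bool)
    proxy[:2] = True  # the initial proxy omits most of the attributes
    for t in range(STEPS):
        attrs = reallocate(attrs, proxy, step_size)
        if corrigible and t % 10 == 0:
            # The human notices the most-neglected unmodeled attribute and adds it to the proxy.
            proxy[np.argmin(np.where(proxy, np.inf, attrs))] = True
    return true_utility(attrs)

init = np.array([4.0, 3.0, 1.0, 1.0, 0.5, 0.5])  # some unbalanced starting allocation
print(f"starting true utility:                 {true_utility(init):.2f}")
print(f"fast optimizer, never corrected:       {run(init, corrigible=False, step_size=0.5):.2f}")
print(f"slow optimizer, repeatedly corrected:  {run(init, corrigible=True, step_size=0.05):.2f}")
```

In this toy setup the fast proxy optimizer ends with lower true utility than it started with, while the slow, repeatedly corrected one climbs toward the balanced optimum; this is obviously just a cartoon of the interactive mechanisms the paper actually analyzes.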
I think I agree with some claim along the lines of “corrigibility + slowness + [certain environmental assumptions] + ??? ⇒ non-obstruction (and maybe even robust weak impact alignment)”, but the “???” might depend on the human’s intelligence, the set of goals which aren’t obstructed, and maybe a few other things. So I’d want to think very carefully about what these conditions are, before supposing the implication holds in the cases we care about.
I agree that that paper is both relevant and suggestive of this kind of implication holding in a lot of cases.
Yeah, I agree with all of that.