As with intent alignment, I also think corrigibility gets you more than non-obstruction. (Although perhaps you just meant the weaker statement that corrigibility implies non-obstruction?)
This depends on what corrigibility means here. As I define it in the post, corrigibility means you can correct the AI without being manipulated. That gets you non-obstruction at best, but it isn’t sufficient for non-obstruction:
… the AI moves so fast that we can’t correct it in time, even though it isn’t inclined to stop or manipulate us. In that case, corrigibility isn’t enough, whereas non-obstruction is.
If you’re talking about Paul-corrigibility, I think that gets you more than non-obstruction, because Paul-corrigibility seems like it’s secretly just intent alignment, which we agree is stronger than non-obstruction:
Paul Christiano named [this concept] the “basin of corrigibility”, but I don’t like that name because only a few of the named desiderata actually correspond to the natural definition of “corrigibility.” This then overloads “corrigibility” with the responsibilities of “intent alignment.”
As I define it in the post, corrigibility means you can correct the AI without being manipulated. That gets you non-obstruction at best, but it isn’t sufficient for non-obstruction
I agree that fast-moving AI systems could mean we don’t get non-obstruction. However, I think that as long as AI systems are sufficiently slow, corrigibility usually gets you “good things happen”, not just “bad things don’t happen”: you aren’t just not obstructed, the AI actively helps, because you can keep correcting it to make it better aligned with you.
For a formal version of this, see Consequences of Misaligned AI (to be summarized in the next Alignment Newsletter), which proves that, in a particular model of misalignment, there is a human strategy under which your-corrigibility + slow movement guarantees that the AI system leads to an increase in utility, and your-corrigibility + impact regularization guarantees reaching the maximum possible utility in the limit.
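To make the qualitative point concrete, here is a minimal toy simulation. It is not the model from Consequences of Misaligned AI; it’s a made-up setup, and every name and parameter in it is an illustrative assumption. A corrigible AI pushes the world toward a misaligned proxy target, the human periodically corrects that target, and the AI’s step size controls how fast it moves.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5
s_true = rng.normal(size=dim)   # state the human actually wants
s_proxy = -s_true               # AI's initial, badly misaligned proxy target

def run(step_size, steps=100, correct_every=10, correction=0.5):
    """Corrigible AI moves the world toward its proxy target while the human
    periodically corrects that target toward s_true."""
    state, target = np.zeros(dim), s_proxy.copy()
    true_utils = []
    for t in range(steps):
        # The AI moves toward its current target, at most step_size per tick.
        delta = target - state
        dist = np.linalg.norm(delta)
        if dist > 1e-9:
            state = state + min(step_size, dist) * delta / dist
        # Corrigibility: every few ticks the human nudges the AI's target toward
        # s_true, and the AI accepts the correction without resisting.
        if (t + 1) % correct_every == 0:
            target = target + correction * (s_true - target)
        true_utils.append(-np.linalg.norm(state - s_true) ** 2)  # human's true utility
    return true_utils

slow = run(step_size=0.05)
fast = run(step_size=2.0)
print(f"slow AI: worst utility {min(slow):.2f}, final {slow[-1]:.2f}")
print(f"fast AI: worst utility {min(fast):.2f}, final {fast[-1]:.2f}")
```

In this toy setup both runs eventually recover, because nothing here is irreversible, but the fast AI does far more damage to the human’s true utility before the corrections land; with any irreversibility, that early loss would be locked in, which is the “moves so fast that we can’t correct it in time” failure mode from the post.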
I agree that fast-moving AI systems could mean we don’t get non-obstruction. However, I think that as long as AI systems are sufficiently slow, corrigibility usually gets you “good things happen”, not just “bad things don’t happen”: you aren’t just not obstructed, the AI actively helps, because you can keep correcting it to make it better aligned with you.
I think I agree with some claim along the lines of “corrigibility + slowness + [certain environmental assumptions] + ??? ⇒ non-obstruction (and maybe even robust weak impact alignment)”, but the “???” might depend on the human’s intelligence, the set of goals which aren’t obstructed, and maybe a few other things. So I’d want to think very carefully about what these conditions are, before supposing the implication holds in the cases we care about.
I agree that that paper is both relevant and suggestive of this kind of implication holding in a lot of cases.
Yeah, I agree with all of that.