Oh whoops, my bad. Replace “intent alignment” with “corrigibility” there. Specifically, the thing I disagree with is:
Corrigibility is an instrumental strategy for inducing non-obstruction in an AI.
As with intent alignment, I also think corrigibility gets you more than non-obstruction.
(Although perhaps you just meant the weaker statement that corrigibility implies non-obstruction?)
I think this framework also helps motivate why intent alignment is desirable: for a capable agent, the impact alignment won't depend as much on the choice of environment. We're going to have uncertainty about the dynamics of the 2-player game we use to abstract and reason about the task at hand, but intent alignment would mean that doesn't matter as much. This is something like "to reason using the AU landscape, you need fewer assumptions about how the agent works as long as you know it's intent aligned."
But this requires stepping up a level from the model I outline in the post, which I didn’t do here for brevity.
I think I agree with all of this, but I feel like it’s pretty separate from the concepts in this post? Like, you could have written this paragraph to me before I had ever read this post and I think I would have understood it.
(Here I’m trying to justify my claim that I don’t expect the concepts introduced in this post to be that useful in non-EU-maximizer risk models.)
(Also, my usual mental model isn’t really ‘EU maximizer risk → AI x-risk’, it’s more like ‘one natural source of single/single AI x-risk is the learned policy doing bad things for various reasons, one of which is misspecification, and often EU maximizer risk is a nice frame for thinking about that’)
Yes, I also am not a fan of “misspecification of reward” as a risk model; I agree that if I did like that risk model, the EU maximizer model would be a nice frame for it.
(If you mean misspecification of things other than the reward, then I probably don’t think EU maximizer risk is a good frame for thinking about that.)
As with intent alignment, I also think corrigibility gets you more than non-obstruction. (Although perhaps you just meant the weaker statement that corrigibility implies non-obstruction?)
This depends on what corrigibility means here. As I define it in the post, you can correct the AI without being manipulated; corrigibility gets you non-obstruction at best, but it isn't sufficient for non-obstruction:
… the AI moves so fast that we can’t correct it in time, even though it isn’t inclined to stop or manipulate us. In that case, corrigibility isn’t enough, whereas non-obstruction is.
If you’re talking about Paul-corrigibility, I think that Paul-corrigibility gets you more than non-obstruction because Paul-corrigibility seems like it’s secretly just intent alignment, which we agree is stronger than non-obstruction:
Paul Christiano named [this concept] the “basin of corrigibility”, but I don’t like that name because only a few of the named desiderata actually correspond to the natural definition of “corrigibility.” This then overloads “corrigibility” with the responsibilities of “intent alignment.”
As I define it in the post, you can correct the AI without being manipulated; corrigibility gets you non-obstruction at best, but it isn't sufficient for non-obstruction
I agree that fast-moving AI systems could lead to not getting non-obstruction. However, I think that as long as AI systems are sufficiently slow, corrigibility usually gets you “good things happen”, not just “bad things don’t happen”—you aren’t just not obstructed, the AI actively helps, because you can keep correcting it to make it better aligned with you.
For a formal version of this, see Consequences of Misaligned AI (to be summarized in the next Alignment Newsletter), which proves that, in a particular model of misalignment, there is a human strategy where your-corrigibility + slow movement guarantees that the AI system leads to an increase in utility, and your-corrigibility + impact regularization guarantees reaching the maximum possible utility in the limit.
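For intuition only (this is not the paper's actual model, just a made-up toy sketch): suppose the AI hill-climbs a misspecified linear proxy of human utility, and corrigibility means the human can periodically overwrite that proxy with a nudge toward their true utility. With small AI steps ("slow movement"), corrections land before the AI drifts far, so true utility trends upward rather than merely avoiding obstruction. All names and dynamics here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5
w_true = rng.normal(size=dim)    # human's true (linear) utility weights
w_proxy = rng.normal(size=dim)   # AI's misspecified proxy weights
x = np.zeros(dim)                # world state, as a feature vector

step = 0.05                      # "slow movement": small AI steps per tick
utilities = []
for t in range(200):
    # AI ascends its current proxy objective
    x += step * w_proxy / np.linalg.norm(w_proxy)
    # every 10 ticks, the (corrigible) AI accepts a human correction
    # that pulls the proxy halfway toward the true utility
    if t % 10 == 0:
        w_proxy += 0.5 * (w_true - w_proxy)
    utilities.append(w_true @ x)

# slow movement + repeated correction: true utility ends higher than it began
assert utilities[-1] > utilities[0]
```

Removing either ingredient breaks the story: with large `step`, the AI covers a lot of ground under a bad proxy between corrections; with no corrections, it just walks in a random direction forever.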
I agree that fast-moving AI systems could lead to not getting non-obstruction. However, I think that as long as AI systems are sufficiently slow, corrigibility usually gets you “good things happen”, not just “bad things don’t happen”—you aren’t just not obstructed, the AI actively helps, because you can keep correcting it to make it better aligned with you.
I think I agree with some claim along the lines of "corrigibility + slowness + [certain environmental assumptions] + ??? ⇒ non-obstruction (and maybe even robust weak impact alignment)", but the "???" might depend on the human's intelligence, the set of goals which aren't obstructed, and maybe a few other things. So I'd want to think very carefully about what these conditions are, before supposing the implication holds in the cases we care about.
I agree that that paper is both relevant and suggestive of this kind of implication holding in a lot of cases.
Yeah, I agree with all of that.