I’m somewhat surprised you aren’t really echoing the comment you left at the top of the Google doc wrt separation of concerns.
Reproducing the comment I think you mean here:
As an instrumental strategy we often talk about reducing “make AI good” to “make AI corrigible”, and we can split that up:
1. “Make AI good for our goals” But who knows what our goals are, and who knows how to program a goal into our AI system, so let’s instead:
2. “Make AI that would be good regardless of what goal we have” (I prefer asking for an AI that is good rather than an AI that is not-bad; this is effectively a definition of impact alignment.) But who knows how to get an AI to infer our goals well, so let’s:
3. “Make AI that would preserve our option value / leave us in control of which goals get optimized for in the future” Non-obstructiveness is one way we could formalize such a property in terms of outcomes, though I feel like “preserve our option value” is a better one. In contrast, Paul-corrigibility is not about an outcome-based property, but instead about how a mind might be designed such that it likely has that property regardless of what environment it is in.
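(For reference, the outcome-level condition being gestured at is roughly the following, in my own loose notation: the AI is non-obstructive with respect to a set S of payoff functions if switching it on never lowers the attainable utility of any goal in S relative to leaving it off.)

```latex
% Rough formalization (own notation, not a quote from the post):
% V_P^{on} and V_P^{off} are the attainable utilities for payoff function P
% with the AI switched on vs. switched off.
\[
\text{non-obstruction w.r.t. } S
\;\iff\;
\forall P \in S:\; V_P^{\text{on}}(s) \;\ge\; V_P^{\text{off}}(s)
\]
```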
I suspect that the point about not liking the utility maximization model is upstream of this. For example, I care a lot about the fact that intent-based methods can (hopefully) be environment-independent, and see this as a major benefit; but on the utility maximization model it doesn’t matter.
But also, explaining this would be a lot of words, and still wouldn’t really do the topic justice; that’s really the main reason it isn’t in the newsletter.
Why do you think the concept’s usefulness is predicated on utility maximizers pursuing the wrong reward function? The analysis only examines the consequences of some AI policy.
I look at the conclusions you come to, such as “we should reduce spikiness in AU landscape”, and it seems to me that approaches that do this sort of thing (low impact, mild optimization) make more sense in the EU maximizer risk model than the one I usually use (which unfortunately I haven’t written up anywhere). You do also mention intent alignment as an instrumental strategy for non-obstruction, but there I disagree with you—I think intent alignment gets you a lot more than non-obstruction; it gets you a policy that actually makes your life better (as opposed to just “not worse”).
I’m not claiming that the analysis is wrong under other risk models, just that it isn’t that useful.
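(To make the “spikiness in the AU landscape” vocabulary concrete: the landscape is just the map from goals to attainable utility, compared with the AI on versus off. Below is a toy sketch with made-up numbers and a spikiness measure of my own choosing; it isn’t from the post, it just shows the kind of object the analysis is about.)

```python
# Toy illustration (made-up numbers): the "AU landscape" is the map from goals
# to attainable utility. Non-obstruction w.r.t. a goal set says the AI-on
# landscape never dips below the AI-off landscape on that set; "spikiness" is
# how unevenly the AI-on landscape treats different goals.

# Attainable utility for each goal if the AI is never activated.
au_off = {"make_paperclips": 0.4, "cure_disease": 0.5, "preserve_options": 0.6}

# Attainable utility for each goal with the AI running (hypothetical values).
au_on = {"make_paperclips": 0.95, "cure_disease": 0.1, "preserve_options": 0.2}

def non_obstructive(on, off, goals):
    """True if activating the AI doesn't lower attainable utility for any goal."""
    return all(on[g] >= off[g] for g in goals)

def spikiness(on):
    """One crude measure of spikiness: spread between best- and worst-served goals."""
    return max(on.values()) - min(on.values())

goals = au_off.keys()
print(non_obstructive(au_on, au_off, goals))  # False: most goals are obstructed
print(spikiness(au_on))                       # ~0.85: the AI-on landscape is very spiky
print(spikiness(au_off))                      # ~0.2: the baseline is comparatively flat
```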
For example, I care a lot about the fact that intent-based methods can (hopefully) be environment-independent, and see this as a major benefit; but on the utility maximization model it doesn’t matter.
I think this framework also helps motivate why intent alignment is desirable: for a capable agent, its impact alignment won’t depend as much on the choice of environment. We’re going to have uncertainty about the dynamics of the 2-player game we use to abstract and reason about the task at hand, but intent alignment would mean that doesn’t matter as much. This is something like “to reason using the AU landscape, you need fewer assumptions about how the agent works as long as you know it’s intent aligned.”
But this requires stepping up a level from the model I outline in the post, which I didn’t do here for brevity.
(Also, my usual mental model isn’t really ‘EU maximizer risk → AI x-risk’, it’s more like ‘one natural source of single/single AI x-risk is the learned policy doing bad things for various reasons, one of which is misspecification, and often EU maximizer risk is a nice frame for thinking about that’)
You do also mention intent alignment as an instrumental strategy for non-obstruction, but there I disagree with you—I think intent alignment gets you a lot more than non-obstruction; it gets you a policy that actually makes your life better (as opposed to just “not worse”).
This wasn’t the intended takeaway; the post reads:
Intent alignment: avoid spikiness by having the AI want to be flexibly aligned with us and broadly empowering.
This is indeed stronger than non-obstruction.
Oh whoops, my bad. Replace “intent alignment” with “corrigibility” there. Specifically, the thing I disagree with is:
Corrigibility is an instrumental strategy for inducing non-obstruction in an AI.
As with intent alignment, I also think corrigibility gets you more than non-obstruction.
(Although perhaps you just meant the weaker statement that corrigibility implies non-obstruction?)
I think this framework also helps motivate why intent alignment is desirable: for a capable agent, its impact alignment won’t depend as much on the choice of environment. We’re going to have uncertainty about the dynamics of the 2-player game we use to abstract and reason about the task at hand, but intent alignment would mean that doesn’t matter as much. This is something like “to reason using the AU landscape, you need fewer assumptions about how the agent works as long as you know it’s intent aligned.”
But this requires stepping up a level from the model I outline in the post, which I didn’t do here for brevity.
I think I agree with all of this, but I feel like it’s pretty separate from the concepts in this post? Like, you could have written this paragraph to me before I had ever read this post and I think I would have understood it.
(Here I’m trying to justify my claim that I don’t expect the concepts introduced in this post to be that useful in non-EU-maximizer risk models.)
(Also, my usual mental model isn’t really ‘EU maximizer risk → AI x-risk’, it’s more like ‘one natural source of single/single AI x-risk is the learned policy doing bad things for various reasons, one of which is misspecification, and often EU maximizer risk is a nice frame for thinking about that’)
Yes, I also am not a fan of “misspecification of reward” as a risk model; I agree that if I did like that risk model, the EU maximizer model would be a nice frame for it.
(If you mean misspecification of things other than the reward, then I probably don’t think EU maximizer risk is a good frame for thinking about that.)
As with intent alignment, I also think corrigibility gets you more than non-obstruction. (Although perhaps you just meant the weaker statement that corrigibility implies non-obstruction?)
This depends on what corrigibility means here. As I define it in the post, corrigibility means you can correct the AI without being manipulated; that gets you non-obstruction at best, but it isn’t sufficient for non-obstruction:
… the AI moves so fast that we can’t correct it in time, even though it isn’t inclined to stop or manipulate us. In that case, corrigibility isn’t enough, whereas non-obstruction is.
If you’re talking about Paul-corrigibility, I think that Paul-corrigibility gets you more than non-obstruction because Paul-corrigibility seems like it’s secretly just intent alignment, which we agree is stronger than non-obstruction:
Paul Christiano named [this concept] the “basin of corrigibility”, but I don’t like that name because only a few of the named desiderata actually correspond to the natural definition of “corrigibility.” This then overloads “corrigibility” with the responsibilities of “intent alignment.”
As I define it in the post, corrigibility means you can correct the AI without being manipulated; that gets you non-obstruction at best, but it isn’t sufficient for non-obstruction
I agree that fast-moving AI systems could lead to not getting non-obstruction. However, I think that as long as AI systems are sufficiently slow, corrigibility usually gets you “good things happen”, not just “bad things don’t happen”—you aren’t just not obstructed, the AI actively helps, because you can keep correcting it to make it better aligned with you.
For a formal version of this, see Consequences of Misaligned AI (to be summarized in the next Alignment Newsletter), which proves that, in a particular model of misalignment, (there is a human strategy where) your-corrigibility + slow movement guarantees that the AI system leads to an increase in utility, and your-corrigibility + impact regularization guarantees reaching the maximum possible utility in the limit.
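(The sketch below is not the model from that paper; it is just a toy illustration, under assumptions of my own (linear true utility, a proxy direction that the human moves halfway toward the true direction every few steps), of the qualitative mechanism: when corrections always take effect and the AI moves slowly, true utility never falls far and ends above where it started, whereas a fast-moving AI does substantial damage before the first correction lands.)

```python
# Toy sketch (not the model from the paper): a "corrigible" AI pushes the world
# state along a proxy direction that the human periodically corrects toward the
# true objective. Slow movement means little damage accumulates between
# corrections; fast movement means a deep dip before the first correction lands.
import numpy as np

def run(step_size, correct_every=5, n_steps=60):
    w_true = np.array([1.0, 1.0]) / np.sqrt(2)    # human's true utility direction
    proxy = np.array([1.0, -3.0]) / np.sqrt(10)   # initially misaligned proxy direction
    x = np.zeros(2)                               # world state
    utilities = [float(w_true @ x)]
    for t in range(n_steps):
        if t > 0 and t % correct_every == 0:      # corrigibility: corrections always take effect
            proxy = proxy + 0.5 * (w_true - proxy)
            proxy = proxy / np.linalg.norm(proxy)
        x = x + step_size * proxy                 # AI pushes the state along its current proxy
        utilities.append(float(w_true @ x))
    return min(utilities), utilities[-1]

slow = run(0.05)   # slow AI: small dip, ends above its starting utility
fast = run(1.00)   # fast AI: deep dip before the corrections catch up
print(f"slow AI (step 0.05): min utility {slow[0]:.2f}, final utility {slow[1]:.2f}")
print(f"fast AI (step 1.00): min utility {fast[0]:.2f}, final utility {fast[1]:.2f}")
```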
I agree that fast-moving AI systems could lead to not getting non-obstruction. However, I think that as long as AI systems are sufficiently slow, corrigibility usually gets you “good things happen”, not just “bad things don’t happen”—you aren’t just not obstructed, the AI actively helps, because you can keep correcting it to make it better aligned with you.
I think I agree with some claim along the lines of “corrigibility + slowness + [certain environmental assumptions] + ??? ⇒ non-obstruction (and maybe even robust weak impact alignment)”, but the “???” might depend on the human’s intelligence, the set of goals which aren’t obstructed, and maybe a few other things. So I’d want to think very carefully about what these conditions are, before supposing the implication holds in the cases we care about.
I agree that that paper is both relevant and suggestive of this kind of implication holding in a lot of cases.
Yeah, I agree with all of that.