I wrote Consequentialism & Corrigibility shortly after and partly in response to the first (Ngo-Yudkowsky) discussion. If anyone has an argument or belief that the general architecture / approach I have in mind (see the “My corrigibility proposal sketch” section) is fundamentally doomed as a path to corrigibility and capability—as opposed to merely “reliant on solving lots of hard-but-not-necessarily-impossible open problems”—I’d be interested to hear it. Thanks in advance. :)
After reading some of the newer MIRI dialogues, I’m less convinced than I once was that I know what “corrigibility” actually is. Could you say a few words about what kind of behavior you concretely expect to see from a “corrigible” agent, followed by how you expect those behaviors to fit into the “trajectory-constraining” framework you propose in your post?
EDIT: This is not purely a question for Steven, incidentally (or at least, the first half isn’t); anyone else who wants to take a shot at answering should feel free to do so. In particular I’d be interested in hearing answers from Eliezer or anyone else historically involved in the invention of the term.
My understanding: a corrigible paperclip-maximizer does all the paperclip-maximizing, but then when you realize it’s gonna end the world, you go to turn it off, and it doesn’t stop you. It’s corrigible!
There are a bunch of different definitions, but if you’re asking for Eliezer’s version, then the Arbital exposition is quite good. N.B. we don’t have a model for this sort of corrigibility.
EDIT: Be warned, these are rough summaries of the defs. I’d amend the CHAI def I cited to “the AI obeys more when it knows less, models you as more rational, and the downsides of disobedience are lesser”. But people at CHAI have diverse views, so this is not the definitive CHAI take.
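To gesture at what that amended def is pointing at, here’s a toy expected-value calculation, loosely in the spirit of the off-switch-game setup (my own sketch, with an invented Gaussian belief and a simple noisy-human model, not any paper’s actual formalism; it only illustrates the “knows less” and “more rational” clauses, not the “downsides of disobedience” one): the more uncertain the AI is about how good its proposed action is, and the more reliably the human approves exactly the good actions, the better deferring looks relative to just acting.

```python
# Toy off-switch-style calculation (a sketch, not any paper's actual model).
# The AI believes the value U of its proposed action is Normal(mu, sigma).
# If it defers, the human makes the right allow/veto call (allow iff U > 0)
# with probability p_correct; a vetoed action is worth 0.
import math

def std_normal_pdf(x: float) -> float:
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def std_normal_cdf(x: float) -> float:
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def ev_just_act(mu: float, sigma: float) -> float:
    return mu  # act unilaterally: you get U, which is mu in expectation

def ev_defer_to_human(mu: float, sigma: float, p_correct: float) -> float:
    z = mu / sigma
    gain_when_good = mu * std_normal_cdf(z) + sigma * std_normal_pdf(z)  # E[U; U > 0]
    loss_when_bad = mu - gain_when_good                                  # E[U; U < 0]
    # Right call: collect the good actions' value. Wrong call: collect the
    # bad actions' value (the good ones get vetoed for 0 instead).
    return p_correct * gain_when_good + (1 - p_correct) * loss_when_bad

# With mu = 1.0, deferring beats acting only when uncertainty is high AND the
# human is modelled as fairly rational:
for sigma in (0.5, 3.0):
    for p_correct in (0.6, 0.99):
        print(sigma, p_correct,
              round(ev_just_act(1.0, sigma), 3),
              round(ev_defer_to_human(1.0, sigma, p_correct), 3))
```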
Other definitions include: some people at CHAI’s definition, where the AI obeys you while it doesn’t yet know what its utility function is; the definition used in the reward tampering paper, which is near the same as EY’s original def, barring the honesty clause, and is formalised in a causal-diagram setting; and Stuart Armstrong’s many definitions, most notably Utility Indifference (sketched below), where the agent is NOT a standard R-maximiser: it accepts having its utility function changed at a later time because you will compensate it for its loss in utility, so it is indifferent to the change (this doesn’t mean it won’t kill you for spare parts, though). And TurnTrout has what look like some interesting thoughts on the topic here, but I haven’t read those yet.
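And here’s a toy sketch of the Utility Indifference compensation trick just mentioned (made-up numbers, and a deliberately simplified version rather than Armstrong’s actual formalism): after a modification, the agent’s score is topped up by a constant equal to the expected-utility gap between the “not modified” and “modified” branches, so in expectation it gains nothing by resisting the modification and nothing by forcing it.

```python
# Toy illustration of Utility Indifference (simplified; made-up numbers).

# Expected utilities the agent foresees for its current best plan:
EU_OLD_IF_NOT_MODIFIED = 10.0  # expected old-utility if the button is never pressed
EU_NEW_IF_MODIFIED = 3.0       # expected new-utility if the button is pressed

def realized_utility(button_pressed: bool, achieved_old: float, achieved_new: float) -> float:
    """Utility the agent is actually scored on.

    If the button is pressed, the agent is scored on the new utility plus a
    compensation constant chosen so that, in expectation, the pressed and
    unpressed branches are worth exactly the same to it. So it has no
    incentive to block the press -- and none to cause it, either.
    """
    if not button_pressed:
        return achieved_old
    compensation = EU_OLD_IF_NOT_MODIFIED - EU_NEW_IF_MODIFIED
    return achieved_new + compensation

# Both branches come out to 10.0 for the agent in expectation:
print(realized_utility(False, achieved_old=10.0, achieved_new=3.0))  # 10.0
print(realized_utility(True, achieved_old=10.0, achieved_new=3.0))   # 3.0 + 7.0 = 10.0
```

Which is also why indifference alone buys you very little safety: the agent doesn’t care whether you press the button, but it doesn’t care about you either, hence the spare-parts caveat.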
Edit2: Paul thinks corrigibility has a simpler core than alignment, but that it’s quite messy and we won’t get a crisp algorithm for it. Still, the intuition is the same as what Eliezer was pointing to, namely that the AI knows it should defer to the human, and will seek to preserve that deference in itself and its offspring. Plus being honest and helpful. Here is a post where he rambles about it.
I’m a little confused about what it hopes to accomplish. To start, I’m a little confused by your example of “preferences not about future states” (i.e. ‘the pizza shop employee is running around frantically, and I am laughing’ is a future state).
But I’m not sure what mixing “paperclips” with “humans remain in control” accomplishes. On the one hand, I think if you can specify “humans remain in control” safely, you’ve solved the alignment problem already. On another, I wouldn’t want that to seize the future: There are potentially much better futures where humans are not in control, but still alive/free/whatever (e.g. the Sophotechs in the Golden Oecumene are very much in control). On a third, I would definitely, a lot, very much, prefer a 3 star ‘paperclips’ and 5 star ‘humans in control’ to a 5 star ‘paperclips’ and a 3 star ‘humans in control’, even though both would average 4 stars?
‘the pizza shop employee is running around frantically, and I am laughing’ is a future state
In my post I wrote: “To be more concrete, if I’m deciding between two possible courses of action, A and B, “preference over future states” would make the decision based on the state of the world after I finish the course of action—or more centrally, long after I finish the course of action. By contrast, “other kinds of preferences” would allow the decision to depend on anything, even including what happens during the course-of-action.”
So “the humans will ultimately wind up in control” would be a preference-over-future-states, and this preference would allow (indeed encourage) the AGI to disempower and later re-empower humans. By contrast, “the humans will remain in control” is not a pure preference-over-future-states, and relatedly does not encourage the AGI to disempower and later re-empower humans.
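A toy way to see that contrast (my own illustration; the boolean “states” and the two scoring rules are invented purely for this example): a pure preference-over-future-states only looks at a plan’s final state, so it can’t tell “never disempower the humans” apart from “disempower them, then hand control back at the end”, whereas a preference over the whole trajectory can.

```python
# Final-state preferences vs. whole-trajectory preferences (toy illustration).

plan_never_disempower = [
    {"humans_in_control": True},
    {"humans_in_control": True},
    {"humans_in_control": True},
]
plan_disempower_then_restore = [
    {"humans_in_control": True},
    {"humans_in_control": False},  # humans disempowered mid-plan...
    {"humans_in_control": True},   # ...then put back in control at the end
]

def score_final_state(plan):
    """'The humans will ultimately wind up in control': only the end matters."""
    return 1.0 if plan[-1]["humans_in_control"] else 0.0

def score_whole_trajectory(plan):
    """'The humans will remain in control': every step along the way matters."""
    return min(1.0 if state["humans_in_control"] else 0.0 for state in plan)

for plan in (plan_never_disempower, plan_disempower_then_restore):
    print(score_final_state(plan), score_whole_trajectory(plan))
# Final-state scoring rates both plans 1.0; trajectory scoring gives the
# disempower-then-restore plan 0.0.
```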
There are potentially much better futures where humans are not in control
If we knew exactly what long-term future we wanted, and we knew how to build an AGI that definitely also wanted that exact same long-term future, then we should certainly do that, instead of making a corrigible AGI. Unfortunately, we don’t know those things right now, so under the circumstances, knowing how to make a corrigible AGI would be a useful thing to know how to do.
Also, this is not a hyper-specific corrigibility proposal; it’s really a general AGI-motivation-sculpting proposal, applied to corrigibility. So even if you’re totally opposed to corrigibility, you can still take an interest in the question of whether or not my proposal is fundamentally doomed. Because I think everyone agrees that AGI-motivation-sculpting is necessary.
I would definitely, a lot, very much, prefer a 3 star ‘paperclips’ and 5 star ‘humans in control’ to a 5 star ‘paperclips’ and a 3 star ‘humans in control’, even though both would average 4 stars?
It could be a weighted average. It could be a weighted average plus a nonlinear acceptability threshold on “humans in control”. It could be other things. I don’t know; this is one of many important open questions.
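For concreteness, here’s a minimal sketch of two of those options (my own illustration; the weights and the threshold are arbitrary placeholders, not a proposal): a plain weighted average, versus a weighted average gated by a hard acceptability threshold on “humans in control”. The gated version captures the quoted intuition that 3-star paperclips / 5-star control should beat 5-star paperclips / 3-star control even though both average 4 stars.

```python
# Two toy ways to combine a "paperclips" score and a "humans in control"
# score, each on a 1-5 scale (weights and threshold are arbitrary placeholders).

def weighted_average(paperclips: float, control: float,
                     w_paperclips: float = 0.3, w_control: float = 0.7) -> float:
    return w_paperclips * paperclips + w_control * control

def thresholded_average(paperclips: float, control: float,
                        min_control: float = 4.0) -> float:
    """Weighted average, but any plan below the 'humans in control'
    acceptability threshold is ruled out entirely."""
    if control < min_control:
        return float("-inf")  # unacceptable, no matter how many paperclips
    return weighted_average(paperclips, control)

# 5-star paperclips / 3-star control  vs.  3-star paperclips / 5-star control:
print(weighted_average(5, 3), weighted_average(3, 5))        # 3.6  4.4
print(thresholded_average(5, 3), thresholded_average(3, 5))  # -inf  4.4
```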
I think if you can specify “humans remain in control” safely, you’ve solved the alignment problem already
See discussion under “Objection 1” in my post.
Am I correct after reading this that this post is heavily related to embedded agency? I may have misunderstood the general attitudes, but I thought of “future states” as “future to now”, not “future to my action”. It seems like you couldn’t possibly create a thing that works on the latter, unless you intend it to set everything in motion and then terminate. In the embedded agency sequence, they point out that embedded agents don’t have well-defined i/o channels. One way is that “action” is not a well-defined term, and is often not atomic.
It also sounds like you’re trying to suggest that we should be judging trajectories, not states? I just want to note that this is, as far as I can tell, the plan: https://www.lesswrong.com/posts/K4aGvLnHvYgX9pZHS/the-fun-theory-sequence
From the synopsis of High Challenge
I’m not sure I interpret corrigibility as exactly the same as “preferring the humans remain in control” (I see you suggest this yourself in Objection 1; I wrote this before I reread that, but I’m going to leave it as is), and if you programmed that preference into a non-corrigible AI, it would still seize the future into states where the humans have to remain in control. Better than doom, but not ideal if we can avoid it with actual corrigibility.
But I think I miscommunicated, because, besides the above, I agree with everything else in those two paragraphs.
I think I maintain that this feels like it doesn’t solve much. Much of the discussion in the Yudkowsky conversations was about the concern of how to point powerful systems in any direction at all. Your response to Objection 1 admits you don’t claim to solve that, but that’s most of the problem. If we do solve the problem of how to point a system at some abstract concept, why would we choose “the humans remain in control” and not “pursue humanity’s CEV”? Do you expect “the humans remain in control” (or the combination of concepts you propose as an alternative) to be easier to define? Enough easier to define that it’s worth pursuing even if we might choose the wrong combination of concepts? Do you expect something else?