It seems to me that for a corrigible, moderately superhuman AI, it is mostly the metaphilosophical competence of the human that matters, rather than that of the AI system. I think there are a bunch of confusions presented here, and I’ll run through them, although let me disclaim that it’s Eliezer’s notion of corrigibility that I’m most familiar with, and so I’m arguing that your critiques fall flat on Eliezer’s version.
“[The AI should] figure out whether I built the right AI and correct any mistakes I made, remain informed about the AI’s behavior and avoid unpleasant surprises, make better decisions and clarify my preferences, acquire resources and remain in effective control of them, ensure that my AI systems continue to do all of these nice things...”
You omitted a key component of the quote that almost entirely reverses its meaning. The correct quote would read [emphasis added]: “[The AI should] help me figure out whether I built the right AI and correct any mistakes I made, remain informed about the AI’s behavior and avoid unpleasant surprises, make better decisions and clarify my preferences, acquire resources and remain in effective control of them, ensure that my AI systems continue to do all of these nice things...”. That is, the AI should help ensure that control continues to reside in the human, rather than in itself.
The messiah would in his heart of hearts have the best of intentions for them, and everyone would know that.
To my understanding, the point of corrigibility is that a corrigible system is supposed to benefit its human operators even if its intentions are somewhat wrong, so it is rather a non sequitur to say that an agent is corrigible because it has the best of intentions in its heart of hearts. If it truly and fully understood human intentions and values, corrigibility might even be unnecessary.
Clearly you’re right that corrigibility is not sufficient for safety. A corrigible agent can still be instructed by its human operators to make a decision that is irreversibly bad. But it seems to help, and to help a lot. The point of a corrigible AI is that once it takes a few murderous actions, you can switch it off, or tell it to pursue a different objective. So for the messiah example, a corrigible messiah might poison a few followers and then, when this is discovered, respond to an instruction to desist. An incorrigible messiah might be asked to stop murdering followers, but continue to do so anyway. So many of the more mundane existential risks would be mitigated by corrigibility.
And what about more exotic ones? I argue they would also be greatly (though not entirely) reduced. Consider that a corrigible messiah may still hide poison for all of the humans at once, leading to an irrevocably terrible outcome. But why should it? If it thinks it is doing well by the humans, then its harmful actions ought to be transparent. Perhaps the AI system’s actions would not be transparent if its intelligence was so radically great that it was inclined to act in fast and incomprehensible ways. But it is hard to see how we could know with confidence that such a radically intelligent AI is the kind we will soon be dealing with. And even if we are going to deal with that kind of AI, there could be other remedies that would be especially helpful in such scenarios. For example, an AI that permits informed oversight of its activities could be superb if it was already corrigible. Then, it could not only provide truthful explanations of its future plans but also respond to feedback on them. Overall, if we had an AI system that was (1) only a little bit superhumanly smart, (2) corrigible, and (3) providing informative explanations of its planned behaviour, then it would seem that we are in a pretty good spot.
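To give a rough sense of how (2) and (3) might fit together in practice, here is a minimal toy sketch (Python; `agent`, `operator`, and every method name are hypothetical placeholders rather than anyone’s actual API) of a propose-explain-approve loop in which no plan is executed before the operators have seen an explanation of it and can veto it, revise it, or shut the system down:

```python
def corrigible_oversight_loop(agent, operator):
    """Toy propose/explain/approve loop.  `agent` and `operator` are
    hypothetical stand-ins for the AI system and its human overseers."""
    while True:
        plan = agent.propose_plan()
        explanation = agent.explain(plan)             # informed oversight: a truthful account of the plan
        verdict = operator.review(plan, explanation)  # humans see the explanation before anything runs

        if verdict == "approve":
            agent.execute(plan)
        elif verdict == "revise":
            # Corrigibility in action: the agent defers to operator feedback
            # rather than insisting on its original plan.
            agent.incorporate_feedback(operator.feedback(plan))
        else:  # "shut down" -- the agent complies instead of resisting
            agent.halt()
            return
```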
“This is absurd. Wouldn’t they obviously have cared about animal suffering if they’d reflected on it, and chosen to do something about it before blissing themselves out?”
Yeah, but they never got around to that before blissing themselves out.
I think you’re making an important point here, but here is how I would put it: If you have an AI system that is properly deferential to humans, you still need to rely on the humans not to give it any existentially catastrophic orders. But the corrigibility/deferential behavior has changed the situation from one in which you’re relying on the metaphilosophical competence of the AI, to one in which you’re relying on the metaphilosophical competence of the human (albeit as filtered through the actions of the AI system). In the latter case, yes, you need to survive having a human’s power increased by some N-fold. (Not necessarily 10^15 as in the more extreme self-improvement scenarios, but by some N>1.) So when you get a corrigible AI, you still need to be very careful with what you tell it to do, but your situation is substantially improved. Note that what I’m saying is at least in some tension with the traditional story of indirect normativity. Rather than trying to give the AI very general instructions for its interpretation, I’m saying that we should in the first instance try to stabilize the world so that we can do more metaphilosophical reasoning ourselves before trying to program an AI system that can carry out the conclusions of that thinking or perhaps continue it.
Would it want to? I think yes, because it’s incentivized not to optimize for human values, but to turn humans into yes-men… The only thing I can imagine that would robustly prevent this manipulation is to formally guarantee the AI to be metaphilosophically competent itself.
Yes, an approval-directed agent might reward-hack by causing the human to approve of things that the human does not actually value. And it might compromise the humans’ reasoning abilities while doing so. But why must the AI system’s metaphilosophical competence be the only defeater? Why couldn’t this be achieved by quantilizing, or otherwise throttling the agent’s capabilities? By restricting the agent’s activities to some narrow domain? By having the agent somehow be deeply uncertain about where the human’s approval mechanism resides? None of these seems clearly viable, but neither do any of them seem clearly impossible, especially in cases where the AI system’s capabilities are overall not far beyond those of its human operators.
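To make the quantilization option slightly more concrete, here is a minimal toy sketch (Python; the action list, `base_policy`, and `estimated_utility` are hypothetical placeholders, not anyone’s actual proposal): rather than taking the argmax action, a q-quantilizer samples from the top q-fraction of a trusted base distribution ranked by estimated utility, which limits how hard the agent can push into extreme, potentially manipulative strategies.

```python
import random

def quantilize(actions, base_policy, estimated_utility, q=0.1):
    """Toy q-quantilizer: sample an action from the top q-fraction of a
    trusted base distribution, ranked by estimated utility, instead of
    taking the argmax.  `base_policy` gives each action's probability
    under the base distribution; `estimated_utility` scores actions."""
    # Rank actions from best to worst by the agent's utility estimate.
    ranked = sorted(actions, key=estimated_utility, reverse=True)

    # Walk down the ranking until we have accumulated q probability mass
    # under the base distribution; these actions form the top q-quantile.
    top, mass = [], 0.0
    for a in ranked:
        top.append(a)
        mass += base_policy(a)
        if mass >= q:
            break

    # Sample within the top quantile in proportion to the base distribution,
    # rather than deterministically picking the single highest-utility action.
    weights = [base_policy(a) for a in top]
    return random.choices(top, weights=weights, k=1)[0]

# Toy usage: uniform base policy over a few hypothetical action strings,
# with string length standing in for an estimated-utility score.
actions = ["ask operator", "pause and report", "bold unilateral plan"]
print(quantilize(actions,
                 base_policy=lambda a: 1.0 / len(actions),
                 estimated_utility=lambda a: float(len(a)),
                 q=0.5))
```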
Overall, I’d say superintelligent messiahs are sometimes corrigible, and they’re more likely to be aligned if so.
Overall, my impression is that you thought I was saying, “A corrigible AI might turn against its operators and kill us all, and the only way to prevent that is by ensuring the AI is metaphilosophically competent.” I was really trying to say “A corrigible AI might not turn against its operators and might not kill us all, and the outcome can still be catastrophic. To prevent this, we’d definitely want our operators to be metaphilosophically competent, and we’d definitely want our AI to not corrupt them. The latter may be simple to ensure if the AI isn’t broadly superhumanly powerful, but may be difficult to ensure if the AI is broadly superhumanly powerful and we don’t have formal guarantees”. My sense is that we actually agree on the latter and agree that the former is wrong. Does this sound right? (I do think it’s concerning that my original post led you to reasonably interpret me as saying the former. Hopefully my edits make this clearer. I suspect part of what happened is that I think humans surviving but producing astronomical moral waste is about as bad as human extinction, so I didn’t bother delineating them, even though this is probably an unusual position.)
See below for individual responses.
You omitted a key component of the quote that almost entirely reverses its meaning. The correct quote would read [emphasis added]: “[The AI should] help me figure out whether I built the right AI and correct any mistakes I made, remain informed about the AI’s behavior and avoid unpleasant surprises, make better decisions and clarify my preferences, acquire resources and remain in effective control of them, ensure that my AI systems continue to do all of these nice things...”. That is, the AI should help ensure that control continues to reside in the human, rather than in itself.
I edited my post accordingly. This doesn’t change my perspective at all.
To my understanding, the point of corrigibility is that a corrigible system is supposed to benefit its human operators even if its intentions are somewhat wrong, so it is rather a non sequitur to say that an agent is corrigible because it has the best of intentions in its heart of hearts. If it truly and fully understood human intentions and values, corrigibility might even be unnecessary.
I gave that description to illustrate one way the messiah is like a corrigible agent, which does have the best of intentions (to help its operators), not to imply that a well-intentioned agent is corrigible. I edited it to “In his heart of hearts, the messiah would be trying to help them, and everyone would know that.” Does that make it clearer?
Clearly you’re right that corrigibility is not sufficient for safety. A corrigible agent can still be instructed by its human operators to make a decision that is irreversibly bad. But it seems to help, and to help a lot. The point of a corrigible AI is that once it takes a few murderous actions, you can switch it off, or tell it to pursue a different objective. So for the messiah example, a corrigible messiah might poison a few followers and then, when this is discovered, respond to an instruction to desist. An incorrigible messiah might be asked to stop murdering followers, but continue to do so anyway. So many of the more mundane existential risks would be mitigated by corrigibility.
I agree that corrigibility helps a bunch with mundane existential risks, and think that e.g. a corrigible misaligned superintelligence is unlikely to lead to self-annihilation, but pretty likely to lead to astronomical moral waste. I edited from “Surely, we wouldn’t build a superintelligence that would guide us down such an insidious path?” to “I don’t think a corrigible superintelligence would guide us down such an insidious path. I even think it would substantially improve the human condition, and would manage to avoid killing us all. But I think it might still lead us to astronomical moral waste.” Does this clarify things?
And what about more exotic ones? I argue they would also be greatly (though not entirely) reduced. Consider that a corrigible messiah may still hide poison for all of the humans at once, leading to an irrevocably terrible outcome. But why should it? If it thinks it is doing well by the humans, then its harmful actions ought to be transparent. Perhaps the AI system’s actions would not be transparent if its intelligence was so radically great that it was inclined to act in fast and incomprehensible ways. But it is hard to see how we could know with confidence that such a radically intelligent AI is the kind we will soon be dealing with. And even if we are going to deal with that kind of AI, there could be other remedies that would be especially helpful in such scenarios. For example, an AI that permits informed oversight of its activities could be superb if it was already corrigible. Then, it could not only provide truthful explanations of its future plans but also respond to feedback on them. Overall, if we had an AI system that was (1) only a little bit superhumanly smart, (2) corrigible, and (3) providing informative explanations of its planned behaviour, then it would seem that we are in a pretty good spot.
I agree with all this.
I think you’re making an important point here, but here is how I would put it: If you have an AI system that is properly deferential to humans, you still need to rely on the humans not to give it any existentially catastrophic orders. But the corrigibility/deferential behavior has changed the situation from one in which you’re relying on the metaphilosophical competence of the AI, to one in which you’re relying on the metaphilosophical competence of the human (albeit as filtered through the actions of the AI system). In the latter case, yes, you need to survive having a human’s power increased by some N-fold. (Not necessarily 10^15 as in the more extreme self-improvement scenarios, but by some N>1.) So when you get a corrigible AI, you still need to be very careful with what you tell it to do, but your situation is substantially improved. Note that what I’m saying is at least in some tension with the traditional story of indirect normativity.
I also agree with all this.
Rather than trying to give the AI very general instructions for its interpretation, I’m saying that we should in the first instance try to stabilize the world so that we can do more metaphilosophical reasoning ourselves before trying to program an AI system that can carry out the conclusions of that thinking or perhaps continue it.
I also agree with all this. I never imagined giving the corrigible AI extremely general instructions for its interpretation.
Yes, an approval-directed agent might reward-hack by causing the human to approve of things that the human does not actually value. And it might compromise the humans’ reasoning abilities while doing so. But why must the AI system’s metaphilosophical competence be the only defeater? Why couldn’t this be achieved by quantilizing, or otherwise throttling the agent’s capabilities? By restricting the agent’s activities to some narrow domain? By having the agent somehow be deeply uncertain about where the human’s approval mechanism resides? None of these seems clearly viable, but neither do any of them seem clearly impossible, especially in cases where the AI system’s capabilities are overall not far beyond those of its human operators.
I think this captures my biggest update from your comment, and I modified the ending of this post to reflect it. Throttling the AI’s power seems more viable than I’d previously thought, and seems like a pretty good way to significantly lower the risk of manipulation. That said, I think even extraordinary human persuaders might compromise human reasoning abilities, and I have fast takeoff intuitions that make it very hard for me to imagine an AGI that simultaneously:
- understands humans well enough to be corrigible
- is superhumanly intelligent at engineering or global strategy
- isn’t superhumanly capable of persuasion
- wouldn’t corrupt humans (even if it tried to not corrupt them)
I haven’t thought too much about this though, and this might just be a failure of my imagination.
Overall, I’d say superintelligent messiahs are sometimes corrigible, and they’re more likely to be aligned if so.
Agreed. Does the new title seem better? I was mostly trying to explicate a distinction between corrigibility and alignment, which was maybe obvious to you beforehand, and illustrate the astronomical waste that can result even if we avoid self-annihilation.
A corrigible AI might not turn against its operators and might not kill us all, and the outcome can still be catastrophic. To prevent this, we’d definitely want our operators to be metaphilosophically competent, and we’d definitely want our AI to not corrupt them.
I agree with this.
a corrigible misaligned superintelligence is unlikely to lead to self-annihilation, but pretty likely to lead to astronomical moral waste.
There’s a lot of broad model uncertainty here, but yes, I’m sympathetic to this position.
Does the new title seem better?
Yep.
At this round of edits, my main objection would be to the remark that the AI wants us to act as yes-men, which seems dubious if the agent is (i) an Act-based agent or (ii) sufficiently broadly uncertain over values.
What I see to be the main message of the article as currently written is that humans controlling a very powerful tool (especially AI) could drive themselves into a suboptimal fixed point due to insufficient philosophical sophistication.
What I see to be the main message of the article as currently written is that humans controlling a very powerful tool (especially AI) could drive themselves into a suboptimal fixed point due to insufficient philosophical sophistication.
This I agree with. Thanks Ryan!
Hurrah!
At this round of edits, my main objection would be to the remark that the AI wants us to act as yes-men, which seems dubious if the agent is (i) an Act-based agent or (ii) sufficiently broadly uncertain over values.
I no longer think it wants us to turn into yes-men, and edited my post accordingly. I still think it will be incentivized to corrupt us, and I don’t see how being an act-based agent would be sufficient, though it’s likely I’m missing something. I agree that if it’s sufficiently broadly uncertain over values then we’re likely to be fine, but in my head that unpacks into “if we knew the AI were metaphilosophically competent enough, we’d be fine”, which doesn’t help things much.