Hi Nate, interesting work.
I don’t understand the assumptions behind the corrigibility problem. According to the intelligence explosion thesis, a self-improving AI will spend very little time in the near-human intelligence interval. Thus, most of the time it will be either far subhuman or far superhuman. In the far subhuman region, the AI’s manipulations of its programmers don’t seem to be a concern. In the far superhuman region, fixing bugs seems to be way too late. In addition, it seems infeasible to debug the AI at this stage, since it would have rewritten its own source code into something humans probably cannot understand.
When building an AGI, it’s quite prudent to expect that you didn’t get everything exactly right on the first try. Therefore, it’s important to build systems that are amenable to modification, that don’t tile the universe halfway through value loading, etc. etc. In other words, even if you could kick off an intelligence explosion that quickly makes a system which you have no control over, this is probably a bad plan if you haven’t done a whole hell of a lot of hard work verifying the system and ensuring that it is aligned with your interests first, and so on. Corrigibility is the study of reasoning systems that are amenable to modification in that window between “starting to build models of its operators” and “it doesn’t matter what you try to do anymore.”
You might be right that that window would be small by default, but it’s pretty important to make that window as wide as possible in order to attain good outcomes.
Thx for replying!
Let me see if I understood the assumptions correctly:
1. We have a way of keeping the AGI’s evolution in check so that it arrives at near-human level but doesn’t go far superhuman. For example, we limit the available RAM, there is a theorem giving a lower bound on space complexity per given level of intelligence (rigorously quantified in some way), and there is a way to measure human intelligence on the same scale. Alternatively, the amount of RAM doesn’t give a strong enough bound by itself, but it does when combined with a limit on the evolution time starting from the seed AI (see the toy sketch after this list).
2. We are reasonably confident that the AGI follows the utility function we program into it, and that this property is stable under self-modification.
3. We are not reasonably confident that the utility function used in practice is actually friendly (although a serious attempt to make it friendly has been made).
4. We are reasonably confident in our ability to formally describe conditions such as “shutdown when the button is pressed”.
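To illustrate the kind of cap I have in mind in assumption 1, here is a minimal toy sketch; everything in it (the class, the budget numbers, the memory proxy) is a made-up placeholder, not a concrete proposal:

```python
# Toy sketch of the cap in assumption 1: a self-improvement loop halted by
# a hard RAM budget and a hard step budget. All names and numbers here are
# illustrative placeholders.
import sys

MAX_RAM_BYTES = 8 * 2**30   # the RAM cap
MAX_STEPS = 10_000          # the cap on evolution time starting from the seed AI

class SeedAI:
    def __init__(self):
        self.state = []                   # stands in for the agent's learned state

    def footprint(self) -> int:
        return sys.getsizeof(self.state)  # crude proxy for RAM actually in use

    def improve_once(self) -> None:
        self.state.append(0)              # stands in for one self-improvement step

agent = SeedAI()
for step in range(MAX_STEPS):              # the time bound
    if agent.footprint() > MAX_RAM_BYTES:  # the space bound
        break
    agent.improve_once()

# The hope in assumption 1 is that the space bound and the time bound together
# imply an upper bound on capability, keeping the system near-human rather than
# far superhuman.
```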
Is this about right?
Those points roughly describe the assumptions you’d have to make to think that the shutdown problem in particular (and “solutions” such as the one in the paper) were valid, I suppose. The study of corrigibility more broadly does not depend upon these assumptions necessarily, though—the overall question that the study of corrigibility attempts to answer is this: given that your utility function probably won’t be “friendly” on the first try, what sort of system do you want to build? As you’ve noticed, there are many open problems in this space, including:
1. How do you build a system such that you can be reasonably confident that you know what it’s going to optimize in the first place?
2. How do you build something that doesn’t undergo a hard takeoff when you’re not looking?
3. How do you specify something like “shut down” (or “have a low impact”), given that there’s no privileged null action?
4. How do you build a generally intelligent system that doesn’t optimize too much in any direction?
5. How do you build a system that avoids the default incentives to manipulate/deceive?
These are all FAI problems that have nothing to do with directly specifying a “friendly utility function.” (5) is definitely a corrigibility problem; (2), (3), and (4) could be seen as loosely falling under the broad “corrigibility” umbrella, and (1) I’d characterize as a “highly reliable agent designs” problem. The corrigibility paper pretty much only touches on (5) (and even then indirectly); it’s better to look at that formalization of the “shutdown problem” not as an attempt at an answer to these problems, per se, but as an early attempt to carve out a toy model of the problem where we can start messing around and looking for insights that generalize to the problems at large.
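For concreteness, the toy model there looks roughly like this (glossing over details): the agent picks an action, then observes whether or not a shutdown button has been pressed, then picks a second action; its utility is the normal utility function U_N if the button wasn’t pressed and the shutdown utility function U_S if it was:

```latex
% Rough sketch of the shutdown toy model (details suppressed): U_N is the
% normal-operation utility, U_S the shutdown utility, Press the button-press event.
\[
U(a_1, o, a_2) =
\begin{cases}
U_N(a_1, o, a_2) & \text{if } o \notin \mathrm{Press}, \\
U_S(a_1, o, a_2) & \text{if } o \in \mathrm{Press}.
\end{cases}
\]
```

The hard part isn’t writing down U; it’s combining U_N and U_S so that the resulting agent actually shuts down when the button is pressed, has no incentive to prevent the button from being pressed, and has no incentive to cause it to be pressed (nor to build subagents that do either of those things).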
Thanks, that is a good explanation.
Regarding problem 5, one approach I thought of is what I call “epistemic boxing”. Namely, we put the AGI in a virtual world (“box”) and program it to optimize expected utility over a “hard-coded” (stochastic) model of the box rather than over a Solomonoff measure. This assumes the utility function is given explicitly in terms of the box’s degrees of freedom.
Such an AGI can still recursively self-improve and become superintelligent; however, it will never escape the box, since that possibility is a non sequitur in its epistemology. In particular, the box can have external inputs, but the AGI will model them as e.g. random noise and won’t attempt to continue whatever pattern they contain (it will always consider the pattern “accidental”).
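Here is a minimal sketch of what I mean, assuming a toy box with a hard-coded transition model; all states, names and numbers are made up for illustration:

```python
# Toy sketch of "epistemic boxing": the agent scores actions by expected utility
# under a HARD-CODED stochastic model of the box, never under a learned
# (Solomonoff-style) model of whatever actually generates its inputs.
# All states, numbers, and names are illustrative, not part of the proposal.

ACTIONS = ["wait", "work"]

# Hard-coded model: P(next_state | state, action). The external input channel is
# folded in as fixed noise (e.g. the 0.1 entries); the agent never updates these
# probabilities, no matter what pattern the real inputs turn out to contain.
MODEL = {
    ("idle", "wait"):    {"idle": 0.9, "working": 0.1},
    ("idle", "work"):    {"idle": 0.1, "working": 0.9},
    ("working", "wait"): {"idle": 0.1, "working": 0.8, "done": 0.1},
    ("working", "work"): {"working": 0.3, "done": 0.7},
    ("done", "wait"):    {"done": 1.0},
    ("done", "work"):    {"done": 1.0},
}

# The utility function is given directly in terms of the box's degrees of freedom.
UTILITY = {"idle": 0.0, "working": 0.1, "done": 1.0}

def value(state: str, horizon: int) -> float:
    """Finite-horizon value of `state` under the hard-coded MODEL only."""
    if horizon == 0:
        return UTILITY[state]
    return max(
        sum(p * value(nxt, horizon - 1) for nxt, p in MODEL[(state, a)].items())
        for a in ACTIONS
    )

def choose_action(state: str, horizon: int = 3) -> str:
    # Escaping the box is not an event this model can even represent, so no plan
    # that involves it ever gets evaluated.
    return max(
        ACTIONS,
        key=lambda a: sum(p * value(nxt, horizon - 1)
                          for nxt, p in MODEL[(state, a)].items()),
    )

print(choose_action("idle"))  # -> "work" under this toy model
```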
Regarding question 2, I think there is a non-negligible probability that it is unsolvable. That is not to say we shouldn’t look for solutions, but IMO we should be prepared for the possibility that there are none.