Those points roughly describe the assumptions you’d have to make to think that the shutdown problem in particular (and “solutions” such as the one in the paper) were valid, I suppose. The study of corrigibility more broadly doesn’t necessarily depend on those assumptions, though. The overall question the study of corrigibility tries to answer is this: given that your utility function probably won’t be “friendly” on the first try, what sort of system do you want to build? As you’ve noticed, there are many open problems in this space, including:
1. How do you build a system such that you can be reasonably confident that you know what it’s going to optimize in the first place?
2. How do you build something that doesn’t undergo a hard takeoff when you’re not looking?
3. How do you specify something like “shut down” (or “have a low impact”), given that there’s no privileged null action?
4. How do you build a generally intelligent system that doesn’t optimize too much in any direction?
5. How do you build a system that avoids the default incentives to manipulate/deceive?
These are all FAI problems that have nothing to do with directly specifying a “friendly utility function.” (5) is definitely a corrigibility problem; (2), (3), and (4) could be seen as loosely falling under the broader “corrigibility” umbrella; and (1) I’d characterize as a “highly reliable agent designs” problem. The corrigibility paper pretty much only touches on (5), and even then indirectly. It’s better to look at that formalization of the “shutdown problem” not as an attempted answer to these problems, per se, but as an early attempt to carve out a toy model where we can start messing around and looking for insights that generalize to the larger problems.
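For concreteness, the toy model in the paper has roughly this shape (I’m reconstructing it from memory, so the details may differ slightly): the agent chooses an action $a_1$, the programmers observe it and either press the shutdown button or don’t, and the agent then chooses a follow-up action $a_2$. The utility function it maximizes is a composite:

$$
U(a_1, o, a_2) =
\begin{cases}
U_N(a_1, a_2) & \text{if the button is not pressed} \\
U_S(a_1, a_2) & \text{if the button is pressed}
\end{cases}
$$

where $U_N$ is the “normal” utility function we don’t fully trust and $U_S$ is a utility function that rewards shutting down promptly. The shutdown problem, in this framing, is that for almost any choice of $U_N$ and $U_S$, a $U$-maximizer will pick $a_1$ partly to influence whether the button gets pressed (either causing or preventing it), rather than remaining indifferent and leaving that decision to the programmers.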
Thanks, that is a good explanation.
Regarding problem 5, one approach I thought of is what I call “epistemic boxing”: we put the AGI in a virtual world (the “box”) and program it to maximize expected utility over a hard-coded stochastic model of the box, rather than over a Solomonoff measure. This assumes the utility function is given explicitly in terms of the box’s degrees of freedom.
Such an AGI can still recursively self-improve and become superintelligent, but it will never escape the box, since that possibility is a non-sequitur within its epistemology. In particular, the box can have external inputs, but the AGI will model them as (say) random noise and won’t attempt to continue whatever pattern they contain; it will always treat any pattern as accidental.
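To make that a bit more concrete, here is a toy sketch in Python (everything here, from the three-state “box” to the coin-flip input model, is invented purely for illustration; it’s not a proposal for a real architecture). The agent plans by expectimax over a fixed, hard-coded stochastic model of the box, the utility function is defined directly over the box’s states, and the external input channel is modeled as a fair coin no matter what data actually arrives on it.

```python
# Toy illustration of "epistemic boxing": the agent's world model is fixed in
# advance, so no evidence can ever push it toward modeling anything outside the box.

STATES = [0, 1, 2]                     # the box's internal degrees of freedom
ACTIONS = ["left", "stay", "right"]

def box_model(state, action, ext_bit):
    """Hard-coded transition model of the box.
    The external input bit is treated as pure noise, never as a signal to decode."""
    delta = {"left": -1, "stay": 0, "right": 1}[action]
    if ext_bit == 1:                   # noise occasionally blocks movement
        delta = 0
    return max(0, min(len(STATES) - 1, state + delta))

def utility(state):
    """Utility given explicitly in terms of the box's degrees of freedom."""
    return float(state)

def expected_utility(state, action, horizon):
    """Expectimax over the fixed model: the external bit is averaged out as a
    fair coin, regardless of any pattern in the actual input stream."""
    total = 0.0
    for ext_bit in (0, 1):
        nxt = box_model(state, action, ext_bit)
        if horizon == 1:
            total += 0.5 * utility(nxt)
        else:
            total += 0.5 * max(expected_utility(nxt, a, horizon - 1) for a in ACTIONS)
    return total

def plan(state, horizon=3):
    """Pick the action maximizing expected utility under the hard-coded box model."""
    return max(ACTIONS, key=lambda a: expected_utility(state, a, horizon))

print(plan(0))                         # -> "right": climb toward the high-utility state
```

The particular numbers don’t matter; the point is that the expectation is always taken over the fixed coin-flip model, so no input stream, however structured, can ever give the agent a reason to reach outside the box.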
Regarding problem 2, I think there is a non-negligible probability that it is unsolvable. That is not to say we shouldn’t look for solutions, but IMO we should be prepared for the possibility that there are none.