[Epistemic status: Unpolished conceptual exploration, possibly of concepts that are extremely obvious and/or have already been discussed. Abandoning concerns about obviousness, previous discussion, polish, fitting the list-of-principles frame, etc. in favor of saying anything at all.] [ETA: Written in about half an hour, with some distraction and wording struggles.]
What is the hypothetical ideal of a corrigible AI? Without worrying about whether it can be implemented in practice or is even tractable to design, just as a theoretical reference to compare proposals to?
I propose that the hypothetical ideal is not an AI that lets the programmer shut it down, but an AI that wants to be corrected—one that will allow a programmer to work on it while it is live and aid the programmer by honestly explaining the results of any changes. It is entirely plausible that this is not achievable by currently-known techniques because we don’t know how to do “caring about a world-state rather than a sensory input / reward signal,” let alone “actually wanting to fulfill human values but being uncertain about those values”, but this still seems to me the ideal.
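To make the "uncertain about human values, so it welcomes correction" intuition a bit more concrete, here is a toy expected-value sketch with entirely made-up numbers (not a proposal for how to actually build such an agent): an agent that is unsure whether its current best guess matches what the human wants compares acting immediately against letting the human correct it first. As long as the uncertainty is non-trivial and correction is cheap, the corrected plan comes out ahead, so the agent prefers being corrected.

```python
# Toy numbers only: an agent uncertain which outcome the human actually wants
# compares "act on my best guess now" with "let the human correct me first".

credence_best_guess_is_right = 0.6   # agent's confidence in its current guess
payoff_if_right = 1.0                # the human got what they wanted
payoff_if_wrong = -1.0               # the human got something they did not want
correction_cost = 0.1                # small delay / effort of being corrected

# Acting now gambles on the guess being right.
ev_act_now = (credence_best_guess_is_right * payoff_if_right
              + (1 - credence_best_guess_is_right) * payoff_if_wrong)

# Being corrected resolves the uncertainty before acting, so the agent ends up
# doing the thing the human actually wanted, minus the cost of the correction.
ev_accept_correction = payoff_if_right - correction_cost

print(f"act on current best guess:     {ev_act_now:.2f}")            # 0.20
print(f"let the human correct it first: {ev_accept_correction:.2f}")  # 0.90
```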
Suppose such an AI is asked to place a strawberry on the bottom plate of a stack of plates. It would rather set the rest of the plates aside non-destructively than smash them, because it is uncertain about what humans would prefer to be done with those plates and leaving them intact allows more future options. It would rather take the strawberry from a nearby bowl than create a new strawberry plantation, because it is uncertain about what humans would prefer to be done with the resources a new plantation would consume. Likewise, it would rather not run off and develop nanofabrication. It would rather make a decent attempt and then ask the human for feedback instead of turning Earth into computronium to verify the placement of the strawberry, because, again, it is uncertain about the ideal use of those resources. It would rather not deceive the human asker or the programmer, because deceiving humans reduces the expected value of future corrections. This seems to me to be what is actually wanted from considerations like “low impact”, “myopia”, “task uncertainty”, “satisficing”, and so on.
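The same kind of toy arithmetic (again, made-up numbers, only an illustration) captures the option-value point about the plates: as long as the agent assigns any real probability to the humans wanting the plates, the reversible plan dominates the irreversible one, and the same schema applies to the plantation and the computronium.

```python
# Toy numbers only: why uncertainty about what humans want done with the rest
# of the stack favors the reversible plan over the irreversible one.

p_humans_want_plates = 0.5        # the agent genuinely does not know
strawberry_value = 1.0            # both plans do place the strawberry
plate_value_if_wanted = 2.0       # value of intact plates in worlds where humans want them
effort_of_setting_aside = 0.1     # small extra cost of the careful plan

# Smashing is irreversible: in the worlds where humans wanted the plates,
# that value is simply gone.
ev_smash = strawberry_value

# Setting the plates aside keeps the option open: if humans wanted them,
# they still have them; if not, nothing of value was lost.
ev_set_aside = (strawberry_value - effort_of_setting_aside
                + p_humans_want_plates * plate_value_if_wanted)

print(f"smash the stack:      {ev_smash:.2f}")      # 1.00
print(f"set the plates aside: {ev_set_aside:.2f}")  # 1.90
```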
The list of principles should flow from considering the ideal and obstacles to getting there, along with security-mindset considerations. Just because you believe your AI is safe given unfettered Internet access doesn’t mean you should give it unfettered Internet access—but if you don’t believe your AI is safe given unfettered Internet access, this is a red flag that it is “working against you” on some level.