this seems closer to ‘overseer training the AI to be corrigible to a generic user and then “plugging in” a real user into the system at a later time’.
The overseer asks the question “what should the agent do [to be corrigible to the Google customer Alice it is currently working for]?”, and indeed even at training time the overseer is training the system to answer this question. There is no swapping out at test time. (The distributions at train and test time are identical, and I normally talk about the version where you keep training online.)
When the user asks a question to the agent it is being answered by indirection, by using the question-answering system to answer “what should the agent do [in the situation when it has been asked question Q by the user]?”
The overseer asks the question “what should the agent do [to be corrigible to the Google customer Alice it is currently working for]?“
Ok, I’ve been trying to figure out what would make the most sense and came to the same conclusion. I would also note that this “corrigible” is substantially different from the “corrigible” in “the AI is corrigible to the question asker” because it has to be an explicit form of corribility that is limited by things like corporate policy. For example if Alice asks “What are your design specs and source code?” or “How do I hack into this bank?” then the AI wouldn’t answer even though it’s supposed to be “corrigible” to the user, right? Maybe we need modifiers to indicate which corrigibility we’re talking about, like “full corrigibility” vs “limited corrigibility”?
ETA: Actually, does it even make sense to use the word “corrigible” in “to be corrigible to the Google customer Alice it is currently working for”? Originally “corrigible” meant:
A corrigible agent experiences no preference or instrumental pressure to interfere with attempts by the programmers or operators to modify the agent, impede its operation, or halt its execution.
But obviously Google’s AI is not going to allow a user to “modify the agent, impede its operation, or halt its execution”. Why use “corrigible” here instead of different language altogether, like “helpful to the extent allowed by Google policies”?
The overseer asks the question “what should the agent do [to be corrigible to the Google customer Alice it is currently working for]?”, and indeed even at training time the overseer is training the system to answer this question. There is no swapping out at test time. (The distributions at train and test time are identical, and I normally talk about the version where you keep training online.)
When the user asks a question to the agent it is being answered by indirection, by using the question-answering system to answer “what should the agent do [in the situation when it has been asked question Q by the user]?”
Ok, I’ve been trying to figure out what would make the most sense and came to the same conclusion. I would also note that this “corrigible” is substantially different from the “corrigible” in “the AI is corrigible to the question asker” because it has to be an explicit form of corribility that is limited by things like corporate policy. For example if Alice asks “What are your design specs and source code?” or “How do I hack into this bank?” then the AI wouldn’t answer even though it’s supposed to be “corrigible” to the user, right? Maybe we need modifiers to indicate which corrigibility we’re talking about, like “full corrigibility” vs “limited corrigibility”?
ETA: Actually, does it even make sense to use the word “corrigible” in “to be corrigible to the Google customer Alice it is currently working for”? Originally “corrigible” meant:
But obviously Google’s AI is not going to allow a user to “modify the agent, impede its operation, or halt its execution”. Why use “corrigible” here instead of different language altogether, like “helpful to the extent allowed by Google policies”?