Note that humans play two distinct roles in IDA, and I think it’s important to separate them
This seems like a really important clarification, but in your article on corrigibility, you only ever talk about one human, the overseer, and the whole argument about “basin of attraction” seems to rely on a single human serving as the trainer for corrigibility, the target of corrigibility, and the source of preferences:
But a corrigible agent prefers to build other agents that share the overseer’s preferences — even if the agent doesn’t yet share the overseer’s preferences perfectly. After all, even if you only approximately know the overseer’s preferences, you know that the overseer would prefer the approximation get better rather than worse.
I think in that post, the overseer is training the AI to specifically be corrigible to herself, which makes the AI aligned to herself. I’m not sure what is happening in the new scheme with two humans. Is the overseer now still training the AI to be corrigible to herself, which produces an AI that’s aligned to the overseer which then helps out the user because the overseer has a preference to help out the user? Or is the overseer training the AI to be corrigible to a generic user and then “plugging in” a real user into the system at a later time? If the latter, have you checked that the “basin of attraction” argument still applies? If it does, maybe that post needs to be rewritten to make that clearer?
This seems like a really important clarification, but in your article on corrigibility, you only ever talk about one human, the overseer, and the whole argument about “basin of attraction” seems to rely on a single human serving as the trainer for corrigibility, the target of corrigibility, and the source of preferences:
Corrigibility plays a role both within amplification and in the final agent.
The post is mostly talking about the final agent without talking about IDA specifically.
The section titled Amplification is about the internal dynamics, where behavior is corrigible by the question-asker. It doesn’t seem important to me that these be the same. Corrigibility to the overseer only leads to corrigibility to the end user if the overseer is appropriately motivated. I usually imagine the overseer as something like a Google engineer and the end user as something like a visitor to google.com today. The resulting agent will likely be imperfectly corrigible because of the imperfect motives of Google engineers (this is pretty similar to human relationships around other technologies).
I’m no longer as convinced that corrigibility is the right abstraction for reasoning about internal behavior within amplification (but am still pretty convinced that it’s a good way to reason about the external behavior, and I do think “corrigible” is closer to what we want than “benign” was). I’ve been thinking about these issues recently and it will be touched on in an upcoming post.
Is the overseer now still training the AI to be corrigible to herself, which produces an AI that’s aligned to the overseer which then helps out the user because the overseer has a preference to help out the user?
This is basically right. I’m usually imagining the overseer training a general question-answering system, with the AI trained to be corrigible to the question-asker. We then use that question-answering system to implement a corrigible agent, by using it to answer questions like “What should the agent do next?” (with an appropriate specification of ‘should’), which is where external corrigibility comes in.
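The indirection described above can be sketched in a few lines. This is only an illustrative stub, assuming nothing about any real IDA implementation: `answer` stands in for the trained question-answering system, and all names are hypothetical.

```python
# Hypothetical sketch of implementing an agent via a question-answering
# system, as described above. The QA system is a stub; a real one would
# be a learned model trained by the overseer.

def answer(question: str) -> str:
    """Stand-in for the overseer-trained question-answering system."""
    return f"<answer to: {question}>"

def agent_step(observation: str) -> str:
    # External corrigibility enters here: the agent's next action is
    # whatever the QA system says it *should* do (for an appropriate
    # specification of "should").
    return answer(f"What should the agent do next, given {observation!r}?")

print(agent_step("user asked the agent to halt"))
```

The point of the sketch is that the agent itself has no separate policy: every action is obtained by querying the QA system with a “should” question.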
This is basically right. I’m usually imagining the overseer training a general question-answering system, with the AI trained to be corrigible to the question-asker.
This confuses me because you’re saying “basically right” to something but then you say something that seems very different, and which actually seems closer to the other option I was suggesting. Isn’t it very different for the overseer to train the AI to be corrigible to herself as a specific individual, versus training the AI to be corrigible to whoever is asking the current question? Since the AI can’t know who is asking the current question (which seems necessary to be corrigible to them?) without that being passed in as additional information, this seems closer to ‘overseer training the AI to be corrigible to a generic user and then “plugging in” a real user into the system at a later time’.
I also have a bunch of other confusions, but it’s probably easier to talk about them after resolving this one.
(Also, just in case, is there a difference between “corrigible to” and “corrigible by”?)
this seems closer to ‘overseer training the AI to be corrigible to a generic user and then “plugging in” a real user into the system at a later time’.
The overseer asks the question “what should the agent do [to be corrigible to the Google customer Alice it is currently working for]?”, and indeed even at training time the overseer is training the system to answer this question. There is no swapping out at test time. (The distributions at train and test time are identical, and I normally talk about the version where you keep training online.)
When the user asks a question to the agent it is being answered by indirection, by using the question-answering system to answer “what should the agent do [in the situation when it has been asked question Q by the user]?”
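A minimal sketch of this indirection, under the same assumptions as before (stub QA system, hypothetical names): the user’s question Q is never passed through verbatim, but embedded into a “what should the agent do” question.

```python
# Illustrative sketch: the user's question is answered by indirection,
# by asking the overseer-trained QA system what the agent should do in
# the situation where it has been asked that question.

def qa_system(question: str) -> str:
    """Stand-in for the overseer-trained question-answering system."""
    return f"[QA output for: {question}]"

def handle_user_question(user: str, q: str) -> str:
    # Indirection: rather than answering q directly, ask what the agent
    # should do, given that the user has asked q.
    meta_question = (
        f"What should the agent do in the situation where it has been "
        f"asked the question {q!r} by the user {user!r}?"
    )
    return qa_system(meta_question)

print(handle_user_question("Alice", "How do I reset my password?"))
```

Note that the train and test distributions match by construction here: the QA system only ever sees “what should the agent do” questions, at training time and at deployment.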
The overseer asks the question “what should the agent do [to be corrigible to the Google customer Alice it is currently working for]?”
Ok, I’ve been trying to figure out what would make the most sense and came to the same conclusion. I would also note that this “corrigible” is substantially different from the “corrigible” in “the AI is corrigible to the question-asker”, because it has to be an explicit form of corrigibility that is limited by things like corporate policy. For example, if Alice asks “What are your design specs and source code?” or “How do I hack into this bank?” then the AI wouldn’t answer, even though it’s supposed to be “corrigible” to the user, right? Maybe we need modifiers to indicate which corrigibility we’re talking about, like “full corrigibility” vs. “limited corrigibility”?
ETA: Actually, does it even make sense to use the word “corrigible” in “to be corrigible to the Google customer Alice it is currently working for”? Originally “corrigible” meant:
A corrigible agent experiences no preference or instrumental pressure to interfere with attempts by the programmers or operators to modify the agent, impede its operation, or halt its execution.
But obviously Google’s AI is not going to allow a user to “modify the agent, impede its operation, or halt its execution”. Why use “corrigible” here instead of different language altogether, like “helpful to the extent allowed by Google policies”?
(Also, just in case, is there a difference between “corrigible to” and “corrigible by”?)
No. I was just saying “corrigible by” originally because that seems more grammatical, and sometimes saying “corrigible to” because it seems more natural. Probably “to” is better.