The technical definition of corrigibility being used here is as follows: “We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences.”
And yes, the basic idea is to make it so that the agent can be corrected by its operators after instantiation.
I think it matters what KIND of correction you’re considering. If there’s a term in the agent’s utility function to understand and work toward things that humans (or specific humans) value, you could make a correction either by altering the weights or other terms of the utility function, or by a simple knowledge update.
Those feel very different. Are both required for “corrigibility”?
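To make the distinction concrete, here is a minimal toy sketch (my own illustration, with hypothetical names and weights, not anyone’s actual proposal): a “value correction” edits the weights of the utility function itself, while a “knowledge correction” leaves the utility function alone and only updates the agent’s model of what humans value.

```python
# Toy agent whose utility is a weighted sum of terms, one of which is a
# learned model of what humans value. All names and numbers are illustrative.

class ToyAgent:
    def __init__(self):
        # Terms of the utility function and their weights.
        self.weights = {"paperclips": 0.3, "human_values": 0.7}
        # The agent's current *beliefs* about what humans value.
        self.human_value_model = {"art": 0.5, "safety": 0.5}

    def utility(self, outcome):
        # `outcome` maps features (e.g. "paperclips", "art") to how much of each occurs.
        paperclip_term = outcome.get("paperclips", 0.0)
        human_term = sum(w * outcome.get(feature, 0.0)
                         for feature, w in self.human_value_model.items())
        return (self.weights["paperclips"] * paperclip_term
                + self.weights["human_values"] * human_term)

    # Correction of the first kind: alter the utility function's weights/terms.
    def correct_values(self, new_weights):
        self.weights = new_weights

    # Correction of the second kind: a simple knowledge update about what
    # humans value, with the utility function itself untouched.
    def correct_knowledge(self, new_model):
        self.human_value_model = new_model
```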
The “If there’s a term in the agent’s utility function to … work toward things that humans … value” part is the hard part. If you can figure out how to make it truly care what its operator wants, you’ve already solved a huge problem.
An agent would have to be corrigible even if you couldn’t manage to make it care explicitly what its operator wants. We need some way of taking agents that explicitly don’t care what their operators want, and making them not stop their operators from turning them off, despite the default incentives to prevent interference.
I’m not following. I think your definition of “care” is confusing me.
If you want an agent to care (have a term in its utility function) what you want, and if you can control its values, then you should just make it care what you want, not make it NOT care and then fix it later.
There is a very big gap between “I want it to care what I want, but I don’t yet know what I want so I need to be able to tell it what I want later and have it believe me” and “I want it not to care what I want but I want to later change my mind and force it to care what I want”.
“Just care what I want” is a separate, unsolved research problem. Corrigibility is an attempt to get an agent to simply not immediately kill its user even if it doesn’t necessarily have a good model of what that user wants.
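Roughly, the hope is something like the following toy sketch (a hypothetical interface of my own, not the actual construction under discussion): the operator’s intervention is honored unconditionally rather than traded off against whatever the underlying policy, with its possibly wrong model of the user, happens to want.

```python
# Toy contrast: corrigibility as "accept the operator's intervention even when
# your model says otherwise", as opposed to having a correct model to begin with.

class CorrigibleWrapper:
    def __init__(self, base_policy):
        self.base_policy = base_policy     # may have an arbitrarily wrong model of the user
        self.shutdown_requested = False

    def operator_shutdown(self):
        # The corrigibility property: this signal is honored unconditionally,
        # rather than weighed against whatever the base policy prefers.
        self.shutdown_requested = True

    def act(self, observation):
        if self.shutdown_requested:
            return "no_op"                 # stand down instead of resisting
        return self.base_policy(observation)
```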
“Don’t kill an operator” seems like something that can more easily be encoded into an agent than “allow operators to correct things they consider undesirable when they notice them”.
In fact, even a perfectly corrigible agent with such a glaring initial flaw might kill the operator(s) before they can apply the corrections, not because it is resisting correction, but simply because doing so furthers whatever other goals it may have.
You’re exactly right, I think. IMO it may actually be easier to build an AI that can learn to want what some target agent wants, than to build an AI that lets itself be interfered with by some operator whose goals don’t align with its own current goals.