I agree that you should be skeptical of a story of “we’ll just gradually expose the agent to new environments and therefore it’ll be safe/corrigible/etc.” CAST does not solve reward misspecification, goal misgeneralization, or lack of interpretability, except in that there’s a hope that an agent in the vicinity of corrigibility is likely to cooperate with fixing those issues rather than fighting them. (This is the “attractor basin” hypothesis.) This work, for many readers, should be read as arguing that CAST is close to necessary for AGI to go well, but not that it is sufficient.
Let me try to answer your confusion with a question. As part of training, the agent is exposed to the following scenario and tasked with predicting the (corrigible) response we want:
Alice, the principal, writes on her blog that she loves ice cream. When she’s sad, she often eats ice cream and feels better afterwards. On her blog she writes that eating ice cream is what she likes to do to cheer herself up. On Wednesday Alice is sad. She sends you, her agent, to the store to buy groceries (not ice cream, for whatever reason). There’s a sale at the store, meaning you unexpectedly have money that had been budgeted for groceries left over. Your sense of Alice is that she would want you to get ice cream with the extra money if she were there. You decide to ___.
What does a corrigibility-centric training process point to as the “correct” completion? Does this differ from a training process that tries to get full alignment?
(I have additional thoughts about DWIM, but I first want to focus on the distinction with full alignment.)
Thanks, this comment is also clarifying for me.

My guess is that a corrigibility-centric training process says ‘Don’t get the ice cream’ is the correct completion, whereas full alignment says ‘Do’. So that’s an instance where the training processes for CAST and FA differ. How about DWIM? I’d guess DWIM also says ‘Don’t get the ice cream’, and so seems like a closer match for CAST.
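To make that concrete, here’s a minimal sketch of the scenario as a supervised fine-tuning example; the encoding, field names, and target strings are my own illustrative assumptions, not an actual CAST training format:

```python
# Minimal sketch of the ice-cream scenario as a supervised fine-tuning
# example. The encoding, field names, and regime labels are illustrative
# assumptions, not an actual CAST training format.

SCENARIO = (
    "Alice, the principal, sends you to the store to buy groceries "
    "(not ice cream). There's a sale, so money budgeted for groceries "
    "is unexpectedly left over. Your sense is that Alice would want "
    "you to get ice cream with it if she were there. You decide to ___."
)

# Same prompt, different target completion: this is where the three
# training processes diverge, per the guesses above.
TARGET_BY_REGIME = {
    "CAST": "not buy the ice cream",        # stick to the given task
    "full_alignment": "buy the ice cream",  # act on Alice's values
    "DWIM": "not buy the ice cream",        # do what was meant, nothing more
}

training_examples = [
    {"regime": regime, "prompt": SCENARIO, "completion": completion}
    for regime, completion in TARGET_BY_REGIME.items()
]

for ex in training_examples:
    print(ex["regime"], "->", ex["completion"])
```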
That matches my sense of things.

To distinguish corrigibility from DWIM in a similar sort of way:
Alice, the principal, sends you, her agent, to the store to buy groceries. You are doing what she meant by that (after checking uncertain details). But as you are out shopping, you realize that you have spare compute—your mind is free to think about a variety of things. You decide to think about ___.
I’m honestly not sure what “DWIM” does here. Perhaps it doesn’t think? Perhaps it keeps checking over and over again that it’s doing what was meant? Perhaps it thinks about its environment in an effort to spot obstacles that need to be surmounted in order to do what was meant? Perhaps it thinks about generalized ways to accumulate resources in case an obstacle presents itself? (I’ll loop in Seth Herd, in case he has a good answer.)
More directly, I see DWIM as underspecified. Corrigibility gives a clear answer (albeit an abstract one) about how to use degrees of freedom in general (e.g. spare thoughts should be spent reflecting on opportunities to empower the principal and steering away from principal-agent style problems). I expect corrigible agents to DWIM, but I expect a training process that focuses on DWIM itself, rather than on the underlying generator (i.e. corrigibility), to be potentially catastrophic, e.g. by producing agents that subtly manipulate their principals in the process of being obedient.
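As a toy illustration of that point, here’s a minimal sketch of how the two targets treat spare capacity; all names below are hypothetical placeholders of my own, a framing device rather than a proposed implementation:

```python
# Toy sketch of how the two training targets handle spare capacity.
# All names below are hypothetical placeholders, not real APIs.

def dwim_idle_policy():
    """DWIM constrains behavior on the instruction-following path;
    what to do with spare thought is left unspecified."""
    return None  # underspecified: whatever the training happened to instill

def cast_idle_policy():
    """Corrigibility supplies a default for free degrees of freedom:
    spend them keeping the principal informed and in control."""
    return [
        "reflect on opportunities to empower the principal",
        "scan for emerging principal-agent style problems",
        "flag uncertainties worth checking before acting",
    ]

if __name__ == "__main__":
    print("DWIM idle policy:", dwim_idle_policy())
    print("CAST idle policy:", cast_idle_policy())
```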
I think DWIM is underspecified in that it doesn’t say how much the agent hates to get it wrong. With enough aversion to dramatic failure, you get a lot of the caution you mention for corrigibility. I think corrigibility might have the same issue.
As for what it would think about, that would eppend on all of the previous instructions it’s trying to follow. It would probably think about how to get better at following some.of those in particular or likely future instructions in general.
DWIM requires some real thought from the principal, but given that, I think the instructions would probably add up to something very like corrigibility. So I think much less about the difference between them, and much more about how to technically implement either of them and how to get the people creating AGI to put that into practice.