My guess is that a corrigibility-centric training process says ‘Don’t get the ice cream’ is the correct completion, whereas full alignment says ‘Do’. So that’s an instance where the training processes for CAST and FA differ. How about DWIM? I’d guess DWIM also says ‘Don’t get the ice cream’, and so seems like a closer match for CAST.
To distinguish corrigibility from DWIM in a similar sort of way:
Alice, the principal, sends you, her agent, to the store to buy groceries. You are doing what she meant by that (after checking uncertain details). But as you are out shopping, you realize that you have spare compute—your mind is free to think about a variety of things. You decide to think about ___.
I’m honestly not sure what “DWIM” does here. Perhaps it doesn’t think? Perhaps it keeps checking over and over again that it’s doing what was meant? Perhaps it thinks about its environment in an effort to spot obstacles that need to be surmounted in order to do what was meant? Perhaps it thinks about generalized ways to accumulate resources in case an obstacle presents itself? (I’ll loop in Seth Herd, in case he has a good answer.)
More directly, I see DWIM as underspecified. Corrigibility gives a clear answer (albeit an abstract one) about how to use degrees of freedom in general (e.g. spare thoughts should be spent reflecting on opportunities to empower the principal and steer away from principal-agent style problems). I expect corrigible agents to DWIM, but I expect a training process that focuses on that, rather than on the underlying generator (i.e. corrigibility), to be potentially catastrophic, e.g. by producing agents that subtly manipulate their principals in the process of being obedient.
I think DWIM is underspecified in that it doesn’t say how much the agent hates to get it wrong. With enough aversion to dramatic failure, you get a lot of the caution you mention for corrigibility. I think corrigibility might have the same issue.
As for what it would think about, that would depend on all of the previous instructions it’s trying to follow. It would probably think about how to get better at following some of those in particular or likely future instructions in general.
DWIM requires some real thought from the principal, but given that, I think the instructions would probably add up to something very like corrigibility. So I think much less about the difference between them and much more about how to technically implement either of them, and get the people creating AGI to put it into practice.
Thanks, this comment is also clarifying for me.
My guess is that a corrigibility-centric training process says ‘Don’t get the ice cream’ is the correct completion, whereas full alignment says ‘Do’. So that’s an instance where the training processes for CAST and FA differ. How about DWIM? I’d guess DWIM also says ‘Don’t get the ice cream’, and so seems like a closer match for CAST.
That matches my sense of things.