I’m so glad to see this published!
I think by “corrigibility” here you mean: an agent whose goal is to do what their principal wants. Their goal is basically a pointer to someone else’s goal.
This is a bit counter-intuitive, both because no human has this goal and because, unlike the consequentialist, state-of-the-world goals we usually discuss, this goal can and will change over time.
Despite being counter-intuitive, this all seems logically consistent to me.
The key insight here is that corrigibility is consistent and seems workable IF it’s the primary goal. Corrigibility is unnatural if the agent has consequentialist goals that take precedence over being corrigible.
I’ve been trying to work through a similar proposal: instruction-following, or do-what-I-mean, as the primary goal for AGI. It’s different, but I think most of the strengths and weaknesses are the same relative to other alignment proposals. I’m not sure myself which is the better idea. I have been focusing on the instruction-following variant because I think it’s the more likely plan, whether or not it’s a good one; it seems likely to be the default alignment plan for language-model-agent AGI efforts in the near future. That approach might not work, but assuming it won’t seems like a huge mistake for the alignment project.