Corrigibility or DWIM is an attractive primary goal for AGI
While rereading the List of Lethalities (LoL), I found the argument against corrigibility compelling. It’s really hard to specify a goal like “maximize X, except if someone tells you to shut down”. I think the same argument applies to Christiano’s approach of achieving corrigibility through RL by rewarding correlates of corrigibility: if other behaviors are rewarded more reliably, you may not get your AGI to shut down when you need it to.
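To make that concern concrete, here is a toy back-of-the-envelope illustration. All of the numbers are invented for the sake of the example, not drawn from any actual training setup; the point is only the structure of the incentive, not the specific values.

```python
# Toy illustration with invented numbers: compare the expected return of a policy
# that complies with shutdown requests against one that ignores them, when
# compliance is a rarely and weakly rewarded correlate among frequent task rewards.

p_request = 0.01   # fraction of training episodes containing a shutdown request
r_task = 1.0       # reward per completed task step
r_comply = 0.5     # one-off reward for complying with a shutdown request
steps = 100        # task steps per episode

# Policy A: complies, forfeiting the remaining task reward in those episodes
# (assume the request arrives halfway through the episode on average).
return_comply = (1 - p_request) * r_task * steps + p_request * (r_task * steps / 2 + r_comply)

# Policy B: ignores shutdown requests and keeps collecting task reward.
return_ignore = r_task * steps

print(f"expected return, compliant policy:  {return_comply:.2f}")
print(f"expected return, ignoring shutdown: {return_ignore:.2f}")
# With these made-up numbers, ignoring shutdown scores higher (~100 vs ~99.5),
# so reward alone does not select for reliable shutdown behavior.
```

A correlate that is rewarded rarely and weakly loses to whatever is rewarded often and strongly; that is the shape of the worry, independent of the particular numbers.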
But those arguments don’t apply if corrigibility in the broad sense is the primary goal. “Doing what this guy means by what he says” is a perfectly coherent goal, and a highly attractive one, for a few reasons. Perhaps corrigibility shouldn’t be used in this sense, and do-what-I-mean (DWIM) is a better term; but the two are closely related. DWIM accomplishes corrigibility, and has other advantages besides. I think it’s fairly likely to be the first goal someone actually gives an AGI.
“Do what I mean” sidesteps the difficulty of outer alignment, which is another point in the LoL. One common plan, which seems sensible, is to keep humans in the loop: to have a Long Reflection to decide what we want. DWIM allows you to contemplate and change your mind as much as you like.
Of course, the problem here is: do what WHO means? We’d like an AGI that serves all of humanity, not just one guy or a board of directors. And we’d like to avoid power struggles.
But from the point of view of a team actually deciding what goal to give their shot at AGI, DWIM will be incredibly attractive for practical reasons. The outer alignment problem is hard. Specifying one person (or a few) to take instructions from is vastly simpler than deciding on, and specifying, a goal that captures all of human flourishing for all time, and you don’t want to have to trust an AGI to interpret that grand goal correctly. Interpreting DWIM is still fraught, but it is naturally self-correcting, and it becomes more useful as the AGI gets more capable: a smarter AGI will be better at understanding what you probably mean, and better at realizing when it isn’t sure what you mean, so it can ask for clarification.
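As a purely illustrative sketch of that last point, the DWIM loop is just: interpret the instruction, ask for clarification while uncertain, and only then act. Everything below is made up for illustration; the function names, the stand-in intent model, and the confidence threshold are not drawn from any existing system or proposal.

```python
from dataclasses import dataclass

@dataclass
class Interpretation:
    intent: str
    confidence: float  # how sure the system is that this is what was meant

def interpret(instruction: str, clarifications: list[str]) -> Interpretation:
    # Stand-in for an actual intent model; here, more clarification -> more confidence.
    detail = f" ({'; '.join(clarifications)})" if clarifications else ""
    return Interpretation(intent=instruction + detail,
                          confidence=min(1.0, 0.5 + 0.3 * len(clarifications)))

def do_what_i_mean(instruction: str, ask_principal, threshold: float = 0.75) -> str:
    """Act only when confident about what the principal means; otherwise ask."""
    clarifications: list[str] = []
    interpretation = interpret(instruction, clarifications)
    while interpretation.confidence < threshold:
        answer = ask_principal(f"Did you mean: {interpretation.intent!r}?")
        clarifications.append(answer)
        interpretation = interpret(instruction, clarifications)
    return f"executing: {interpretation.intent}"

# The 'principal' here is just a function that answers clarifying questions.
print(do_what_i_mean("tidy up the lab", lambda q: "yes, but don't touch the samples"))
```

Nothing about the loop itself has to change as the underlying model improves; a better interpreter and a better-calibrated confidence estimate just make it ask less often and act more accurately, which is the self-correcting property described above.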
This doesn’t at all address inner alignment. But when somebody thinks they have good-enough inner alignment to launch a goal-directed, sapient AGI, DWIM is likely to be the goal they’ll choose. This could be good or bad, depending on how well they’ve implemented inner alignment, and what type of people they are.