Outer alignment is (if you read a couple more sentences of the definition) not about “how to decide what we want”, but “how do we ensure that the reward/utility function we write down matches what we want”. So “Do What We Mean” is a magical solution to the Outer Alignment problem, but if your AI then tells you “You-all don’t know what you mean” or “Which definition of ‘we’ did you mean?”, then you have a goalcraft problem.