The case for coherent preferences is that eventually AGIs would want to do something about the cosmic endowment, and the most efficient way of handling that is strong optimization toward a coherent goal, without worrying about goodharting. So no matter how early AGIs think, formulating coherent preferences is also a convergent instrumental drive.
At the same time, if coherent preferences are only arrived at later, that privileges certain shapes for the process that formulates them, which might make them non-arbitrary in ways relevant to humanity’s survival.
Do you mean “most AI systems that don’t initially have coherent preferences will eventually self-modify / evolve / follow some other process and become agents with coherent preferences”?
Somewhat. They are less likely to become/self-modify than to simply build agents with coherent preferences distinct from their builders, for the purpose of efficiently managing the resources. But it’s those agents with coherent preferences that get to manage everything, so they are what matters for what happens. And if they tend to be built in particular convergent ways, perhaps arbitrary details of their builders are not as relevant to what happens.
“Without worrying about goodharting” and “the most efficient way of handling that is with strong optimization …” come after you have coherent preferences, not before.
That’s the argument for formulating the requisite coherent preferences: they are needed to perform strong optimization. And you want strong optimization because you have all this stuff lying around unoptimized.