I’ve also seen this conflation of evaluation functions with goals, and came to a similar conclusion about its validity.
I do think the instrumental convergence thesis is still very much a danger. If there are any goals, for sufficiently powerful systems it seems plausible that instrumental convergence still applies to them, and separation of goals from evaluation just means that they’re more opaque to us. Opacity of AI goals is not great from a human point of view.
In practice the evaluation function matters even less than most such discussions imply. In many current systems, only the gradient of the evaluation function matters at all, not its actual value. That may seem like an implementation detail, but there is a more fundamental point: no optimization process can maximize every evaluation function. Every training process (even an uncomputable one like Solomonoff induction) that takes an evaluation function as input will fail badly to maximize some functions for arbitrarily many inputs.
So for any nontrivial evaluation function, we should expect that the outputs of any system trying to match it will not be those that maximize it, no matter how superhuman its programming might be.
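To make the "only the gradient matters" point concrete, here is a minimal sketch of my own (not from the original discussion): gradient descent on a loss function and on that same loss plus an arbitrary constant follows exactly the same trajectory, even though the evaluation values reported by the two functions are wildly different.

```python
# Sketch: gradient descent never "sees" the value of the evaluation
# function, only its gradient, so a constant offset changes nothing.

def grad(f, x, eps=1e-6):
    """Numerical gradient of a scalar function f at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def descend(f, x0, lr=0.1, steps=50):
    """Plain gradient descent from x0."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(f, x)
    return x

loss = lambda x: (x - 3.0) ** 2            # original evaluation function
shifted = lambda x: (x - 3.0) ** 2 + 100.0 # same gradients, very different values

print(descend(loss, 0.0))     # ~3.0
print(descend(shifted, 0.0))  # identical trajectory, also ~3.0
```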
I do think the instrumental convergence thesis is still very much a danger. If there are any goals, for sufficiently powerful systems it seems plausible that instrumental convergence still applies to them, and separation of goals from evaluation just means that they’re more opaque to us.
I’m not certain about this. In the one existing example of inner vs outer alignment that we have (human evolution), the inner goals are surprisingly weak: we are not prepared to achieve them at all costs. Humans like sex, but most wouldn’t be prepared to take over the world just to get more sex. Perhaps this is chance, but perhaps it is actually fundamental: because the inner goals are not aligned with the outer goals, the outer optimizer has an incentive to keep them from becoming too powerful.
This is exactly the problem that the post describes: the outer evaluation function is simply based on reproduction of genes (which requires sex), but the corresponding inner goal landscape is very different. If the (outer) evaluation function were actually the aim of humans, every man would take extreme actions to try to impregnate every woman on the planet.
There are quite a few examples of humans whose goals do lead them to try to take over the world. They usually have this as an instrumental goal in service of something only vaguely related to the evaluation function. They almost universally fail, but that’s more a matter of the upper bound on variation in human capability than of any lack of willingness to pursue such goals.
If humans in reality had an exponential scale of individual capability, as in some fiction and games, I would expect to see a lot more successful world takeovers. Likewise, I wouldn’t expect a superintelligence to have the same bounds on capability as current humans. Even a comparatively weak goal that, merely as a side effect, outweighs everything humans prefer about the world would be plenty bad enough.