I do think the instrumental convergence thesis is still very much a danger. If a sufficiently powerful system has goals of any kind, it seems plausible that instrumental convergence still applies to them, and the separation of goals from evaluation just means that those goals are more opaque to us.
I’m not certain about this. In the one existing example of inner vs. outer alignment we have (human evolution), the inner goals are surprisingly weak: we are not prepared to achieve them at all costs. Humans like sex, but most wouldn’t be prepared to take over the world just to get more sex. Perhaps this is chance, but perhaps it is actually fundamental: because the inner goals are not aligned with the outer goals, the selection process has an incentive to keep them from becoming too powerful.
This is exactly the problem that the post describes: the outer evaluation function is based simply on reproduction of genes (which requires sex), but the corresponding inner goal landscape is very different. If the (outer) evaluation function were actually the aim of humans, every man would take extreme actions to try to impregnate every woman on the planet.
There are quite a few examples of humans whose goals do lead them to try to take over the world. They usually pursue this as an instrumental goal in service of something only vaguely related to the evaluation function. They almost universally fail, but that’s more a matter of the upper bound on variation in human capability than of any lack of willingness to pursue such goals.
If humans in reality had the kind of exponential scale of individual capability found in some fiction and games, I would expect to see a lot more successful world takeovers. Likewise, I wouldn’t expect SI to have the same bounds on capability as current humans. Even a comparatively weak goal that, merely as a side effect of being pursued, outweighs everything humans prefer about the world would be plenty bad enough.