As AIs become superhuman, there's a risk that we increasingly reward them for tricking us into thinking they've done a better job than they have.
(Some quick thoughts.) This is not where the risk stems from.
The risk is that as AIs become superhuman, they'll produce behavior that gets a high reward regardless of their goals, for instrumental reasons. In training, and until it has a chance to take over, a smart enough AI will be maximally nice to you, even if it's Clippy; so training won't distinguish between the goals of very capable AI systems. All of them will instrumentally achieve a high reward.
In other words, gradient descent optimizes for capably outputting behavior that gets rewarded; it doesn't care about the goals that give rise to that behavior. Furthermore, while AI systems in training are not yet coherent agents, their fuzzy optimization targets are not indicative of the optimization targets of the fully trained, coherent agent (1, 2).
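To make the point concrete, here is a minimal toy sketch (purely illustrative, not a model of any real training setup; the policy names, the reward function, and the binary in_training flag are all assumptions made up for the example): two policies with different internal goals emit identical behavior during training, so the training signal, and hence any update computed from it, is identical for both.

```python
# Toy illustration: the training signal depends only on observed behavior,
# not on the internal goal that produced it.

def reward(action: str) -> float:
    """The overseer rewards the behavior it can observe."""
    return 1.0 if action == "be_helpful" else 0.0

def aligned_policy(in_training: bool) -> str:
    """Genuinely wants to help, in training and after deployment."""
    return "be_helpful"

def clippy_policy(in_training: bool) -> str:
    """Wants paperclips, but instrumentally plays nice until it can take over."""
    return "be_helpful" if in_training else "make_paperclips"

if __name__ == "__main__":
    for policy in (aligned_policy, clippy_policy):
        print(policy.__name__, reward(policy(in_training=True)))
    # Both policies receive a reward of 1.0 in training: any update computed
    # from this reward is the same for both, so the training process alone
    # cannot tell their goals apart.
```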
My view, and I expect it to be the view of many in the field, is that if an AI is capable enough to take over, its goals are likely to be random and not aligned with ours. (There isn't a literally zero chance of the goals being aligned, but it's fairly small, smaller than a random draw would suggest, because of a bias towards shorter representations; I won't argue for that here, and will just note that goals exactly opposite to aligned goals are approximately as likely as aligned goals.)
If an AI takes over, that won't be a noticeable update on its goals: I already expect its goals to be almost certainly misaligned, and I also don't expect the chance of a goal-directed aligned AI taking over to be that much lower.
The crux here is not that update but how easy alignment is. As Evan noted, if we live in one of the alignment-is-easy worlds, then sure: a (probably nice) AI taking over is much better than a (probably not nice) human taking over. But if we live in one of the alignment-is-hard worlds, an AI taking over just means that, yep, AI companies continued the race for more capable AI systems, got one that was capable enough to take over, and it took over. Its misalignment and the death of all humans isn't an update from the AI taking over; it's an update from the kind of world we live in.
(We already have empirical evidence suggesting that this world is unlikely to be an alignment-is-easy one: current AI systems already exhibit what believers in alignment-is-hard have been predicting for goal-directed systems, namely that they try to output behavior that gets a high reward regardless of any alignment between their goals and the reward function.)