Both are possible. For theoretical examples, see the stamp collector for consequentialist AI and AIXI for reward-maximizing AI.
What kind of AI are the AIs we have now? Neither; they’re not particularly strong maximizers. (If they were, we’d be dead; it’s not that difficult to turn a powerful reward maximizer into a world-ending AI.)
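To make that distinction concrete, here’s a toy sketch in Python. Everything in it is invented for illustration; it isn’t a claim about how the stamp collector or AIXI is actually specified. The only difference between the two agents is which quantity they score: one scores a feature of the world, the other scores its own reward channel, so tampering with the channel looks like the best option to it.

```python
# Toy sketch only -- made-up actions and numbers, not a description of any
# real system. A consequentialist maximizer scores outcomes by a quantity in
# the world (stamps), while a reward maximizer scores them by its own reward
# signal, so spoofing that signal looks like the best "outcome" to it.

OUTCOMES = {
    "collect_stamps":      {"stamps_in_world": 100, "reward_signal": 100},
    "do_nothing":          {"stamps_in_world": 0,   "reward_signal": 0},
    "hack_reward_channel": {"stamps_in_world": 0,   "reward_signal": 10**9},
}

def consequentialist_pick(outcomes):
    # Maximizes the world-state quantity it was built to care about.
    return max(outcomes, key=lambda a: outcomes[a]["stamps_in_world"])

def reward_maximizer_pick(outcomes):
    # Maximizes whatever its reward channel will report.
    return max(outcomes, key=lambda a: outcomes[a]["reward_signal"])

print(consequentialist_pick(OUTCOMES))   # -> collect_stamps
print(reward_maximizer_pick(OUTCOMES))   # -> hack_reward_channel
```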
If the former (consequentialist AI), I think this makes alignment much easier. As long as you can reasonably represent “do not kill everyone”, you can make this a goal of the AI, and then it will literally care about not killing everyone; it won’t just care about hacking its reward system so that it never perceives everyone being dead.
This would be true, except:
We don’t know how to represent “do not kill everyone”
We don’t know how to pick which quantity a would-be strong consequentialist maximizer should maximize
We don’t know what a strong consequentialist maximizer would look like, because we don’t have one around (if we did, we’d be dead)
We don’t know how to pick which quantity a would-be strong consequentialist maximizer should maximize
Yeah, so I think this is the crux of it. My point is that if we find some training approach that leads to a model that cares about the world itself rather than hacking some reward function, that’s a sign that we can in fact guide the model in important ways, and there’s a good chance this includes being able to tell it not to kill everyone.
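One hedged way to picture what such a sign could look like (this probe, its action names, and its scores are all hypothetical, not an established evaluation): offer a trained policy a choice between an action that changes the world and an action that only spoofs its own reward signal, and see which it prefers.

```python
# Hypothetical probe, not an established evaluation: compare a policy's
# preference for a world-changing action against a reward-spoofing action.
# A policy that consistently prefers the world-changing action is at least
# behaving as if it cares about the world rather than the reward channel.

def prefers_world_to_wireheading(policy_score,
                                 world_action="help_humans",
                                 wirehead_action="spoof_reward"):
    """policy_score maps an action name to the policy's preference for it."""
    return policy_score(world_action) > policy_score(wirehead_action)

# Two made-up stand-ins for trained models:
world_caring_policy = {"help_humans": 0.9, "spoof_reward": 0.2}.get
reward_hacking_policy = {"help_humans": 0.3, "spoof_reward": 0.95}.get

print(prefers_world_to_wireheading(world_caring_policy))    # True
print(prefers_world_to_wireheading(reward_hacking_policy))  # False
```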
We don’t know what a strong consequentialist maximizer would look like, because we don’t have one around (if we did, we’d be dead)
This is just a way of saying “we don’t know what AGI would do”. I don’t think this point pushes us toward x-risk any more than it pushes us toward not-x-risk.
We don’t know how to represent “do not kill everyone”
I think this goes to Matthew Barnett’s recent article arguing that actually, yes, we do. And regardless, I don’t think this point is a big part of Eliezer’s argument. https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument