We don’t know how to pick which quantity would be maximized by a would-be strong consequentialist maximizer
Yeah, so I think this is the crux of it. My point is that if we find some training approach that leads to a model that cares about the world itself rather than hacking some reward function, that’s a sign that we can in fact guide the model in important ways, and there’s a good chance this includes being able to tell it not to kill everyone.
We don’t know what a strong consequentialist maximizer would look like, if we had one around, because we don’t have one around (because if we did, we’d be dead)
This is just a way of saying “we don’t know what AGI would do”. I don’t think this point pushes us toward x-risk any more than it pushes us toward not-x-risk.
I think this goes to Matthew Barnett’s recent article arguing that actually, yes, we do. And regardless, I don’t think this point is a big part of Eliezer’s argument. https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument