Noting that I think this is not “perfect” AGI alignment, and I think it’s actually a pretty horrible outcome.
Whether it’s horrible or not depends on the X, right? Which I didn’t really define, and which is a whole can of worms (reward misgeneralization, use/mention distinction, etc.). (Like, if we send RL reward based on stock market price, is claim 2 talking about “trained model explicitly wants high stock price”, or “trained model explicitly wants high reward function output”, or “we get to define X to be either of those and many more things besides”? I didn’t really specify.)
But yeah, for X’s that we can plausibly operationalize into a traditional RL reward function†, I’m inclined to say that we probably don’t want an AI that explicitly wants X. I think we’re in agreement on that.
† (as opposed to weirder things like “the reward depends in part on the AI’s thought process, and not just its outputs”)
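To make the footnote’s distinction concrete, here’s a minimal sketch (the function names, the keyword check, and the penalty value are all made up for illustration, not anything from the discussion above): a “traditional” reward function is a function of outcomes/outputs only, whereas the “weirder” kind also takes the AI’s thought process as an input.

```python
def outcome_reward(stock_price: float) -> float:
    # "Traditional" RL reward in the sense of the footnote: it depends
    # only on externally observable outcomes (here, hypothetically, a
    # stock price), not on anything about how the AI decided to act.
    return stock_price


def process_dependent_reward(stock_price: float, thought_trace: list[str]) -> float:
    # The "weirder" alternative the footnote gestures at: the reward
    # also inspects the AI's thought process (represented here, purely
    # for illustration, as a list of chain-of-thought strings).
    # The keyword check and the -10.0 penalty are made-up placeholders.
    penalized = any("manipulate the market" in step for step in thought_trace)
    return stock_price - (10.0 if penalized else 0.0)


# Same outcome, different reward once the thought process counts:
print(outcome_reward(120.0))                                       # 120.0
print(process_dependent_reward(120.0, ["manipulate the market"]))  # 110.0
```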