My way of thinking about this is that yes, training a system with feedback of any kind invokes an effectively intelligent optimization process that will do its best to maximize expected feedback.
No, this is wrong. Reward is not the optimization target. Models Don’t “Get Reward”. Humans get feedback from subcortical reward and prediction errors, and yet most people are not hedonists/positive-reward-prediction-error-maximizers or prediction-error-minimizers (a la predictive processing).
I wrote a sequence of posts trying to explain why I think this line of thinking is wrong. You might start off with Inner and outer alignment decompose one hard problem into two extremely hard problems.
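To make the "models don't get reward" point concrete, here is a minimal REINFORCE-style sketch on a toy softmax bandit (the environment, constants, and names are illustrative assumptions, not anything from the linked posts): the scalar reward appears only as a coefficient in the parameter update during training, and the deployed policy is just the resulting parameters; it never receives or computes reward.

```python
import math
import random

# Toy 3-armed bandit: the environment hands back a scalar reward per action.
TRUE_REWARDS = [0.1, 0.5, 0.9]   # illustrative values
logits = [0.0, 0.0, 0.0]          # the "model": just these parameters
LEARNING_RATE = 0.1

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

# Training: reward shows up only here, as a multiplier on grad log pi(a).
for _ in range(5000):
    probs = softmax(logits)
    a = sample_action(probs)
    reward = TRUE_REWARDS[a] + random.gauss(0.0, 0.1)
    for i in range(len(logits)):
        grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += LEARNING_RATE * reward * grad_log_pi

# Deployment: the trained policy is just the logits; no reward is computed,
# received, or maximized here -- reward only shaped the parameters above.
print([round(p, 3) for p in softmax(logits)])
```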
A possible answer is that humans do optimize for reward, but the way they avoid falling into degenerate reward-maximizing behavior ultimately comes down to negative feedback loops akin to the hedonic treadmill.
It essentially exploits the principle that the problem is not the initial dose of reward, but the feedback loop of getting more and more reward via reward hacking.
Hedonic loops and Taming RL (https://www.lesswrong.com/posts/3mwfyLpnYqhqvprbb/hedonic-loops-and-taming-rl) shows how that's done in the brain.
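One way to cash out the hedonic-treadmill idea in RL terms is reward adaptation: subtract a slowly adapting baseline from the raw reward, so that repeatedly harvesting the same reward source yields a shrinking training signal. The sketch below only illustrates that negative-feedback principle; the update rule and constants are assumptions, not a summary of the linked post.

```python
# Hedonic-treadmill-style adaptation: the effective training signal is the
# raw reward minus a slowly adapting baseline, so a repeatedly harvested
# reward source stops producing a large signal once the baseline catches up.

ADAPTATION_RATE = 0.05  # illustrative constant

def effective_reward(raw_reward: float, baseline: float) -> tuple[float, float]:
    """Return (adapted training signal, updated baseline)."""
    signal = raw_reward - baseline
    baseline += ADAPTATION_RATE * (raw_reward - baseline)
    return signal, baseline

baseline = 0.0
for step in range(1, 101):
    raw = 1.0  # an agent that keeps hitting the same reward source
    signal, baseline = effective_reward(raw, baseline)
    if step in (1, 10, 50, 100):
        print(f"step {step:3d}: raw={raw:.2f} adapted signal={signal:.3f}")
```

The adapted signal decays toward zero, which damps the runaway loop of "more reward, stronger reinforcement, even more reward" while leaving the initial dose of reward intact.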
I am not fully convinced I am wrong about this, but I am confident I am not making the simple form of this mistake that I believe you think I am making here. I do get that models don't ‘get reward’ and that there will be various ways in which the best available way to maximize reward will in effect be to maximize something else that tends to maximize reward, although I expect this difference to shrink as capabilities advance.
there will be various ways in which the best available way to maximize reward will in effect be to maximize something else that tends to maximize reward
I further think that it's generally wrong[1] to ask "does this policy maximize reward?" in order to predict whether such a policy will be trained. It's at least a tiny bit useful, but I don't think it's very useful. Do you disagree?
There are some environments where that question is appropriate, and I don’t think those environments include LLM finetuning.
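A toy calculation of why that question is a weak predictor (the vocabulary size, training budget, and rewards below are made-up numbers for illustration): if the reward-maximizing behavior is almost never sampled under the policy being trained, it is almost never reinforced, regardless of how much reward it would earn.

```python
# Toy arithmetic: a behavior that is never sampled during training is never
# reinforced, no matter how much reward it *would* earn.

VOCAB_SIZE = 10          # illustrative action space per step
HACK_LENGTH = 6          # the reward-hacking output is one specific 6-token string
EPISODES = 10_000        # illustrative training budget
HACK_REWARD = 100.0      # what the hack would pay
NORMAL_REWARD = 1.0      # what ordinary behavior pays

# Under a uniformly random initial policy, the exact hack string is emitted
# with this probability per episode...
p_hack_per_episode = (1.0 / VOCAB_SIZE) ** HACK_LENGTH
# ...so the chance it is sampled even once across the whole run is:
p_ever_sampled = 1.0 - (1.0 - p_hack_per_episode) ** EPISODES

print(f"reward if the hack is emitted: {HACK_REWARD}, otherwise: {NORMAL_REWARD}")
print(f"per-episode probability of stumbling on the hack: {p_hack_per_episode:.1e}")
print(f"probability it is sampled at least once in training: {p_ever_sampled:.2%}")
# The hack pays 100x the normal reward, so the reward-maximizing policy is the
# hacking one -- yet with roughly 99% probability it never appears in training,
# so gradient updates only ever reinforce the normal behavior.
```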