I love this question! As it happens, I have a rough draft for a post titled something like “Reward is the optimization target for smart RL agents”.
TLDR: I think the claim that reward is not the optimization target is true for some AI systems, but not likely true for any RL-directed AGI systems whose safety we should really worry about. They’ll optimize for maximum reward even more than humans do, unless they’re very carefully built to avoid that behavior.
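In the final comment on the second thread you linked, TurnTrout says of his Reward is not the optimization target:
“However, I should have stated up-front: This post addresses model-free policy gradient algorithms like PPO and REINFORCE.”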
Humans are definitely model-based RL learners at least some of the time—particularly for important decisions.[1] So the claim doesn’t apply to them. I also don’t think it applies to any other capable agent. TurnTrout actually makes a congruent claim in his other post Think carefully before calling RL policies “agents”. Model-free RL algorithms only have limited agency, what I’d call level 1-of-3:
1. Trained to achieve some goal/reward.
   Habitual behavior/model-free RL
2. Predicts outcomes of actions and selects ones that achieve a goal/reward.
   Model-based RL
3. Selects future states that achieve a goal/reward and then plans actions to achieve that state.
   No corresponding terminology (“goal-directed” from neuroscience applies to levels 2 and even 1[1]), but pretty clearly highly useful for humans
That’s from my post Steering subsystems: capabilities, agency, and alignment.
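To make the three levels concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the five-position toy world, the names `train_q`, `level1_act`, `level2_act` and `level3_plan`, and the choice of breadth-first search as the planner. None of it comes from the linked post.

```python
from collections import deque

# Hypothetical toy world: positions 0..4 on a line, reward only at position 4.
N_STATES = 5
ACTIONS = (-1, +1)

def step(state, action):
    """Deterministic move, clipped at both ends of the line."""
    return min(max(state + action, 0), N_STATES - 1)

def reward(state):
    return 1.0 if state == N_STATES - 1 else 0.0

def train_q(sweeps=50, gamma=0.9):
    """Level 1 training: cache action values from experienced transitions."""
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(sweeps):
        for (s, a) in q:
            nxt = step(s, a)  # the environment is only sampled here, as experience
            q[(s, a)] = reward(nxt) + gamma * max(q[(nxt, b)] for b in ACTIONS)
    return q

def level1_act(q, state):
    """Level 1 (habitual/model-free): act from cached values, no prediction."""
    return max(ACTIONS, key=lambda a: q[(state, a)])

def level2_act(state):
    """Level 2 (model-based): predict each action's outcome one step ahead
    and pick the best predicted reward. Ties fall back to the first action."""
    return max(ACTIONS, key=lambda a: reward(step(state, a)))

def level3_plan(state):
    """Level 3: first select a goal state, then plan actions that reach it."""
    goal = max(range(N_STATES), key=reward)
    frontier = deque([(state, [])])
    seen = {state}
    while frontier:  # breadth-first search through the world model
        s, path = frontier.popleft()
        if s == goal:
            return goal, path
        for a in ACTIONS:
            nxt = step(s, a)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [a]))
    return goal, None
```

In this toy, `level1_act(train_q(), 0)` moves right purely from habit, `level2_act(3)` moves right because it predicts the reward one step away, and `level3_plan(0)` picks position 4 as the goal and returns the four-step plan to reach it. The one-step lookahead in `level2_act` is blind from position 0, which is the gap the third level fills.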
But humans don’t seem to optimize for reward all that often! They make self-sacrificial decisions that get them killed. And they usually say they’d refuse to get in Nozick’s experience machine, which would hypothetically remove them from this world and give them a simulated world of maximally-rewarding experiences. They seem to optimize for the things that have given them reward, like protecting loved ones, rather than for reward itself, just as TurnTrout describes in RINTOT. And humans are model-based for important decisions, presumably using sophisticated models. What gives?
My cognitive neuroscience research focused a lot on dopamine, so I’ve thought a lot about how reward shapes human behavior. The most complete publication is Neural mechanisms of human decision-making as a summary of how humans seem to learn complex behaviors using reward and predictions of reward. But that’s not really a very good description of the overall theory, because neuroscientists are highly suspicious of broad theories, and because I didn’t really want to accidentally accelerate AGI research by describing brain function clearly. I know.
I think humans do optimize for reward; we just do it badly. We do see some sophisticated hedonists with exceptional amounts of time and money say things like “I love new experiences”. That abstracts away almost all of the specifics. Yudkowsky’s “fun theory” also describes a pursuit of reward if you grant that “fun” refers to frequent, strong dopamine spikes (I think that’s exactly what we mean by fun). I think more sophisticated hedonists will get in the experience box, but this is complicated by the approximations in human decision-making. It’s pretty likely that the suffering you’d cause your loved ones by getting in the box and leaving them alone would be so salient, and produce such a strong negative reward prediction, that it would outweigh all of the many positive predictions of reward. That would happen purely because of salience and our inefficient way of roughly totaling predicted future reward: we imagine salient outcomes and roughly average over their reward predictions.
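A toy calculation shows how salience weighting can flip a decision. This is only a sketch under invented assumptions: the two outcomes, their reward values, and the `share` and `salience` weights are all made up for illustration, and `weighted_average` is a hypothetical helper, not a model from the neuroscience literature.

```python
def weighted_average(outcomes, weight_key):
    """Average predicted reward over imagined outcomes, using the given weights."""
    total = sum(o[weight_key] for o in outcomes)
    return sum(o[weight_key] * o["reward"] for o in outcomes) / total

# Hypothetical predictions for "get in the experience box".
# share:    how much of the actual future reward stream the outcome accounts for
# salience: how vividly the outcome comes to mind when deciding
box_outcomes = [
    {"name": "decades of maximally rewarding experience",
     "reward": 10.0, "share": 0.9, "salience": 0.2},
    {"name": "loved ones left alone and suffering",
     "reward": -20.0, "share": 0.1, "salience": 0.8},
]

accurate_estimate = weighted_average(box_outcomes, "share")     # about +7
salient_estimate = weighted_average(box_outcomes, "salience")   # about -14
```

With these made-up numbers, an accurate total favors getting in the box while the salience-weighted estimate rejects it, even though both use the same reward predictions.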
So I think the more rational and cognitively capable a human is, the more strictly and accurately they’ll optimize for future reward. And I think the same is true of model-based RL systems with any decent decision-making process.
I realize this isn’t the empirically-based answer you asked for. I think the answer has to be based on theory, because some systems will and some won’t optimize for reward. I don’t know the ML RL literature nearly as well as I know the neuroscience RL literature, so there might be some really relevant stuff out there I’m not aware of. I doubt it, because this is such an AI-safety question.[2]
So that’s why I think reward is the optimization target for smart RL agents.
Edit: Thus, RINTOT and similar work have, I think, really confused the AGI safety debate by making strong claims about current AI that don’t apply at all to the AGI we’re worried about. I’ve been thinking about this a lot in the context of a post I’d call “Current AI and alignment theory is largely behaviorist. Expect a cognitive revolution”.
[1] For more than you want to know about the various terminologies, see How sequential interactive processing within frontostriatal loops supports a continuum of habitual to controlled processing.
We debated the terminologies habitual/goal-directed, automatic and controlled, system 1/system 2, and model-free/model-based for years. All of them have limitations, and all of them mean slightly different things. In particular, model-based is vague terminology when systems get more complex than simple RL—but it is very clear that many complex human decisions (certainly ones in which we envision possible outcomes before taking actions) are far on the model-based side, and meet every definition.
[2] One follow-on question is whether RL-based AGI will wirehead. I think this is almost the same question as getting into the experience box, except that that box will only keep going if the AGI engineers it correctly to keep going. So it’s going to have to do a lot of planning before wireheading, unless its decision-making algorithm is highly biased toward near-term rewards over long-term ones. In the course of doing that planning, its other motivations will come into play, like the well-being of humans, if it cares about that. So whether or not our particular AGI will wirehead probably won’t determine our fate.
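The dependence on near-term bias can be sketched with ordinary exponential discounting. The reward streams below are invented for illustration (an unmaintained wireheading setup that pays out for 5 steps, versus 20 steps of engineering followed by 200 steps of payout), and `discounted_return` and `preferred_option` are hypothetical names.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma per time step."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# Hypothetical reward streams over 220 time steps.
wirehead_now = [10.0] * 5 + [0.0] * 215    # pays immediately, then breaks down
plan_first = [0.0] * 20 + [10.0] * 200     # engineer it first, then it keeps going

def preferred_option(gamma):
    """Return the option with the higher discounted return for this gamma."""
    now = discounted_return(wirehead_now, gamma)
    later = discounted_return(plan_first, gamma)
    return "wirehead_now" if now > later else "plan_first"
```

Under these assumptions, a heavily near-term-biased agent (`gamma=0.5`) prefers to wirehead immediately, while a patient one (`gamma=0.99`) prefers to plan first, which is where its other motivations get a chance to come into play.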
It seems we get quite easily addicted to things, which is a form of wireheading. Not just to drugs, but also to various apps and websites.
I have also noticed this ;)
I’d also accept neuroscience RL literature, as well as theories that make useful predictions or give conditions under which RL algorithms optimize for the reward, not just empirical results.
At any rate, I’d like to see your post soon.
That’s probably as much of that post as I’ll get around to. It’s not high on my priority list because I don’t see how it’s a crux for any important alignment theory. I may cover what I think is important about it in the “behaviorist...” post.
Edit: I was going to ask why you were thinking this was important.
It seems pretty cut and dried; even TurnTrout wasn’t claiming this was true beyond model-free RL. I guess LLMs are model-free, so that’s relevant. I just expect them to be turned into agents with explicit goals, so I don’t worry much about how they behave in base form.
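For the model-free case, a REINFORCE-style update on a two-armed bandit shows where reward enters. This is a minimal sketch, not any particular library’s implementation: the softmax policy, the learning rate, the step count and the name `reinforce_bandit` are all assumptions for illustration.

```python
import math
import random

def softmax(logits):
    """Convert logits to action probabilities (shifted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    """Train a softmax policy over bandit arms with the REINFORCE update."""
    rng = random.Random(seed)
    logits = [0.0] * len(arm_rewards)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        r = arm_rewards[action]
        for i in range(len(logits)):
            # Gradient of log pi(action) with respect to logit i.
            grad_log_pi = (1.0 if i == action else 0.0) - probs[i]
            # Reward appears only here, as a scalar that scales the update.
            logits[i] += lr * r * grad_log_pi
    return softmax(logits)
```

Calling `reinforce_bandit([0.0, 1.0])` ends with nearly all probability on the rewarded arm, yet the policy never represents or predicts reward when it acts. Reward only shaped the logits during training.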
IMO, the important crux is whether we really need to secure the reward function against wireheading/tampering. If an RL algorithm optimizes for the reward, you will need much more security and much more robust reward functions than if it doesn’t, because optimization amplifies problems and solutions.