I’d also accept neuroscience RL literature, as well as theories that make useful predictions or give conditions for when RL algorithms optimize for the reward, not just empirical results.
At any rate, I’d like to see your post soon.
That’s probably as much of that post as I’ll get around to. It’s not high on my priority list because I don’t see how it’s a crux for any important alignment theory. I may cover what I think is important about it in the “behaviorist...” post.
Edit: I was going to ask why you were thinking this was important.
It seems pretty cut and dried; even TurnTrout wasn’t claiming this was true beyond model-free RL. I guess LLMs are model-free, so that’s relevant. I just expect them to be turned into agents with explicit goals, so I don’t worry much about how they behave in base form.
FWIW, I strongly disagree with this claim. I believe they are model-based, with the usual datasets & training approaches, even before RLHF/RLAIF.
What do you mean by “model-based”?
Interesting. There’s certainly a lot going on in there, and some of it very likely is at least vague models of future word occurrences (and corresponding events). The definition of model-based gets pretty murky outside of classic RL, so it’s probably best to just directly discuss what model properties give rise to what behavior, e.g. optimizing for reward.
Model-free systems can produce goal-directed behavior. They do this if they have seen some relevant behavior that achieves a given goal, their input or some internal representation includes the current goal, and they can generalize well enough to apply what they’ve experienced to the current context. (This is by the neuroscience definition of habitual vs goal-directed: behavior changes to follow the current goal, usually whether the animal is hungry or thirsty.)
So if they’re strong enough generalizers, I think even a model-free system actually optimizes for reward.
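A toy sketch of the goal-in-the-input condition, assuming a one-step task and a tabular learner (both are simplifications of mine, not anything from the cited work). The learner has no world model, only cached values indexed by (goal, action), yet its behavior switches when the goal in its input switches. The names `GOALS`, `ACTIONS`, `reward`, `train`, and `policy` are hypothetical. A table does no generalizing at all, so this illustrates only the first two conditions, not the generalization one.

```python
# Hypothetical toy: a model-free learner whose input includes the current goal.
GOALS = ["hungry", "thirsty"]
ACTIONS = ["eat", "drink"]


def reward(goal, action):
    """Reward 1.0 when the action satisfies the current need, else 0.0 (assumed)."""
    satisfied = (goal == "hungry" and action == "eat") or (
        goal == "thirsty" and action == "drink"
    )
    return 1.0 if satisfied else 0.0


def train(sweeps=20, alpha=0.5):
    """Learn cached values from experienced reward only; no transition model."""
    q = {(g, a): 0.0 for g in GOALS for a in ACTIONS}
    for _ in range(sweeps):
        for g in GOALS:
            for a in ACTIONS:
                # Move the cached value toward the reward actually received.
                q[(g, a)] += alpha * (reward(g, a) - q[(g, a)])
    return q


def policy(q, goal):
    """Pick the action with the highest cached value for the current goal."""
    return max(ACTIONS, key=lambda a: q[(goal, a)])


q = train()
print(policy(q, "hungry"), policy(q, "thirsty"))
```

The same cached-value lookup produces different behavior under different goals, which is the sense of "goal-directed" used above.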
I think the claim should be stronger: for a smart enough RL system, reward is the optimization target.
IMO, the important crux is whether we really need to secure the reward function against wireheading/tampering. If an RL algorithm optimizes for the reward, you need much more security and much more robust reward functions than if it doesn’t, because optimization amplifies both problems and solutions.
Ah yes. I agree that the wireheading question deserves more thought. I’m not confident that my answer to wireheading applies to the types of AI we’ll actually build—I haven’t thought about it enough.
FWIW the two papers I cited are secondary research, so they branch directly into a massive amount of neuroscience research that indirectly bears on the question in mammalian brains. None of it that I can think of directly addresses the question of whether reward is the optimization target for humans. I’m not sure how you’d empirically test this.
I do think it’s pretty clear that some types of smart, model-based RL agents would optimize for reward. Those are the ones that a) choose actions based on the highest estimated sum of future rewards (like humans seem to, very very approximately), and b) are smart enough to estimate future rewards fairly accurately.
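To make the "highest estimated sum of future rewards" part concrete, here is a minimal sketch, assuming a tiny deterministic world and a depth-limited lookahead. The world (positions 0 to 4 on a line, with reward for arriving at 4), the discount `gamma`, and all function names are hypothetical choices for illustration. This is not a claim about how brains or any particular RL system do it.

```python
# Hypothetical toy: pick the action with the highest estimated sum of future rewards.
def estimated_return(state, depth, actions, step, reward, gamma):
    """Best discounted reward sum the model predicts within `depth` more steps."""
    if depth == 0:
        return 0.0
    return max(
        action_value(state, a, depth, actions, step, reward, gamma) for a in actions
    )


def action_value(state, action, depth, actions, step, reward, gamma):
    """Immediate reward plus discounted best return from the predicted next state."""
    next_state = step(state, action)
    future = estimated_return(next_state, depth - 1, actions, step, reward, gamma)
    return reward(state, action) + gamma * future


def choose_action(state, depth, actions, step, reward, gamma=0.9):
    """Choose the action whose estimated return is highest (depth must be >= 1)."""
    return max(
        actions,
        key=lambda a: action_value(state, a, depth, actions, step, reward, gamma),
    )


# Assumed world model: positions 0..4 on a line; arriving at 4 pays 1.0.
ACTIONS = [-1, 1]


def step(state, action):
    return min(4, max(0, state + action))


def reward(state, action):
    return 1.0 if step(state, action) == 4 else 0.0


print(choose_action(2, depth=3, actions=ACTIONS, step=step, reward=reward))
```

If the lookahead is too shallow to reach any reward, all actions tie and the choice is arbitrary. That is one way to read condition b): the agent only optimizes for reward to the extent that its estimates are good.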
LLMs with RLHF/RLAIF may be the relevant case. They are model-free by TurnTrout’s definition, and I’m happy to accept his use of the terminology. But they do have a powerful critic component (at least in training; I’m not sure about deployment, but probably there too), so it seems possible that they might develop a highly general representation of “stuff that gives the system rewards”. I’m not worried about that, because I think that will happen long after we’ve given them agentic goals, and long after they’ve developed a representation of “stuff humans reward me for doing”, which could be mis-specified enough to lead to doom if it were the only factor.