I’m a bit confused by the intro saying that RL can’t do this, especially since you later say the neocortex is doing model-based RL. I think current model-based RL algorithms would likely do fine on a toy version of this task, with e.g. a 2D binary state space (salt-deprived or not; salt water sprayed or not) and two actions (press lever or no-op). The idea would be:
- Agent explores by pressing the lever, and learns the transition dynamics: pressing the lever ⇒ spray of salt water.
- Planner concludes that any sequence of actions involving pressing the lever will result in a salt-water spray. In a non-salt-deprived state this has negative reward, so the agent avoids it.
- Once the agent becomes salt-deprived, the planner will conclude this has positive reward, and so takes that action (see the code sketch below).
I do agree that a typical model-free RL algorithm is not capable of doing this directly (it could perhaps meta-learn a policy with memory that can solve this).
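As a concrete illustration of the toy version sketched above, here is a minimal Python sketch (all names and reward values are made up) of a planner that does a one-step lookahead with the learned dynamics, assuming it can evaluate the true reward function in arbitrary states, which is exactly the assumption the rest of this thread turns on:

```python
# Toy salt-lever task (illustrative only).
# State: (salt_deprived, salt_water_sprayed); actions: "press" or "noop".
ACTIONS = ["press", "noop"]

def transition(state, action):
    """Learned dynamics: pressing the lever causes a spray of salt water."""
    salt_deprived, _ = state
    return (salt_deprived, action == "press")

def true_reward(state):
    """Ground-truth reward, which this planner is assumed to be able to
    query in arbitrary (even never-visited) states."""
    salt_deprived, sprayed = state
    if not sprayed:
        return 0.0
    return 1.0 if salt_deprived else -1.0

def plan(state):
    """One-step lookahead: pick the action whose predicted next state
    has the highest reward."""
    return max(ACTIONS, key=lambda a: true_reward(transition(state, a)))

print(plan((False, False)))  # 'noop'  -- avoids the aversive spray
print(plan((True, False)))   # 'press' -- switches zero-shot once salt-deprived
```

The zero-shot switch in the last line is doing all the work, and it relies entirely on `true_reward` being callable in a state the agent has never experienced.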
Good question! Sorry I didn’t really explain. The missing piece is “the planner will conclude this has positive reward”. The planner has no basis for coming up with this conclusion, that I can see.
In typical RL as I understand it, regardless of whether it’s model-based or model-free, you learn about what is rewarding by seeing the outputs of the reward function. Like, if an RL agent is playing an Atari game, it does not see the source code that calculates the reward function. It can try to figure out how the reward function works, for sure, but when it does that, all it has to go on is the observations of what the reward function has output in the past. (Related discussion.)
So yeah, in the salt-deprived state, the reward function has changed. But how does the planner know that? It hasn’t seen the salt-deprived state before. Presumably, if you built such a planner, it would go in with a default assumption of “the salt-deprivation state is different now from anything I’ve ever seen before; I’ll just assume that it doesn’t affect the reward function!” Or, at best, its default assumption would be “the salt-deprivation state is different now from anything I’ve ever seen before; I don’t know how or whether that impacts the reward function. I should increase my uncertainty. Maybe explore more.” In this experiment the rats did neither of those things; instead they acted like “the salt-deprivation state is different from anything I’ve ever seen, and I specifically know that, in this new state, very salty things are now very rewarding”. They were not behaving as if they were newly uncertain about the reward consequences of the lever; they were absolutely gung-ho about pressing it.

Sorry if I’m misunderstanding :-)
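To make the missing piece concrete, here is the same toy planner, but with the ground-truth reward replaced by an estimate fit only to previously observed (state, reward) pairs; this is a hypothetical sketch, and the fallback value is an assumption, but it shows why the planner has no basis for concluding anything new about the never-visited salt-deprived state:

```python
# Continuing the toy example: the agent has only ever been in the
# non-deprived state, so its reward estimate is fit to (or, here, simply
# looked up from) past observations only.
past_experience = {
    # (salt_deprived, sprayed): observed reward
    (False, False): 0.0,
    (False, True): -1.0,
}

def learned_reward(state, default=0.0):
    """Reward estimate from past observations; unseen states fall back to a
    default guess ("this doesn't affect the reward function")."""
    return past_experience.get(state, default)

def transition(state, action):
    salt_deprived, _ = state
    return (salt_deprived, action == "press")

def plan(state):
    return max(["press", "noop"],
               key=lambda a: learned_reward(transition(state, a)))

print(learned_reward((True, True)))  # 0.0 -- the default guess, not +1.0
print(plan((True, False)))           # 'press', but only because ties go to the
                                     # first action; nothing favors the lever
```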
Thanks for the clarification! I agree that if the planner does not have access to the reward function, then it will not be able to solve the task. Though, as you say, it could explore more given the uncertainty.

Most model-based RL algorithms I’ve seen assume they can evaluate the reward function in arbitrary states. Moreover, it seems to me like this is the key thing that lets the rats solve the problem. I don’t see how you solve this problem in general in a sample-efficient manner otherwise.

One class of model-based RL approaches is based on [model-predictive control](https://en.wikipedia.org/wiki/Model_predictive_control): sample random action sequences, roll out the trajectories in the model, pick the trajectory with the highest return, take the first action from that trajectory, then replan (see the sketch below). That said, assumptions vary: [iLQR](https://en.wikipedia.org/wiki/Linear%E2%80%93quadratic_regulator) makes the stronger assumption that the reward is quadratic and differentiable.

I think methods based on [Monte Carlo tree search](https://en.wikipedia.org/wiki/Monte_Carlo_tree_search) might exhibit something like the problem you discuss. Since they sample actions from a policy trained to maximize reward, they might end up not exploring enough in this novel state if the policy is very confident it should not drink the salt water. That said, they typically include explicit exploration methods such as [UCB](https://en.wikipedia.org/wiki/Thompson_sampling#Upper-Confidence-Bound_(UCB)_algorithms), which should mitigate this.
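For concreteness, a bare-bones version of the random-shooting model-predictive control loop described above; `model` and `reward_fn` are placeholders to be supplied by the caller, and everything else (horizon, sample count) is arbitrary:

```python
import random

def mpc_action(state, model, reward_fn, actions, horizon=5, n_samples=100):
    """Random-shooting MPC: sample random action sequences, roll them out in
    the learned model, score each imagined trajectory by its return, then
    execute only the first action of the best sequence (and replan next step)."""
    best_return, best_first_action = float("-inf"), None
    for _ in range(n_samples):
        seq = [random.choice(actions) for _ in range(horizon)]
        s, ret = state, 0.0
        for a in seq:
            s = model(s, a)      # predicted next state
            ret += reward_fn(s)  # note: evaluated in *imagined* states
        if ret > best_return:
            best_return, best_first_action = ret, seq[0]
    return best_first_action
```

Whether this solves the salt task hinges entirely on whether `reward_fn` is the ground-truth reward function or one learned from past observations, which is exactly the distinction the replies below turn on.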
> Most model-based RL algorithms I’ve seen assume they can evaluate the reward function in arbitrary states.
Hmm. AlphaZero can evaluate the true reward function in arbitrary states. MuZero can’t; it tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly). I googled “model-based RL Atari” and the first hit was this, which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly). I’m not intimately familiar with the deep RL literature, so I wouldn’t know what’s typical and I’ll take your word for it, but it does seem that both possibilities are out there.
Anyway, I don’t think the neocortex can evaluate the true reward function in arbitrary states, because it’s not a neat mathematical function; it involves messy things like the outputs of millions of pain receptors, hormones sloshing around, the input-output relationships of entire brain subsystems containing tens of millions of neurons, etc. So I presume that the neocortex tries to learn the reward function by supervised learning from observations of past rewards, and that’s the whole thing with TD learning and dopamine.
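As a toy illustration of what “learning the reward function from observations of past rewards” looks like mechanically, here is a minimal TD(0) update; the values are made up, and the reward-prediction-error term `delta` is the quantity usually analogized to phasic dopamine:

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0): nudge the value estimate of state s toward the observed reward
    plus the discounted estimate of the next state."""
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)  # reward-prediction error
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta

V = {}
# The estimate for a state is only ever informed by rewards actually observed;
# there is no way to query what the reward *would* be in a hypothetical,
# never-visited state (e.g. "salty water while salt-deprived").
td_update(V, s="pressed_lever", r=-1.0, s_next="sprayed")
print(V["pressed_lever"])  # -0.1
```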
I added a new sub-bullet at the top to clarify that it’s hard to explain by RL unless you assume the planner can query the ground-truth reward function in arbitrary hypothetical states. And then I also added a new paragraph to the “other possible explanations” section at the bottom saying what I said in the paragraph just above. Thank you.
> I don’t see how you solve this problem in general in a sample-efficient manner otherwise.
Well, the rats are trying to do the rewarding thing after zero samples, so I don’t think “sample-efficiency” is quite the right framing.
In ML today, the reward function is typically a function of states and actions, not “thoughts”. In a brain, the reward can depend directly on what you’re imagining doing or planning to do, or even just what you’re thinking about. That’s my proposal here.
Well, I guess you could say that this is still a “normal MDP”, but where “having thoughts” and “having ideas” etc. are part of the state/action space. But anyway, I think that’s a bit different from how most ML people would normally think about things.
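One way to write down that “normal MDP, but with thoughts in the state” framing, purely as an illustrative sketch (the fields and reward values are hypothetical, not a claim about how the brain actually represents anything):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentState:
    salt_deprived: bool   # physiological state
    current_thought: str  # what the agent is imagining or planning right now

def reward(state: AgentState) -> float:
    """Reward that depends directly on the current thought, not only on
    external states and actions."""
    if state.current_thought == "imagine_tasting_salt" and state.salt_deprived:
        return 1.0  # the *idea* of salt is itself appealing when deprived
    return 0.0

print(reward(AgentState(salt_deprived=True, current_thought="imagine_tasting_salt")))   # 1.0
print(reward(AgentState(salt_deprived=False, current_thought="imagine_tasting_salt")))  # 0.0
```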
> I googled “model-based RL Atari” and the first hit was this, which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly)
Ah, the “model-based using a model-free RL algorithm” approach :) They learn a world model using supervised learning, and then use PPO (a model-free RL algorithm) to train a policy in it. It sounds odd but it makes sense: you hopefully get much of the sample efficiency of model-based training, while still retaining the state-of-the-art results of model-free RL. You’re right that in this setup, as the actions are being chosen by the (model-free RL) policy, you don’t get any zero-shot generalization.
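Schematically, and only as a sketch of the structure of that approach (the three callables are placeholders, and any model-free learner could stand in for PPO), the loop looks something like this:

```python
def train_model_based_via_model_free(collect_real_rollout, fit_world_model,
                                     improve_policy_in_model, n_iters=10):
    """Skeleton of the recipe: (1) collect real transitions, (2) fit the world
    model by supervised learning, (3) improve the policy with a model-free RL
    algorithm using only imagined rollouts inside the learned model."""
    data, policy = [], None
    for _ in range(n_iters):
        data += collect_real_rollout(policy)         # (s, a, r, s') from the real env
        world_model = fit_world_model(data)          # supervised learning
        policy = improve_policy_in_model(world_model, policy)  # model-free RL in imagination
    return policy
```

Since actions at test time come from the trained policy rather than from a planner querying a reward function, a change in the reward function that never shows up in `data` can’t change the behavior zero-shot.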
> I added a new sub-bullet at the top to clarify that it’s hard to explain by RL unless you assume the planner can query the ground-truth reward function in arbitrary hypothetical states. And then I also added a new paragraph to the “other possible explanations” section at the bottom saying what I said in the paragraph just above. Thank you.
Thanks for updating the post to clarify this point—I agree with you with the new wording.
> In ML today, the reward function is typically a function of states and actions, not “thoughts”. In a brain, the reward can depend directly on what you’re imagining doing or planning to do, or even just what you’re thinking about. That’s my proposal here.
Yes indeed, your proposal is quite different from RL. The closest I can think of to rewards over “thoughts” in ML would be regularization terms that take into account weights or, occasionally, activations—but that’s very crude compared to what you’re proposing.
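For reference, the kind of activation-level regularization being gestured at, in a toy PyTorch-style form (the architecture and penalty weight are arbitrary):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))

def loss_with_activation_penalty(x, y, beta=1e-3):
    """Task loss plus an L2 penalty on hidden activations -- a (very crude)
    case of the objective caring about internal activity, not just outputs."""
    hidden = torch.relu(net[0](x))  # activations of the first layer
    pred = net[2](hidden)
    task_loss = nn.functional.mse_loss(pred, y)
    return task_loss + beta * hidden.pow(2).mean()

x, y = torch.randn(8, 4), torch.randn(8, 1)
print(loss_with_activation_penalty(x, y))
```

The penalty makes the objective care about internal activity, but it is a fixed, hand-chosen term, nothing like a learned assessment of what the network is “thinking about”.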