Thanks for the clarification! I agree that if the planner does not have access to the reward function, then it will not be able to solve the problem. Though, as you say, it could explore more given the uncertainty.
Most model-based RL algorithms I’ve seen assume they can evaluate the reward function in arbitrary states. Moreover, it seems to me like this is the key thing that lets the rats solve the problem. I don’t see how you solve this problem in general in a sample-efficient manner otherwise.
One class of model-based RL approaches is based on [model-predictive control](https://en.wikipedia.org/wiki/Model_predictive_control): sample random action sequences, “roll out” the resulting trajectories in the model, pick the trajectory with the highest return, take the first action of that trajectory, and then replan. That said, assumptions vary. [iLQR](https://en.wikipedia.org/wiki/Linear%E2%80%93quadratic_regulator) makes the stronger assumption that the reward is quadratic and differentiable.
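Concretely, the loop I have in mind is something like this sketch, where `step_model` and `reward_fn` are toy placeholders I made up to stand in for a learned dynamics model and a queryable reward function (not any particular paper’s implementation):

```python
import numpy as np

# Toy stand-ins for a learned dynamics model and a queryable reward function.
def step_model(state, action):
    return state + 0.1 * action

def reward_fn(state, action):
    return -np.sum(state ** 2) - 0.01 * np.sum(action ** 2)

def mpc_action(state, horizon=10, n_candidates=100, action_dim=2, rng=None):
    """Random-shooting MPC: sample action sequences, roll each one out in the
    model, and return the first action of the highest-return sequence."""
    rng = np.random.default_rng() if rng is None else rng
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in actions:
            total += reward_fn(s, a)
            s = step_model(s, a)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action   # execute this one action, then replan

print(mpc_action(np.array([1.0, -0.5])))
```

Note that the inner loop queries `reward_fn` in states the agent has never actually visited, which is exactly the assumption at issue here.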
I think methods based on [Monte Carlo tree search](https://en.wikipedia.org/wiki/Monte_Carlo_tree_search) might exhibit something like the problem you discuss. Since they sample actions from a policy trained to maximize reward, they might end up not exploring enough in this novel state if the policy is very confident that it should not drink the salt water. That said, they typically include explicit exploration bonuses like [UCB](https://en.wikipedia.org/wiki/Thompson_sampling#Upper-Confidence-Bound_(UCB)_algorithms), which should mitigate this.
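To illustrate the difference, here’s a rough sketch of two selection rules at a single tree node; the constants, the toy numbers, and the prior-weighted (AlphaZero-style) variant are purely illustrative, not anyone’s exact implementation:

```python
import numpy as np

# Q: mean value of each child so far, N: visit counts, P: prior from the
# trained policy. Constants and the +1 smoothing are arbitrary choices.

def ucb1(Q, N, c=1.4):
    # Plain UCB1: an untried action gets an unbounded bonus, so every
    # action is eventually explored no matter how bad it looks a priori.
    total = N.sum()
    bonus = np.where(N > 0, c * np.sqrt(np.log(total + 1) / np.maximum(N, 1)), np.inf)
    return Q + bonus

def puct(Q, N, P, c=1.5):
    # Prior-weighted bonus: a child the policy is very confident against
    # keeps a small bonus, so it may stay under-explored.
    total = N.sum()
    return Q + c * P * np.sqrt(total + 1) / (1 + N)

Q = np.array([0.2, -0.9, 0.1])
N = np.array([10, 0, 5])
P = np.array([0.10, 0.02, 0.88])
print(np.argmax(ucb1(Q, N)), np.argmax(puct(Q, N, P)))
```

With these toy numbers, plain UCB1 immediately tries the never-visited action, while the prior-weighted rule keeps avoiding it; that’s roughly the under-exploration worry, and also why the explicit bonus matters.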
> Most model-based RL algorithms I’ve seen assume they can evaluate the reward function in arbitrary states.

Hmm. AlphaZero can evaluate the true reward function in arbitrary states. MuZero can’t; it tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly). I googled “model-based RL Atari” and the first hit was this, which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly). I’m not intimately familiar with the deep RL literature, so I don’t know what’s typical and I’ll take your word for it, but it does seem that both possibilities are out there.
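To spell out what I mean by “learn the reward function by supervised learning from observations of past rewards”, here’s a minimal sketch with synthetic data and a linear model standing in for whatever reward head those systems actually use:

```python
import numpy as np

rng = np.random.default_rng(0)

# Logged experience: states, actions, and the rewards that were actually received.
states = rng.normal(size=(500, 4))
actions = rng.normal(size=(500, 2))
true_w = rng.normal(size=6)                      # hidden "true" reward weights (toy)
X = np.concatenate([states, actions], axis=1)
observed_rewards = X @ true_w + 0.1 * rng.normal(size=500)

# Supervised regression of reward on (state, action) features.
w_hat, *_ = np.linalg.lstsq(X, observed_rewards, rcond=None)

def predicted_reward(state, action):
    return np.concatenate([state, action]) @ w_hat

# The planner can now query predicted_reward in arbitrary hypothetical states,
# but it only knows about the kinds of rewards that have shown up in the data.
print(predicted_reward(np.zeros(4), np.ones(2)))
```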
Anyway, I don’t think the neocortex can evaluate the true reward function in arbitrary states, because it’s not a neat mathematical function: it involves messy things like the outputs of millions of pain receptors, hormones sloshing around, the input-output relationships of entire brain subsystems containing tens of millions of neurons, and so on. So I presume that the neocortex tries to learn the reward function by supervised learning from observations of past rewards, and that’s the whole thing with TD learning and dopamine.
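In other words, something like textbook TD(0), where a value estimate is trained against observed rewards and the update is driven by the prediction error (the quantity usually identified with phasic dopamine). A toy sketch with a made-up three-state chain:

```python
import numpy as np

alpha, gamma = 0.1, 0.9
V = np.zeros(3)  # value estimates for states 0, 1, 2 (state 2 is terminal)

# One observed episode: 0 -> 1 (no reward), then 1 -> 2 (reward 1.0, done).
episode = [(0, 0.0, 1), (1, 1.0, 2)]

for _ in range(500):
    for s, r, s_next in episode:
        td_error = r + gamma * V[s_next] - V[s]  # reward prediction error
        V[s] += alpha * td_error

print(V)  # roughly [0.9, 1.0, 0.0]: value propagates back from past rewards
```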
I added a new sub-bullet at the top to clarify that it’s hard to explain by RL unless you assume the planner can query the ground-truth reward function in arbitrary hypothetical states. And then I also added a new paragraph to the “other possible explanations” section at the bottom saying what I said in the paragraph just above. Thank you.
> I don’t see how you solve this problem in general in a sample-efficient manner otherwise.

Well, the rats are trying to do the rewarding thing after zero samples, so I don’t think “sample efficiency” is quite the right framing.
In ML today, the reward function is typically a function of states and actions, not “thoughts”. In a brain, the reward can depend directly on what you’re imagining doing or planning to do, or even just what you’re thinking about. That’s my proposal here.
Well, I guess you could say that this is still a “normal MDP”, but one where “having thoughts”, “having ideas”, etc. are part of the state/action space. But anyway, I think that’s a bit different from how most ML people would normally think about things.
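If you did want to shoehorn it into that framing, it might look something like the sketch below, where the “thought” is just another field of the state that the reward function can read; all the names here are invented for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ThoughtAugmentedState:
    observation: tuple            # external world state
    imagined_plan: Optional[str]  # what the agent is currently considering doing

def reward(state: ThoughtAugmentedState, action: str) -> float:
    # Reward can fire as soon as the agent merely *imagines* the right plan,
    # before anything is executed in the world.
    if state.imagined_plan == "drink salt water":
        return 1.0
    return 0.0

s = ThoughtAugmentedState(observation=(0, 0), imagined_plan="drink salt water")
print(reward(s, "no-op"))  # 1.0: reward triggered by the thought itself
```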
> I googled “model-based RL Atari” and the first hit was this, which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly)

Ah, the “model-based using a model-free RL algorithm” approach :) They learn a world model using supervised learning, and then use PPO (a model-free RL algorithm) to train a policy inside it. It sounds odd, but it makes sense: you hopefully get much of the sample efficiency of model-based training while still retaining the state-of-the-art results of model-free RL. You’re right that in this setup, since the actions are being chosen by the (model-free RL) policy, you don’t get any zero-shot generalization.
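So the recipe is roughly: (1) fit a world model to real transitions with supervised learning, then (2) run a model-free RL algorithm entirely inside the learned model. Here’s a toy sketch of that two-stage structure, with a least-squares model and a crude return-based policy search standing in for the paper’s video model and PPO:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" environment with scalar state and action (unknown to the agent).
def real_env_step(s, a):
    s_next = 0.8 * s + a + 0.05 * rng.normal()
    return s_next, -s_next ** 2            # reward: keep the state near zero

# Step 1: collect real experience, then fit dynamics and reward models by
# supervised learning (least squares on simple hand-picked features).
def features(s, a):
    return np.array([s, a, s * s, s * a, a * a])

data, s = [], 1.0
for _ in range(200):
    a = rng.uniform(-1.0, 1.0)
    s_next, r = real_env_step(s, a)
    data.append((s, a, s_next, r))
    s = s_next

X = np.array([features(s_, a_) for s_, a_, _, _ in data])
W_dyn, *_ = np.linalg.lstsq(X, np.array([sn for _, _, sn, _ in data]), rcond=None)
W_rew, *_ = np.linalg.lstsq(X, np.array([r for *_, r in data]), rcond=None)

def model_step(s, a):                      # the learned model the policy trains in
    x = features(s, a)
    return float(x @ W_dyn), float(x @ W_rew)

# Step 2: "model-free" policy improvement run entirely inside the model; here
# a crude search over the gain k of a feedback policy a = -k * s.
def return_in_model(k, s0=1.0, horizon=20):
    s, total = s0, 0.0
    for _ in range(horizon):
        s, r = model_step(s, -k * s)
        total += r
    return total

ks = np.linspace(0.0, 1.5, 16)
print("best gain found inside the learned model:", ks[np.argmax([return_in_model(k) for k in ks])])
```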
> I added a new sub-bullet at the top to clarify that it’s hard to explain by RL unless you assume the planner can query the ground-truth reward function in arbitrary hypothetical states. And then I also added a new paragraph to the “other possible explanations” section at the bottom saying what I said in the paragraph just above. Thank you.

Thanks for updating the post to clarify this point; with the new wording, I agree with you.
> In ML today, the reward function is typically a function of states and actions, not “thoughts”. In a brain, the reward can depend directly on what you’re imagining doing or planning to do, or even just what you’re thinking about. That’s my proposal here.

Yes indeed, your proposal is quite different from RL as it’s typically practiced. The closest thing I can think of in ML to rewards over “thoughts” would be regularization terms that take weights or, occasionally, activations into account, but that’s very crude compared to what you’re proposing.
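By that I mean something like the sketch below: an ordinary supervised loss with a penalty attached to the network’s internal activations, alongside the usual weight decay. Everything here is a toy I made up, just to show where the penalty attaches:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 10)), rng.normal(size=(64,))
W1, W2 = 0.1 * rng.normal(size=(10, 32)), 0.1 * rng.normal(size=(32,))

def loss(W1, W2, weight_decay=1e-4, activation_penalty=1e-3):
    h = np.maximum(X @ W1, 0.0)               # hidden activations
    pred = h @ W2
    mse = np.mean((pred - y) ** 2)            # ordinary task loss
    reg_w = weight_decay * (np.sum(W1 ** 2) + np.sum(W2 ** 2))   # usual weight decay
    reg_h = activation_penalty * np.mean(np.abs(h))              # penalty on activations
    return mse + reg_w + reg_h

print(loss(W1, W2))
```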