Let’s talk about model-free RL (leaving aside whether it’s relevant to neuroscience—I think it mostly isn’t).
If you have a parametrized reward function R(a,b,c…), then you can also send the parameters a,b,c as “interoceptive inputs” informing the policy. And then the policy would (presumably) gradually learn to take appropriate actions that vary with the reward function.
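Here is one way to make that concrete. This is a minimal sketch, not anything from above: the reward function, the payoff numbers, and the delta-rule learner are all invented for illustration. The point is just that the policy's input includes the current reward parameters (here, a and b), so it can learn a different best action for different parameter settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(action, a, b):
    # Illustrative parametrized reward R(a, b): made-up payoffs, not from the post.
    payoffs = np.array([a, b, 0.5 * (a + b)])
    return payoffs[action] + 0.1 * rng.standard_normal()

# Tiny linear value-learner whose inputs include the "interoceptive" parameters (a, b).
W = np.zeros((3, 3))      # one weight row per action: [bias, a, b]
alpha, eps = 0.1, 0.1     # learning rate, exploration rate

for step in range(5000):
    a, b = rng.uniform(0, 1, size=2)           # current reward-function parameters,
                                               # fed to the policy as ordinary inputs
    x = np.array([1.0, a, b])                  # feature vector (bias + parameters)
    q = W @ x                                  # predicted reward for each action
    action = rng.integers(3) if rng.random() < eps else int(np.argmax(q))
    r = reward(action, a, b)
    W[action] += alpha * (r - q[action]) * x   # simple delta-rule update
```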
I actually think it’s kinda meaningless to say that the reward function is parametrized in the first place. If I say “a,b,c,… are parameters that change a parametrized reward function”, and you say “a,b,c,… are environmental variables that are relevant to the reward function” … are we actually disagreeing about anything of substance? I think we aren’t. In either case, you can do a lot better, faster, if the policy has direct access to a,b,c,… among its sensory inputs, and if a,b,c,… contribute to reward in a nice smooth way, etc.
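If you like notation, here's one way to put the "no disagreement of substance" point (this framing is mine, not a quote): a reward function "parametrized" by θ is just an ordinary fixed reward function on a state augmented to include θ,

$$R_\theta(s, a) \;=\; \tilde{R}\big((s, \theta),\, a\big), \qquad \tilde{s} := (s, \theta).$$

Relabeling θ as "reward parameters" versus "environment/internal variables" changes nothing about the learning problem; what actually matters is whether θ shows up among the policy's observations.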
For example, let’s say there are three slot machines. Every now and then, their set of odds totally changes, with no external indication. Whenever the switchover happens, I would make bad decisions for a while until I learned to adapt to the new odds. I claim that this is isomorphic to a different problem where the slot machines are the same, but each of them spits out food sometimes, and friendship sometimes, and rest sometimes, with different odds, and meanwhile my physiological state sometimes suddenly changes, and where I have no interoceptive access to that. When my physiological state changes, I would make bad decisions for a while until I learned to adapt to the new reward function. In the first case, I do better when there’s an indicator light that encodes the current odds of the three slot machines. In the second case, I do better with interoceptive access to how hungry and sleepy I am. So in all respects, I think the two situations are isomorphic. But only one of them seems to have a parametrized reward function.
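To spell out the claimed isomorphism in code, here is a sketch of the two problems side by side. All the numbers are invented for illustration and neither environment is meant to match any real experiment; the point is that a learner which only sees the scalar reward faces the same non-stationary bandit either way.

```python
import numpy as np

rng = np.random.default_rng(1)

class SwitchingOdds:
    """Problem 1: three machines whose payout odds occasionally jump, unsignaled."""
    def __init__(self):
        self.odds = rng.uniform(0, 1, size=3)
    def step(self, action):
        if rng.random() < 0.001:                     # rare, unsignaled switch of the odds
            self.odds = rng.uniform(0, 1, size=3)
        return float(rng.random() < self.odds[action])

class SwitchingNeeds:
    """Problem 2: the machines are fixed, but a hidden 'physiological state'
    decides how rewarding each machine's output (food / friendship / rest) is."""
    def __init__(self):
        self.payout_prob = np.array([0.8, 0.8, 0.8])     # the machines never change
        self.need_weights = rng.uniform(0, 1, size=3)    # hidden internal state
    def step(self, action):
        if rng.random() < 0.001:                     # rare, unsignaled physiological change
            self.need_weights = rng.uniform(0, 1, size=3)
        delivered = rng.random() < self.payout_prob[action]
        return float(delivered) * self.need_weights[action]

# In both cases the learner's interface is identical: pick one of three arms,
# receive a scalar reward whose statistics occasionally jump with no warning.
# The "indicator light" in Problem 1 or interoception in Problem 2 amounts to
# passing self.odds or self.need_weights to the policy as an extra observation.
```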