This is definitely possible: it essentially amounts to augmenting the state with additional homeostatic variables and then learning policies on the joint state space. However, there are some clever experiments, such as the linked Morrison and Berridge one, demonstrating that this is not all that is going on. Specifically, many animals appear able to perform zero-shot changes in policy when rewards change, even if they have never experienced that specific homeostatic state before; e.g. mice that previously disliked salt water suddenly chase after it when put in a state of salt deprivation they had never before experienced. A rough sketch of the joint-state approach is below.
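(A minimal toy sketch, assuming tabular Q-learning and made-up observation/homeostatic labels, just to pin down what "learning policies on the joint state space" means; none of this is taken from the experiment itself.)

```python
from collections import defaultdict

N_ACTIONS = 2  # e.g. approach vs. avoid the salt spout (illustrative labels)

# Q is indexed by a *joint* state: (environment observation, discretized
# homeostatic level such as salt deprivation). Values at a given homeostatic
# level are only learned through experience at that level, which is exactly
# the limitation the zero-shot salt result points at.
Q = defaultdict(float)

def q_update(obs, homeo, action, reward, next_obs, next_homeo,
             alpha=0.1, gamma=0.95):
    """One tabular Q-learning step on the joint (observation, homeostatic) state."""
    s, s_next = (obs, homeo), (next_obs, next_homeo)
    best_next = max(Q[(s_next, a)] for a in range(N_ACTIONS))
    Q[(s, action)] += alpha * (reward + gamma * best_next - Q[(s, action)])

# Hypothetical usage: an experience tuple gathered while salt-sated.
q_update(obs="salt_spout", homeo="sated", action=0, reward=-1.0,
         next_obs="salt_spout", next_homeo="sated")
```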
The above is describing the model-free component of learning reward-function-dependent policies. The Morrison and Berridge salt experiment demonstrates the model-based side, which probably comes from imagining specific outcomes and how they'd feel.
This is where I disagree! I don't think the Morrison and Berridge experiment demonstrates the model-based side. It is consistent with model-based RL, but it is also consistent with model-free algorithms that can flexibly adapt to changing reward functions, such as linear RL. Personally, I think the latter is more likely, since this is such a low-level response, one that can be modulated entirely by subcortical systems, and so seems unlikely to require model-based planning to work.
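To illustrate what I mean by flexible model-free adaptation, here is a toy sketch in the spirit of linear RL (Piray & Daw). The task, step costs, and reward vectors are invented, and this is my loose reading of the framework rather than a reimplementation of their model. The point is just that the default representation is learned once, independently of reward, and a new reward vector can then be plugged in to revalue states and flip the policy zero-shot:

```python
import numpy as np

# Tiny task: start -> corridor -> {plain water, salt water}.
# Nonterminal states: 0 = start, 1 = corridor; terminal states: 0 = plain, 1 = salt.
T_NN = np.array([[0.0, 1.0],    # default dynamics among nonterminal states
                 [0.0, 0.0]])
T_NT = np.array([[0.0, 0.0],    # default dynamics from nonterminal to terminal states
                 [0.5, 0.5]])
r_N = np.array([-0.05, -0.05])  # small per-step cost at the nonterminal states

# Default representation: reward-independent, so it only has to be learned once.
M = np.linalg.inv(np.diag(np.exp(-r_N)) - T_NN)

def revalue(r_T):
    """Plug a new terminal reward vector into the fixed default representation."""
    z_T = np.exp(r_T)             # desirability of the terminal states
    v_N = np.log(M @ T_NT @ z_T)  # nonterminal values recomputed with no re-learning
    # Decision policy at the corridor: default transition probabilities reweighted
    # by the desirability of each terminal state.
    prefs = T_NT[1] * z_T
    return v_N, prefs / prefs.sum()

# Hypothetical reward vectors: only the value of salt changes with deprivation.
print(revalue(np.array([1.0, -2.0])))  # salt-sated: policy mostly picks plain water
print(revalue(np.array([1.0,  4.0])))  # salt-deprived: policy flips to salt, zero-shot
```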