One solution is to have an RL policy that chooses where to update the model, while using self-supervised (predictive) learning to decide in which direction to update it. For example, the RL policy can choose what to look at / attend to / think about, or more basically, “what to try predicting”; the model then makes that prediction, and we update the model on the prediction error.
Then the RL policy can learn to take actions that account for the value of information.
(This comment is loosely based on one aspect of what I think the brain does.)
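If it helps, here is a minimal toy sketch of the loop I have in mind. Everything in it, the bandit-style policy, the linear predictors, the learning-progress reward, is just illustrative, not a concrete proposal: the policy picks which target to try predicting, the prediction error determines the direction of the model update, and the policy is rewarded by how much that update reduced the error (a crude proxy for value of information).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy world: several "things to look at", each a noisy linear signal.
n_targets, dim = 4, 8
true_w = rng.normal(size=(n_targets, dim))   # unknown ground truth per target
model_w = np.zeros((n_targets, dim))         # self-supervised predictive model
pref = np.zeros(n_targets)                   # RL policy: preferences over targets

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

lr_model, lr_policy = 0.1, 0.5

for step in range(2000):
    # RL policy chooses *what to try predicting* (i.e., where the model gets updated).
    probs = softmax(pref)
    k = rng.choice(n_targets, p=probs)

    # Model makes a prediction for that target; the world returns the outcome.
    x = rng.normal(size=dim)
    y = true_w[k] @ x + 0.1 * rng.normal()
    err = y - model_w[k] @ x

    # Self-supervised learning decides the *direction* of the update (LMS rule).
    model_w[k] += lr_model * err * x

    # Reward the policy by squared-error reduction on this sample
    # (a stand-in for information gained / learning progress).
    reward = err**2 - (y - model_w[k] @ x)**2

    # REINFORCE-style update so the policy favors informative targets.
    grad = -probs
    grad[k] += 1.0
    pref += lr_policy * reward * grad
```

The division of labor is the point: the policy only controls where updates happen, never their direction, so it can learn to seek out whatever is still informative.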
I would hesitate to call it a solution, because my motivation is to prove that partially optimized models are “better than nothing” in some sense, and transforming prediction problems into RL problems sounds like something that makes proofs harder rather than easier. But I agree that it could mitigate the problems that arise in practice, even if it doesn’t solve the theory.