You mentioned that this metaphor should also include world models. I can help there.
Many world models try to predict the next state of the world given the agent’s action. With curiosity-driven exploration, the agent tries to explore in a way that maximizes its reduction of surprise, allowing it to learn about its effect on the world (see for example https://arxiv.org/abs/1705.05363). Why not just maximize surprise? Because we want surprise we can learn to decrease, not the constant surprise of a TV showing static.
This means these methods focus the exploration reward on finding novel states. Specifically, novel states caused by the agent’s own actions, since those are the most salient. We could rephrase this as “novel changes the agent has control over” (sketched in code below). But what counts as an action, and what can the agent control?
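To make the mechanics concrete, here is a minimal PyTorch sketch of an ICM-style curiosity bonus, loosely following the paper linked above; the class name, dimensions, and layer sizes are made up for illustration, not a faithful reproduction of the paper’s architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical dimensions, chosen only for illustration.
OBS_DIM, FEAT_DIM, N_ACTIONS = 16, 32, 4

class ICMSketch(nn.Module):
    """Sketch of an Intrinsic Curiosity Module (Pathak et al. 2017).

    The forward model's prediction error in feature space is the
    exploration bonus; the inverse model shapes the features so that
    they emphasize aspects of the state the agent's actions influence.
    """
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(OBS_DIM, FEAT_DIM), nn.ReLU())
        # Forward model: predict next features from current features + action.
        self.forward_model = nn.Linear(FEAT_DIM + N_ACTIONS, FEAT_DIM)
        # Inverse model: predict the action from consecutive feature pairs.
        self.inverse_model = nn.Linear(2 * FEAT_DIM, N_ACTIONS)

    def forward(self, obs, next_obs, action):
        phi, phi_next = self.encoder(obs), self.encoder(next_obs)
        a_onehot = F.one_hot(action, N_ACTIONS).float()

        # Surprise the agent can learn to reduce: forward prediction error.
        phi_next_pred = self.forward_model(torch.cat([phi, a_onehot], dim=-1))
        intrinsic_reward = 0.5 * (phi_next_pred - phi_next.detach()).pow(2).sum(-1)

        # Inverse loss keeps the feature space focused on controllable changes.
        action_logits = self.inverse_model(torch.cat([phi, phi_next], dim=-1))
        inverse_loss = F.cross_entropy(action_logits, action)

        return intrinsic_reward, intrinsic_reward.mean() + inverse_loss

if __name__ == "__main__":
    icm = ICMSketch()
    obs = torch.randn(8, OBS_DIM)
    next_obs = torch.randn(8, OBS_DIM)
    action = torch.randint(0, N_ACTIONS, (8,))
    bonus, loss = icm(obs, next_obs, action)
    print(bonus.shape, loss.item())
```

The inverse model is what ties the bonus to “novel changes the agent has control over”: features that carry no information about the agent’s own action get de-emphasized, so uncontrollable noise contributes less to the surprise signal.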
Meditation changes where we draw the boundary between the agent and the environment. The no-self insight lets you view thoughts as external things arising outside of your control. The impermanence insight lets you view even more things as outside your control.
These two changes in perspective mean that the agent no longer experiences negative reward for states it now believes it has no control over. It can also reward-hack its own thoughts, since they are now “external” and therefore targets of exploration rewards. Previously it could only learn patterns of thought with reference to some external goal; now it can learn a pattern of thought directly.
Disclaimer: world models and curiosity-driven exploration are at an early stage, and probably correspond poorly to how our brains work. There are quite a few unsolved problems, like the noisy TV problem.