I don’t think that paper is an example of mesa optimization, because the policy could be implementing a very simple heuristic to solve the task, something like:
Pick the image that led to the highest reward in the last 10 timesteps, with 90% probability.
Pick an image at random, with 10% probability.
So the policy doesn’t have to have any of the properties of a mesa optimizer, like considering possible actions and evaluating them with a utility function, etc.
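Here is a minimal sketch of the kind of heuristic I mean (the function name, window size, and exploration probability are just illustrative, not anything from the paper):

```python
import random

def heuristic_policy(history, images, window=10, epsilon=0.1):
    """Toy 'play-the-winner' heuristic.

    history: list of (image, reward) pairs from previous timesteps.
    images:  the images available to pick from this timestep.
    """
    recent = history[-window:]
    if recent and random.random() > epsilon:
        # Pick the image that led to the highest reward in the recent window.
        return max(recent, key=lambda pair: pair[1])[0]
    # Otherwise pick an image at random.
    return random.choice(images)
```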
Whenever an RL agent is trained in a partially observed environment, it has to take actions to learn about parts of its environment that it hasn’t observed yet or that may have changed. The difference with this paper is that the observation it gets from the environment happens to include the reward the agent received in the previous timestep. However, as far as the policy is concerned, the reward it gets as input is just another component of the state, so the fact that the policy gets the previous reward as input doesn’t make it stand out compared to any other partially observed environment.
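To make that concrete: feeding back the last reward is just an observation augmentation. A gym-style wrapper along these lines (a sketch assuming a flat observation vector and the old 4-tuple step API, not the paper’s actual code) shows there is nothing special about the reward channel:

```python
import numpy as np

class PrevRewardWrapper:
    """Append the previous reward to the observation; the policy just sees a
    slightly larger state vector, with no special treatment of the reward."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        # No reward has been received yet, so append 0.
        return np.concatenate([self.env.reset(), [0.0]])

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # The next action is chosen from this observation, so the reward just
        # received is the "previous reward" from the policy's point of view.
        return np.concatenate([obs, [reward]]), reward, done, info
```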
The argument that these and other meta-RL researchers usually make is that (as indicated by the various neurons which fluctuate, and I think based on some other parts of their experiments which I would have to reread to list) what these RNNs are learning is not just a simple play-the-winner heuristic (which is suboptimal, and your suggestion would require only 1 neuron to track the winning arm) but amortized Bayesian inference where the internal dynamics are learning the sufficient statistics of the Bayes-optimal solution to the POMDP (where you’re unsure which of a large family of MDPs you’re in): “Meta-learning of Sequential Strategies”, Ortega et al 2019; “Reinforcement Learning, Fast and Slow”, Botvinick et al 2019; “Meta-learners’ learning dynamics are unlike learners’”, Rabinowitz 2019; “Bayesian Reinforcement Learning: A Survey”, Ghavamzadeh et al 2016, are some of the papers that come to mind. Then you can have a fairly simple decision rule using that as the input (eg Figure 4 of Ortega on a coin-flipping example, which is a setup near & dear to my heart).
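(To make the “sufficient statistics plus a simple decision rule” picture concrete with a toy example: for a Bernoulli bandit, the sufficient statistics are just per-arm success/failure counts, i.e. the Beta posterior parameters. The sketch below pairs those counts with Thompson sampling as the decision rule; the fully Bayes-optimal rule would use the same counts with a fancier index, and none of this is code from the papers above, just an illustration. The point is that two numbers per arm carry everything the posterior needs, which is already more than a single which-arm-won-last register.)

```python
import numpy as np

def bernoulli_bandit(pull_arm, n_arms, n_steps, rng=None):
    """Track the Beta-posterior sufficient statistics per arm and act on them
    with a simple decision rule (Thompson sampling)."""
    rng = rng or np.random.default_rng()
    successes = np.ones(n_arms)  # Beta(1, 1) uniform prior
    failures = np.ones(n_arms)
    for _ in range(n_steps):
        # Decision rule: sample each arm's success probability from its
        # posterior and play the arm with the highest sample.
        arm = int(np.argmax(rng.beta(successes, failures)))
        reward = pull_arm(arm)  # assumed to return 0 or 1
        successes[arm] += reward
        failures[arm] += 1 - reward
    return successes, failures
```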
To reuse a quote from my backstop essay: as Duff 2002 puts it,

“One way of thinking about the computational procedures that I later propose is that they perform an offline computation of an online, adaptive machine. One may regard the process of approximating an optimal policy for the Markov decision process defined over hyper-states as ‘compiling’ an optimal learning strategy, which can then be ‘loaded’ into an agent.”
Going partly off of your comment, I made some remarks in a post: https://www.alignmentforum.org/posts/WmBukJkEFM72Xr397/mesa-search-vs-mesa-control