I don’t think that paper is an example of mesa optimization, because the policy could be implementing a very simple heuristic to solve the task, something like:
Pick the image that led to the highest reward in the last 10 timesteps, with 90% probability.
Pick an image at random, with 10% probability.
So the policy doesn’t have to have any of the properties of a mesa optimizer, like considering possible actions and evaluating them with a utility function, etc.
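Here is a minimal sketch of the kind of heuristic I mean (the function name, window size, and exploration probability are just illustrative, not anything from the paper):

```python
import random

def heuristic_policy(history, images, window=10, epsilon=0.1):
    """Toy 'play-the-winner' heuristic.

    history: list of (image, reward) pairs from previous timesteps.
    images:  the images available to pick from this timestep.
    """
    recent = history[-window:]
    if recent and random.random() > epsilon:
        # Pick the image that led to the highest reward in the recent window.
        return max(recent, key=lambda pair: pair[1])[0]
    # Otherwise pick an image at random.
    return random.choice(images)
```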
Whenever an RL agent is trained in a partially observed environment, it has to take actions to learn about parts of its environment that it hasn’t observed yet or that may have changed. The difference with this paper is that the observation it gets from the environment happens to include the reward the agent received in the previous timestep. However, as far as the policy is concerned, the reward it gets as input is just another component of the state, so the fact that the policy gets the previous reward as input doesn’t make it stand out compared to any other partially observed environment.
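To make that concrete: feeding back the last reward is just an observation augmentation. A gym-style wrapper along these lines (a sketch assuming a flat observation vector and the old 4-tuple step API, not the paper’s actual code) shows there is nothing special about the reward channel:

```python
import numpy as np

class PrevRewardWrapper:
    """Append the previous reward to the observation; the policy just sees a
    slightly larger state vector, with no special treatment of the reward."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        # No reward has been received yet, so append 0.
        return np.concatenate([self.env.reset(), [0.0]])

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # The next action is chosen from this observation, so the reward just
        # received is the "previous reward" from the policy's point of view.
        return np.concatenate([obs, [reward]]), reward, done, info
```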
The argument that these and other meta-RL researchers usually make is that (as indicated by the various neurons which fluctuate, and I think based on some other parts of their experiments which I would have to reread to list) what these RNNs are learning is not just a simple play-the-winner heuristic (which is suboptimal, and your suggestion would require only 1 neuron to track the winning arm) but amortized Bayesian inference where the internal dynamics are learning the sufficient statistics of the Bayes-optimal solution to the POMDP (where you’re unsure which of a large family of MDPs you’re in): “Meta-learning of Sequential Strategies”, Ortega et al 2019; “Reinforcement Learning, Fast and Slow”, Botvinick et al 2019; “Meta-learners’ learning dynamics are unlike learners’”, Rabinowitz 2019; “Bayesian Reinforcement Learning: A Survey”, Ghavamzadeh et al 2016, are some of the papers that come to mind. Then you can have a fairly simple decision rule using that as the input (eg Figure 4 of Ortega on a coin-flipping example, which is a setup near & dear to my heart).
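(To make the “sufficient statistics plus a simple decision rule” picture concrete with a toy example: for a Bernoulli bandit, the sufficient statistics are just per-arm success/failure counts, i.e. the Beta posterior parameters. The sketch below pairs those counts with Thompson sampling as the decision rule; the fully Bayes-optimal rule would use the same counts with a fancier index, and none of this is code from the papers above, just an illustration. The point is that two numbers per arm carry everything the posterior needs, which is already more than a single which-arm-won-last register.)

```python
import numpy as np

def bernoulli_bandit(pull_arm, n_arms, n_steps, rng=None):
    """Track the Beta-posterior sufficient statistics per arm and act on them
    with a simple decision rule (Thompson sampling)."""
    rng = rng or np.random.default_rng()
    successes = np.ones(n_arms)  # Beta(1, 1) uniform prior
    failures = np.ones(n_arms)
    for _ in range(n_steps):
        # Decision rule: sample each arm's success probability from its
        # posterior and play the arm with the highest sample.
        arm = int(np.argmax(rng.beta(successes, failures)))
        reward = pull_arm(arm)  # assumed to return 0 or 1
        successes[arm] += reward
        failures[arm] += 1 - reward
    return successes, failures
```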
To reuse a quote from my backstop essay: as Duff 2002 puts it,

“One way of thinking about the computational procedures that I later propose is that they perform an offline computation of an online, adaptive machine. One may regard the process of approximating an optimal policy for the Markov decision process defined over hyper-states as ‘compiling’ an optimal learning strategy, which can then be ‘loaded’ into an agent.”
Going partly off of your comment, I made some remarks in a post: https://www.alignmentforum.org/posts/WmBukJkEFM72Xr397/mesa-search-vs-mesa-control