Humans and systems produced by meta-learning both do reasonably well at learning, and neither does “search” (depending on how loose you are with your definition of “search”).
Part of what inspired me to write my comment was watching my kid play logic puzzles. When she starts a new game, she has to do a lot of random trial-and-error with backtracking, much like MCTS. (She does the trial-and-error on the physical game board, but when I play I often just do it in my head.) Then her intuition builds up and she can start to recognize solutions earlier and earlier in the search tree, sometimes even immediately upon starting a new puzzle level. Then the game gets harder (the puzzle levels slowly increase in difficulty) or moves to a new regime where her intuitions don’t work, and she has to do more trial-and-error again, and so on. This sure seems like “search” to me.
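To make the analogy concrete, the loop she's running looks roughly like this toy sketch: depth-first trial-and-error with backtracking, plus an “intuition” that, once trained, recognizes dead ends and prunes the tree earlier and earlier. (The puzzle interface here, `legal_moves`/`apply`/`undo`/`is_solved`, is hypothetical, not any particular game.)

```python
def solve(state, intuition=None):
    """Depth-first trial-and-error with backtracking.

    `intuition`, if given, scores states in [0, 1]; branches it rejects
    are pruned, so a well-trained intuition cuts the search off earlier.
    """
    if state.is_solved():
        return []
    for move in state.legal_moves():
        state.apply(move)
        if intuition is None or intuition(state) > 0.1:  # prune hopeless branches
            rest = solve(state, intuition)
            if rest is not None:
                state.undo(move)
                return [move] + rest
        state.undo(move)  # backtrack and try the next move
    return None  # dead end; the caller backtracks
```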
FWIW, on the original point, even standard machine learning algorithms (not the resulting models) don’t seem like “search” to me, though they also aren’t just a bag of heuristics, and they do have a clearly delineated objective, so they fit well enough into the mesa-optimization story.
This really confuses me. Maybe with some forms of supervised learning you can either calculate the solution directly or just follow a gradient (whether that counts as “search” is arguable), but with RL, surely the “explore” steps have to count as “search”? Do you have a different kind of thing in mind when you think of “search”?
I agree that if you have a model of the system (as you do when you know the rules of the game), you can simulate potential actions and consequences, and that seems like search.
Usually, you don’t have a good model of the system, and then you need something else.
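To spell out the first case: with a model, “search” can be as simple as simulating each candidate action a few steps ahead and picking the best predicted outcome. A toy sketch, where `model(state, action)` and `reward(state)` are hypothetical stand-ins for the known rules:

```python
# Toy sketch of model-based lookahead: given a model of the rules, you can
# search over imagined futures instead of acting in the real system.
# `model(state, action) -> next_state` and `reward(state)` are hypothetical.

def best_value(state, model, reward, actions, depth):
    """Best reward estimate reachable within `depth` simulated steps."""
    if depth == 0:
        return reward(state)
    return max(
        best_value(model(state, a), model, reward, actions, depth - 1)
        for a in actions
    )

def plan(state, model, reward, actions, depth=3):
    """Pick the action whose simulated consequences look best."""
    return max(
        actions,
        key=lambda a: best_value(model(state, a), model, reward, actions, depth - 1),
    )
```

Without `model`, none of this is available, which is where the “something else” comes in.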
Maybe with some forms of supervised learning you can either calculate the solution directly or just follow a gradient (whether that counts as “search” is arguable), but with RL, surely the “explore” steps have to count as “search”?
I was thinking of following a gradient in supervised learning.
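To be explicit about why that doesn’t feel like search to me: there is no branching and no backtracking, just one deterministic trajectory downhill. A minimal sketch with a toy linear model and made-up data:

```python
import numpy as np

# Toy sketch: plain gradient descent on a linear model with squared loss.
# No branching, no backtracking: one deterministic trajectory downhill,
# which is why it doesn't obviously look like "search".

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # made-up inputs
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)   # made-up targets

w = np.zeros(3)
lr = 0.1
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
    w -= lr * grad                         # one step downhill; no alternatives tried

print(w)  # should end up close to w_true
```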
I agree that pure reinforcement learning with a sparse reward looks like search. I doubt that pure RL with a sparse reward is going to get you very far.
Reinforcement learning with demonstrations or a very dense reward doesn’t really look like search; it looks more like someone telling you what to do and you following the instructions faithfully.
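A toy sketch of the contrast I mean (a made-up three-armed bandit): the epsilon-greedy learner’s “explore” steps deliberately try alternatives and keep what works, while the learner-from-demonstrations never tries anything; it just copies what it was shown.

```python
import numpy as np

rng = np.random.default_rng(0)
true_values = np.array([0.1, 0.5, 0.9])  # made-up arm payoffs

# Sparse-reward-style exploration (epsilon-greedy bandit): the "explore"
# steps deliberately try alternatives, which is what makes it feel like search.
estimates, counts = np.zeros(3), np.zeros(3)
for _ in range(1000):
    arm = rng.integers(3) if rng.random() < 0.1 else int(np.argmax(estimates))
    r = rng.normal(true_values[arm], 0.1)
    counts[arm] += 1
    estimates[arm] += (r - estimates[arm]) / counts[arm]  # running mean

# Learning from demonstrations: no alternatives are tried at all; the
# learner just copies the demonstrated action, like following instructions.
demos = [2] * 1000                         # an expert repeatedly shows arm 2
policy = int(np.bincount(demos).argmax())  # imitate the most common action
```

The second learner ends up with a fine policy without ever doing anything that looks like search.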