MCTS works as amplification because you can evaluate future board positions to get a convergent estimate of how well you’re doing, and then eventually someone actually wins the game, which keeps the learned policy p from departing reality entirely. Importantly, the single thing you’re learning can also play the role of the environment, by picking the opponent’s moves.
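(To make that concrete, here’s a minimal toy sketch, not AGZ itself: a Nim-like game and a tabular value “network” stand in for Go and the real model, and the search is collapsed to greedy one-step lookahead rather than full MCTS. The point is just that the same learned thing picks moves for both sides, and the terminal result is what the value estimates get pulled toward.)

```python
import random
from collections import defaultdict

# Toy game: players alternately remove 1-3 stones; taking the last stone wins.
def legal_moves(stones):
    return [m for m in (1, 2, 3) if m <= stones]

value = defaultdict(float)   # learned value per position (stand-in for a value head)
counts = defaultdict(int)

def self_play_episode(stones=10):
    trajectory = []          # (position, player-to-move) pairs
    player = 0
    while stones > 0:
        moves = legal_moves(stones)
        # The same learned model picks the move for whichever side is to act:
        # prefer the move that leaves the opponent in the worst position.
        move = max(moves, key=lambda m: -value[stones - m])
        if random.random() < 0.2:          # a little exploration noise
            move = random.choice(moves)
        trajectory.append((stones, player))
        stones -= move
        player = 1 - player
    winner = 1 - player                    # whoever took the last stone wins
    # The actual game outcome is the grounding signal: pull value estimates toward it.
    for pos, who in trajectory:
        z = 1.0 if who == winner else -1.0
        counts[pos] += 1
        value[pos] += (z - value[pos]) / counts[pos]
    return winner

for _ in range(2000):
    self_play_episode()
```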
In trying to train A to predict human actions given access to A, you’re almost doing something similar. You have a prediction that’s also supposed to be a prediction of the environment (the human), so you can use it for both sides of a tree search. But A isn’t actually searching through an interesting tree—it’s searching for cycles of length 1 in its own model of the environment, with no particular guarantee that any cycles of length 1 exist or are a good idea. “Tree search” in this context (I think) means spraying out a bunch of outputs and hoping at least one falls into a fixed point upon iteration.
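(Here’s roughly what I mean by that last sentence, as a hypothetical sketch: `model` is a stand-in for A’s prediction of the environment’s response, and we just sample candidates, iterate, and keep anything that maps approximately to itself. Nothing guarantees the returned list is non-empty.)

```python
import numpy as np

def find_fixed_points(model, n_candidates=1000, n_iters=50, tol=1e-3, dim=4, seed=0):
    rng = np.random.default_rng(seed)
    fixed_points = []
    for _ in range(n_candidates):
        x = rng.normal(size=dim)            # "spray out" a candidate output
        for _ in range(n_iters):
            x = model(x)                    # iterate A's model of the environment
        if np.linalg.norm(model(x) - x) < tol:
            fixed_points.append(x)          # an (approximate) cycle of length 1
    return fixed_points                     # may be empty: nothing says any exist

# Example with a contraction, which does have a fixed point (x = 2 in every coordinate).
example_model = lambda x: 0.5 * x + 1.0
print(find_fixed_points(example_model)[0])
```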
EDIT: Big oops, I didn’t actually understand what was being talked about here.
I agree there is a real sense in which AGZ is “better-grounded” (and more likely to be stable) than iterated amplification in general. (This was some of the motivation for the experiments here.)
Oh, I’ve just realized that the “tree” was always intended to be something like task decomposition. Sorry about that—that makes the analogy a lot tighter.
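(Spelled out as a hypothetical sketch, with `decompose`, `recombine`, and `A` as made-up stand-ins: the “tree” is a question split into subquestions, each answered by the same model, possibly after further decomposition, and then recombined.)

```python
def decompose(question):
    mid = len(question) // 2
    return [question[:mid], question[mid:]]           # toy split into two subquestions

def recombine(question, subanswers):
    return " + ".join(subanswers)                     # toy recombination of subanswers

def A(question):
    return f"<A's answer to {question!r}>"            # the distilled model answers leaves

def amplify(question, depth=2):
    if depth == 0:
        return A(question)
    return recombine(question, [amplify(q, depth - 1) for q in decompose(question)])

print(amplify("How should the agent respond to X?"))
```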
Isn’t A also grounded in reality, since the recursion eventually bottoms out with no A to consult?
This is true when getting training data, but I think it’s a difference between A (or HCH) and AlphaGo Zero when doing simulation / amplification. Someone wins a simulated game of Go even if both players are making bad moves (or even random moves), which gives you a signal that A doesn’t have access to.
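(A toy illustration of that, using the same made-up Nim-like rules as the earlier sketch: even under completely random play the game terminates and hands back a definite winner, so there is always some grounded signal.)

```python
import random

def random_rollout(stones=10):
    # Toy rules: remove 1-3 stones, taking the last stone wins; moves are uniform random.
    player = 0
    while stones > 0:
        stones -= random.choice([m for m in (1, 2, 3) if m <= stones])
        player = 1 - player
    return 1 - player            # the game always ends with a definite winner

results = [random_rollout() for _ in range(1000)]
print("player 0 win rate under random play:", results.count(0) / len(results))
```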