paulfchristiano comments on AlphaStar: Impressive for RL progress, not for AGI progress

paulfchristiano 3 Nov 2019 3:55 UTC
17 points
I meant to ask about the policy network in AlphaZero directly. It plays at the professional level (the Nature paper puts it at a comparable Elo to Fan Hui) with no tree search, using a standard neural network architecture trained by supervised learning. It performs fine on parts of the search tree that never appeared during training. What distinguishes this kind of reasoning from “if I see X, I do Y”?
(ETA clarification, because I think this was probably the misunderstanding: the policy network plays Go with no tree search; tree search is only used to generate training data. That suggests the AlphaStar algorithm would produce similar behavior without ever using tree search, probably using at most 100x the compute of AlphaZero, and I’d be willing to bet on <10x.)
From the outside, it looks like human-level play at StarCraft is more complicated (in a sense) than human-level play at Go, and so it’s going to take bigger models to reach a similar level of performance. I don’t see a plausible-looking distinction-in-principle that separates strategy in StarCraft from strategy in Go.
- nostalgebraist 3 Nov 2019 4:24 UTC
  8 points
  IIUC the distinction being made is about the training data, granted the assumption that you may be able to distill tree-search-like abilities into a standard NN with supervised learning if you have samples from tree search available as supervision targets in the first place.
  AGZ was hooked up to a tree search in its training procedure, so its training signal allowed it to learn not just from the game trees it “really experienced” during self-play episodes but also (in a less direct way) from the much larger pool of game trees it “imagined” while searching for its next move during those same episodes. The former is always (definitionally) available in self-play, but the latter is only available if tree search is feasible.
  - paulfchristiano 3 Nov 2019 4:32 UTC
    10 points
    But to be clear, (i) it would then also be learned by imitating a large enough dataset from human players who did something like tree search internally while playing, (ii) I think the tree search makes a quantitative not qualitative change, and it’s not that big (mostly improves stability, and *maybe* a 10x speedup, over self-play).
    - nostalgebraist 3 Nov 2019 6:00 UTC
      9 points
      I don’t see how (i) follows? The advantage of (internal) tree search during training is precisely that it constrains you to respond sensibly to situations that are normally very rare (but are easily analyzable once they come up), e.g. “cheap win” strategies that are easily defeated by serious players and hence never come up in serious play.
      - paulfchristiano 15 Nov 2019 3:38 UTC
        1 point
        AGZ is only trained on the situations that actually arise in games it plays.
        I agree with the point that “imitation learning from human games” will only make you play well on kinds of situations that arise in human games, and that self-play can do better by making you play well on a broader set of situations. You could also train on all the situations that arise in a bigger tree search (though AGZ did not) or against somewhat-random moves (which AGZ probably did).
        (Though I don’t see this as affecting the basic point.)
    - SoerenMind 10 Nov 2019 8:28 UTC
      7 points
      Why just a 10x speedup over model-free RL? I would’ve expected much more.