But to be clear: (i) it would then also be learned by imitating a large enough dataset from human players who did something like tree search internally while playing, and (ii) I think the tree search makes a quantitative, not qualitative, change, and it’s not that big (mostly improved stability, and *maybe* a 10x speedup, over self-play).
I don’t see how (i) follows? The advantage of (internal) tree search during training is precisely that it constrains you to respond sensibly to situations that are normally very rare (but are easily analyzable once they come up), e.g. “cheap win” strategies that are easily defeated by serious players and hence never come up in serious play.
AGZ is only trained on the situations that actually arise in games it plays.
I agree with the point that “imitation learning from human games” will only make you play well on the kinds of situations that arise in human games, and that self-play can do better by making you play well on a broader set of situations. You could also train on all the situations that arise in a bigger tree search (though AGZ did not) or against somewhat-random moves (which AGZ probably did).
(Though I don’t see this as affecting the basic point.)
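To make the distinction concrete, here is a minimal, hypothetical sketch of where the training positions come from under the three regimes mentioned above: imitation from human games, AGZ-style self-play, and the broader “train on everything the search expands” option. None of these function names come from the AGZ paper, and the game logic is stubbed out just so the sketch runs.

```python
import random

def random_position():
    # Stub for a board state; a real implementation would encode the board.
    return tuple(random.choice("._XO") for _ in range(9))

def play_game(policy=None, eps=0.0, length=10):
    # Stub trajectory: the positions actually visited in one game.
    # `policy` and `eps` are unused placeholders for the real arguments.
    return [random_position() for _ in range(length)]

def search_expansions(position, width=5):
    # Stub for the extra positions an MCTS-style search would expand around
    # `position`, including branches that never get played out on the board.
    return [random_position() for _ in range(width)]

# (1) Imitation: train only on positions that occurred in human games.
human_games = [play_game() for _ in range(100)]
imitation_data = [pos for game in human_games for pos in game]

# (2) AGZ-style self-play: train on positions the current policy actually
#     reaches, possibly broadened a little by occasional random moves (eps > 0).
selfplay_data = [pos for _ in range(100) for pos in play_game(eps=0.1)]

# (3) The broader option (not what AGZ did): also train on every position the
#     tree search expands, which includes rarely-played lines like "cheap win"
#     attempts that serious play never reaches.
search_data = [p for game in (play_game() for _ in range(100))
                 for pos in game
                 for p in search_expansions(pos)]

print(len(imitation_data), len(selfplay_data), len(search_data))
```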
Why just a 10x speedup over model-free RL? I would’ve expected much more.