The authors then develop their own method, Maia. They talk about it as a “modification of the AlphaZero architecture”, but as far as I can tell it is simply behavior cloning using the neural net architecture used by Leela. As you might expect, this does significantly better, and finally satisfies the property we would intuitively want: the best predictive model for a human of some skill level is the one that was trained on the data from humans at that skill level.
Yeah, I think that’s all they mean: the CNN and input/output are the same as Leela’s, which are the same as AlphaZero’s. But it does differ from behavioral cloning in that they stratify the samples—typically, behavior cloning dumps in all available expert samples (perhaps with a minimum cutoff rating, which is how AlphaGo filtered its KGS pretraining) and trains on them all equally.
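Concretely, the stratified setup amounts to something like this (a hypothetical sketch in Python, not their actual pipeline; `Game` and `train_move_predictor` are stand-ins for the Lichess data and the Leela-style supervised training loop, and the bucket bounds are only illustrative):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Game:
    player_elo: int            # rating of the player whose moves we imitate
    positions_and_moves: list  # (board state, move played) pairs

def train_stratified(all_games: List[Game],
                     train_move_predictor: Callable[[List[Game]], object]
                     ) -> Dict[Tuple[int, int], object]:
    # Nine 100-point rating bands (illustrative bounds), one cloned policy each.
    buckets = [(lo, lo + 99) for lo in range(1100, 2000, 100)]
    models = {}
    for lo, hi in buckets:
        subset = [g for g in all_games if lo <= g.player_elo <= hi]
        models[(lo, hi)] = train_move_predictor(subset)  # ordinary behavior cloning
    return models

# Classic behavior cloning would instead pool everything above a cutoff:
#   train_move_predictor([g for g in all_games if g.player_elo >= cutoff])
```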
Personally, I would’ve trained a single conditional model with a specified player-Elo for each move, instead of arbitrarily bucketing into 9 Elo ranges, but perhaps they have so many games that each bucket is enough (12m each, as they emphasize) and they preferred to keep it simple and spend data/compute instead of making the training & runtime more complicated.
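The conditional version is barely more machinery: feed the mover’s (normalized) Elo in as one extra input plane and train a single network on everything. A minimal PyTorch sketch, where the Leela-ish plane count and policy size are purely illustrative assumptions, not the paper’s architecture:

```python
import torch
import torch.nn as nn

class EloConditionedPolicy(nn.Module):
    """One move-prediction network for all skill levels, conditioned on the mover's Elo."""
    def __init__(self, board_planes: int = 112, channels: int = 64, n_moves: int = 1858):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(board_planes + 1, channels, 3, padding=1),  # +1 plane for Elo
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
        )
        self.policy_head = nn.Linear(channels * 8 * 8, n_moves)

    def forward(self, board: torch.Tensor, elo: torch.Tensor) -> torch.Tensor:
        # Broadcast the normalized rating as a constant 8x8 plane so the
        # convolutional trunk sees it at every square.
        elo_plane = ((elo - 1500.0) / 400.0).view(-1, 1, 1, 1).expand(-1, 1, 8, 8)
        x = self.trunk(torch.cat([board, elo_plane], dim=1))
        return self.policy_head(x.flatten(1))  # logits over candidate moves
```

Training is still just behavior cloning: cross-entropy between these logits and the move the human actually played, with that player’s exact rating passed in, so at inference you can dial in any Elo rather than picking one of nine buckets.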
But it does differ from behavioral cloning in that they stratify the samples
Fair point. In my ontology, “behavior cloning” is always with respect to some expert distribution, so I see the stratified samples as “several instances of behavior cloning with different expert distributions”, but that isn’t a particularly normal or accepted ontology.
Personally, I would’ve trained a single conditional model with a specified player-Elo for each move
Yeah, it does seem like this would have worked better—if nothing else, the predictions could be more precise (rather than specifying the bucket in which the current player falls, you can specify their exact Elo instead).