It’s a little less dramatic than that: the model-based simulated play is interleaved with the ground-truth environment. It’s more like you spend a year playing games in your head, then play one 30-second bullet chess game against Magnus Carlsen (made-up ratio), then go back to playing in your head for another year. Or maybe we should say, “you clone yourself a thousand times, have each pair of clones play one game at correspondence-chess timescales in a training montage, and then go back for a rematch”.
(The scenario where you play for 15 minutes at the beginning, and then pass a few subjective eons trying to master the game with no further real data, would correspond to sample-efficient “offline reinforcement learning” (review), e.g. https://arxiv.org/abs/2104.06294#deepmind / https://arxiv.org/abs/2111.05424 , https://arxiv.org/abs/2006.13888 . That is very important in its own right, of course, but it poses challenges of its own related to the quality of those 15 minutes: what if your 15-minute sample doesn’t include any games which happen to use en passant, say? How could you ever learn that it should be part of your model of chess? When you interleave and learn online, Carlsen might surprise you with an en passant a few games in, and then all your subsequent imagined games can include that possible move, and you can use it yourself.)
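To make the offline-vs-interleaved contrast concrete, here is a toy sketch in Python. It is not EfficientZero or any real offline-RL method. The "model" is just the set of move types seen so far in real games, and every name in it (`real_game`, `imagine`, `offline`, `interleaved`, and the choice of game 5 as the one with en passant) is made up for illustration.

```python
def real_game(i):
    """Hypothetical ground-truth opponent: en passant shows up only in real game 5."""
    moves = {"push", "capture"}
    if i == 5:
        moves.add("en_passant")
    return moves


def imagine(known_moves, n_games):
    """Stand-in for model-based self-play.

    Imagined games can only use moves already in the model, so they
    never add anything new to it. Returns the number of games imagined.
    """
    return n_games


def offline(n_real, n_imagined):
    """Collect all real data up front, then only imagine."""
    known = set()
    for i in range(n_real):
        known |= real_game(i)
    imagined = imagine(known, n_imagined)
    return known, imagined


def interleaved(n_rounds, imagined_per_round):
    """Alternate one real game with a batch of imagined games."""
    known = set()
    imagined = 0
    for i in range(n_rounds):
        known |= real_game(i)  # a little ground truth...
        imagined += imagine(known, imagined_per_round)  # ...then lots of imagination
    return known, imagined


known_off, n_off = offline(n_real=3, n_imagined=10_000)
known_on, n_on = interleaved(n_rounds=10, imagined_per_round=1_000)

print("en_passant" in known_off)  # False: never seen, so never imagined
print("en_passant" in known_on)   # True: seen in real game 5, usable afterwards
print(n_off == n_on)              # True: same imagination budget in both cases
```

Both learners get the same imagination budget; the only difference is that the interleaved one keeps touching the real environment, so a rare rule eventually lands in its model.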
But it would maybe deflate some of the broader conclusions that people are drawing from EfficientZero, like how AI compares to human brains…
I don’t think it does. If humans can’t learn efficiently by imagining hypothetical games like machines can, so much the worse for them. The goal is to win.