The interesting thing is that this is (in part) essentially a synthetic data generation pipeline for world models, when there is a game for which RL can train an agent with reasonable behavior. This way they get arbitrary amounts of data that's somewhat on-policy for reasonable play and carries the action labels needed for the resulting world model to respond reasonably to most possible actions.
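To make the pipeline concrete, here's a minimal sketch in Python (with a hypothetical gym-style `env` and trained `policy`, not the paper's actual code): the RL agent plays the game, and each frame is recorded together with the action taken. The occasional random action is my own addition for illustration, meant to cover behavior slightly off the agent's policy.

```python
import numpy as np

def collect_world_model_data(env, policy, num_steps, epsilon=0.1):
    """Roll out a trained RL policy and record (observation, action) pairs
    as action-labeled trajectories for training a world model."""
    trajectories, episode = [], []
    obs = env.reset()
    for _ in range(num_steps):
        if np.random.rand() < epsilon:
            action = env.action_space.sample()  # occasional off-policy action (assumption, not from the paper)
        else:
            action = policy.act(obs)            # on-policy "reasonable play"
        next_obs, reward, done, _ = env.step(action)
        episode.append((obs, action))           # frame plus its action label
        obs = next_obs
        if done:
            trajectories.append(episode)
            episode = []
            obs = env.reset()
    if episode:
        trajectories.append(episode)
    return trajectories

# The world model is then trained to predict the next frame from past frames
# and actions, e.g. minimizing loss(world_model(frames, actions), next_frame).
```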
They only use 128 TPUv5e chips for training; the v5e is the lite variant of TPUv5, with only about 200 BF16 teraFLOP/s per chip. So this is like 25 H100s, at a time when 100K-H100 training clusters are coming online. RL that can play video games already somewhat works for open-world survival-craft and factory-building games (see DeepMind's SIMA from March 2024), so the method can be applied to all of these games to build their world models. And a better world model, trained on all the games at once (and on YouTube), potentially lets model-based RL get really good sample efficiency in novel situations.
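The rough arithmetic behind the "25 H100s" comparison, assuming approximate published peaks of ~197 BF16 TFLOP/s per TPU v5e chip and ~989 dense BF16 TFLOP/s per H100:

```python
tpu_v5e_bf16_tflops = 197   # approximate per-chip peak
h100_bf16_tflops = 989      # approximate dense BF16 peak
num_tpus = 128

total_tflops = num_tpus * tpu_v5e_bf16_tflops      # ~25,200 TFLOP/s
h100_equivalent = total_tflops / h100_bf16_tflops  # ~25 H100s
print(round(h100_equivalent, 1))                   # ≈ 25.5
```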