I want to do a big, long, detailed explainer on the lineage of EfficientZero, which is fascinating, and the mutations it makes in that lineage. This is not that. But here’s my attempt at a quick ELI5, or maybe ELI12.
There are two broad flavors of reinforcement learning—where reinforcement learning is simply “learning to act in an environment to maximize a reward / learning to act to make a number go up.”
Model Free RL: This is the kind of execution algorithm you (sort of) execute when you’re keeping a bike upright.
When keeping a bike upright, you don’t form an image of the physics of the bike, and of how turning the handlebars a certain way will adjust the base of the bike relative to your center of gravity. Nor do you do this when learning to ride a bike. Instead, when learning, you do things, roughly guided by what you want to happen, and they cause you to fall off or not, in a very tight feedback loop, and eventually you learn to do the things that cause you not to fall off.
This kind of RL involves no detailed predictive model of the world’s dynamics. Instead it learns, broadly, a kind of mapping from observations to actions that accomplishes your goal without understanding how the world works.
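To make that concrete, here’s a minimal sketch of a classic model-free algorithm, tabular Q-learning. It isn’t the algorithm from any of the papers discussed below; it just illustrates the shape of the thing: the agent never predicts what the environment will do next, it only nudges a value table toward actions that happened to work. It assumes a small environment with discrete, hashable states and the older Gym-style `reset()` / `step()` API.

```python
# Minimal model-free RL sketch: tabular Q-learning.
# Assumes a small, older-Gym-style environment with discrete, hashable states.
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    q = defaultdict(float)  # maps (state, action) -> estimated return
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Mostly act greedily; sometimes act randomly (exploration).
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = max(range(env.action_space.n), key=lambda a: q[(state, a)])
            next_state, reward, done, _ = env.step(action)
            # Update the value estimate directly from experience. No model of
            # the environment's dynamics is ever learned or consulted.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in range(env.action_space.n))
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q
```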
Model Based RL: This is the kind of execution algorithm you (sort of) execute when programming fizzbuzz.
That is, you have a particular view of how code executes, and how the interpreter interacts with the code that you write. You form an image of what kind of output you’d like, and think of what you need to do to make the interpreter give you that output. Learning to write fizzbuzz involves repeatedly refining that view of how code runs until you can simply plan out the program using it.
So this kind of RL involves a detailed predictive model of the world. It learns that model, uses it to generate hypothetical scenarios, and lets you plan for the world you want by examining those scenarios.
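And here’s the model-based counterpart in the same spirit: a hedged sketch of the simplest possible planner, which “imagines” a handful of random action sequences with a predictive model and executes the first action of the best one. The `model.predict(state, action)` interface is hypothetical; the real systems discussed below use far cleverer search than this.

```python
# Minimal model-based planning sketch: random-shooting planning with a
# (learned or given) one-step model. `model.predict` is a hypothetical
# interface returning (imagined next state, imagined reward).
import random

def plan(model, state, action_space, horizon=5, n_candidates=64, gamma=0.99):
    best_return, best_first_action = float("-inf"), None
    for _ in range(n_candidates):
        actions = [random.choice(action_space) for _ in range(horizon)]
        s, total = state, 0.0
        for t, a in enumerate(actions):
            s, reward = model.predict(s, a)  # hypothetical scenario, rolled out in imagination
            total += (gamma ** t) * reward
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action
```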
(Note that in actual human activity, pretty much all action is a mix of the two; this is a simplification. And indeed many kinds of RL also mix the two.)
How does this relate to EfficientZero?
The original 2013 DeepMind Atari paper (DQN) used exclusively model-free RL. Many subsequent big, successful papers also used model-free RL: the Dota 2-playing agents from OpenAI used model-free RL; the AlphaStar StarCraft-playing agents from DeepMind used model-free RL; and even the recent open-ended learning paper from DeepMind used model-free RL.
And here you or any reader is probably like: doesn’t model-free RL leave out a biiiiig part of human cognition? Don’t I… like… sometimes… plan ahead, and think about how the world is, rather than acting on a sophisticated, semi-socially generated, generalizing algorithm that simply maps circumstances to actions? Don’t I form gears-level models of the world and act on them, at least sometimes?
And of course this is so. But—for a long time—our ability to make model-based algorithms work just wasn’t there, or wasn’t there as much. (Vast oversimplification—this has been an active area of research for years.)
And part of the problem with these model-free algorithms is that they can take tons of data to learn, probably because they were trying to solve, in an entirely model-free way, problems humans solve in an at least partly model-based way. The StarCraft agents had, I think, something like the equivalent of two centuries of experience playing StarCraft; their sample-efficiency is extremely bad. And even with all that experience, they were rather brittle: if you could get them into a sufficiently weird situation, one their mapping of observations to actions didn’t cover, they started to act poorly, because they had no model of what was going on to fall back on.
EfficientZero comes from a lineage of research that started with model-based execution, but where initially the model wasn’t learned. This lineage starts—if you have to choose a place to start—with AlphaGo.
AlphaGo was DeepMind’s first Go-playing agent. It, and its successor AlphaZero, plan ahead using a perfect model of the game being played. They use neural-network magic to prune the search, but fundamentally they plan (and learn) with a perfect, unlearned model of the game, which is of course easy to supply, because Go is perfectly deterministic.
MuZero is the big leap that followed AlphaGo: if EfficientZero leads to AGI, MuZero might be the paper the historians point back to. It replaced the pre-built model of the environment in AlphaGo / AlphaZero with a learned model. This allowed the same algorithm to play both Go and Atari, which was unprecedented. And the algorithm itself was (in my opinion) pretty elegant.
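For a sense of what “a learned model” means here, this is a rough sketch of the MuZero-style interface as I understand it from the paper: a representation network encodes the observation into a latent state, a dynamics network imagines the next latent state and reward, and prediction heads give a policy and value. The class and method names are mine, not the authors’, and the networks are toy-sized.

```python
# Rough sketch of a MuZero-style learned model (names and sizes are mine).
# Planning happens entirely in the learned latent space: the dynamics network
# stands in for the exact game rules that AlphaGo/AlphaZero relied on.
import torch
import torch.nn as nn

class LearnedModel(nn.Module):
    def __init__(self, obs_dim, n_actions, latent_dim=64):
        super().__init__()
        # h(obs) -> s: encode a real observation into a latent state.
        self.representation = nn.Sequential(nn.Linear(obs_dim, latent_dim), nn.ReLU())
        # g(s, a) -> (s', r): imagine the next latent state and reward.
        self.dynamics = nn.Linear(latent_dim + n_actions, latent_dim)
        self.reward_head = nn.Linear(latent_dim + n_actions, 1)
        # f(s) -> (policy, value): what to do, and how good things look, from s.
        self.policy_head = nn.Linear(latent_dim, n_actions)
        self.value_head = nn.Linear(latent_dim, 1)

    def initial_inference(self, obs):
        s = self.representation(obs)
        return s, self.policy_head(s), self.value_head(s)

    def recurrent_inference(self, s, action_onehot):
        x = torch.cat([s, action_onehot], dim=-1)
        next_s = self.dynamics(x)
        return next_s, self.reward_head(x), self.policy_head(next_s), self.value_head(next_s)
```

The search (MCTS, in MuZero’s case) then runs over `initial_inference` / `recurrent_inference` instead of over the real environment.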
EfficientZero (the new paper) is a further, reasonable extension of MuZero. It uses some tricks for learning a better model of the environment quickly, and for making that model more useful for long-term planning. These are interesting additions, and I want to cover them in depth, but it’s important to know that they aren’t anything fundamentally new; they’re more of the same kind of thing, not a Copernican revolution.
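As a taste of the kind of trick involved: my understanding is that one of EfficientZero’s additions is a self-supervised consistency loss, pushing the latent state the dynamics network imagines for the next step to agree with what the encoder produces from the real next observation. Here’s a hedged, simplified sketch of that idea, reusing the `LearnedModel` names from the sketch above; the actual paper uses a more elaborate SimSiam-style projector setup.

```python
# Simplified sketch of a temporal-consistency loss (not the paper's exact code).
import torch
import torch.nn.functional as F

def consistency_loss(model, obs_t, action_onehot, obs_tp1):
    s_t = model.representation(obs_t)
    predicted_s_tp1, _, _, _ = model.recurrent_inference(s_t, action_onehot)
    with torch.no_grad():  # the target branch gets no gradient
        target_s_tp1 = model.representation(obs_tp1)
    # Negative cosine similarity: minimized when the imagined and real latents agree.
    return -F.cosine_similarity(predicted_s_tp1, target_s_tp1, dim=-1).mean()
```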
But the upshot (finally) is that model-based RL can now do better (on average) than a human on the Atari testbed after only about 2 hours of real-time play. That is incredibly fast for an RL agent. This sample-efficiency is better than that of any model-free agent on the Atari testbed, despite the fact that (I think) much more research effort has gone into those model-free agents.
And it’s with an algorithm that people fundamentally expect to do better at difficult planning tasks than any model-free algorithm. I.e., we have priors that actual intelligence probably uses something vaguely like this. (There’s also reason to think, imho, that existing implementations of exploration techniques, e.g. curiosity, would mesh really well with this algorithm, which is alarming.)
If, in the next 18 months, someone tries applying something like EfficientZero to a big RL problem like StarCraft, or Minecraft, or the DeepMind open-ended learning tasks, and finds an increase in sample-efficiency similar to the one observed on Atari, then that’s another blow against long AGI timelines. If I worked at OpenAI or DeepMind, I know that’s exactly what I’d be doing.
TLDR: Reinforcement learning algorithms often learn realllly slowly, in terms of needing a lot of experience, or learn really stupidly, in terms of relying on model-free RL. EfficientZero learns quickly and with model-based RL. Will it scale? If so, long-term AGI projections take another hit.