Some basic questions in case anyone knows and wants to help me out:
1. Is this a single neural net that can play all the Atari games well, or a different net for each game?
2. How much compute was spent on training?
3. How many parameters?
4. Would something like this work for e.g. controlling a robot using only a few hundred hours of training data? If not, why not?
5. What is the update / implication of this, in your opinion?
(I did skim the paper and use the search bar, but was unable to answer these questions myself, probably due to lack of expertise)
Different networks for each game
They train for 220k steps for each agent and mention that 100k steps take 7 hours on 4 GPUs (no mention of which GPUs, but maybe an RTX 3090 would be a good guess?)
They don’t mention it
They are explicitly motivated by robotics control, so yes, they expect this to help in that direction. I think the main problem is that robotics requires more complicated reward-shaping to obtain desired behaviour. In Atari the reward is already computed for you and you just need to maximise it; when designing a robot to put dishes in a dishwasher, the rewards need to be crafted by humans. Going from “Desired Behavior → Rewards for RL” is harder than “Rewards for RL → Desired Behavior”.
I am somewhat surprised by the simplicity of the 3 methods described in the paper; I update towards “dumb and easy improvements over current methods can lead to drastic changes in performance”.
As far as I can see, their improvements are:
Learn the environment dynamics by self-supervision instead of relying only on reward signals.
Meaning that they don’t learn the dynamics end-to-end like in MuZero: for them, the loss function for the environment dynamics is completely separate from the RL loss function. (I was wrong; they in fact add a similarity loss to the loss function of MuZero that provides extra supervision for learning the dynamics, but gradients from rewards still reach the dynamics and representation networks.)

Instead of having the dynamics model predict each future reward, have it predict the reward accumulated over a time window (what they call the “value prefix”). This means that the model doesn’t need to get the timing of the reward *exactly* right to get a good loss, and so lets the model have a conception of “a reward is coming sometime soon, but I don’t quite know exactly when”.
As training progresses, the old trajectories sampled with an earlier policy are no longer very useful to the current model, so as each stored trajectory gets older, they replace it in memory with a model-predicted continuation. I guess it’s like replacing your ten-year-old memories with imagined “what would I have done” sequences, and the older the memories, the more of them you replace with your imagined decisions. (A rough sketch of how these pieces fit together follows below.)
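To make that concrete, here is a minimal PyTorch-flavoured sketch of how I picture the first two pieces showing up in a MuZero-style unroll. All names are hypothetical; MCTS targets, loss weights and the paper’s actual value-prefix head (which I believe is an LSTM) are glossed over, and the third piece only appears as a comment because it happens at data-preparation time.

```python
import torch
import torch.nn.functional as F

def unroll_losses(H, G, prefix_head, obs, actions, rewards, K=5):
    """Sketch of the self-supervised consistency + value-prefix losses.

    H: representation network, pixels -> latent state
    G: dynamics network, (latent state, action) -> next latent state
    prefix_head: predicts the reward accumulated since the unroll started
    obs: o_0..o_K, actions: a_0..a_{K-1}, rewards: r_0..r_{K-1} (batched tensors)
    """
    s = H(obs[0])
    consistency_loss = 0.0
    prefix_loss = 0.0
    prefix_target = torch.zeros_like(rewards[0])
    for k in range(K):
        s = G(s, actions[k])  # imagined latent state at step k+1
        # (1) similarity loss: the imagined latent should match the encoding
        #     of the observation that actually happened next
        target = H(obs[k + 1]).detach()
        consistency_loss = consistency_loss - F.cosine_similarity(
            s.flatten(1), target.flatten(1), dim=1).mean()
        # (2) value prefix: predict the reward accumulated so far in the window,
        #     rather than the exact reward at each individual step
        prefix_target = prefix_target + rewards[k]
        prefix_loss = prefix_loss + F.mse_loss(prefix_head(s).squeeze(-1), prefix_target)
    # (3) is not a loss term: as stored trajectories age, (part of) their content
    #     is replaced with model-predicted continuations, as described above.
    return consistency_loss + prefix_loss
```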
Holy cow, am I reading that right? An RTX 3090 costs, like, $2000. So they were able to train this whole thing for about one day’s worth of effort using equipment that cost less than $10K in total? That means there’s loads of room to scale this up… It means that they could (say) train a version of this architecture with 1000x more parameters and 100x more training data for about $10M and 100 days. Right?
You’re missing a factor for the number of agents trained (one for each Atari game), so in fact this should correspond to about one month of training for the whole game library. More if you want to run each game with multiple random seeds to get good statistics, as you would if you’re publishing a paper. But yeah, for a single task like protein folding or some other crucial RL task that only runs once, this could easily be scaled up a lot with GPT-3-scale money.
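Rough arithmetic, taking the quoted numbers at face value and assuming the 26 games of the Atari 100k benchmark (that game count and the linear scaling are my assumptions):

```python
# Back-of-the-envelope GPU time from the figures quoted in this thread.
gpus = 4
hours_per_100k_steps = 7          # "100k steps take 7 hours on 4 GPUs"
steps_per_agent = 220_000         # "220k steps for each agent"
games = 26                        # Atari 100k benchmark size (assumption)

gpu_hours_per_agent = gpus * hours_per_100k_steps * steps_per_agent / 100_000
days_for_all_games = gpu_hours_per_agent * games / gpus / 24
print(gpu_hours_per_agent)   # ~62 GPU-hours per game
print(days_for_all_games)    # ~17 days on one 4-GPU machine, before multiplying by seeds
```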
Ah right, thanks!
How well do you think it would generalize? Like, say we made it 1000x bigger and trained it on 100x more training data, but instead of 1 game for 100x longer it was 100 games? Would it be able to do all the games? Would it be better or worse than models specialized to particular games, of similar size and architecture and training data length?
(Fair warning: I’m definitely in the “amateur” category here. Usual caveats apply—using incorrect terminology, etc, etc. Feel free to correct me.)
> they in fact add a similarity loss to the loss function of MuZero that provides extra supervision
How do they prevent noise traps? That is, picture a maze with featureless grey walls except for a wall that displays TV static. Black and white noise. Most of the rest of the maze looks very similar—but said wall is always very different. The agent ends up penalized for not sitting and watching the TV forever (or encouraged to sit and watch the TV forever. Same difference up to a constant...)
Ah, I understand your confusion: the similarity loss they add to the RL loss function has nothing to do with exploration. It’s not meant to encourage the network to explore less “similar” states, and so is not affected by the noisy-TV problem.
The similarity loss refers to the fact that they are training a “representation network” that takes the raw pixels $o_t$ and produces a vector that they take as their state, $s_t = H(o_t)$. They also train a “dynamics network” that predicts $s_{t+1}$ from $s_t$ and $a_t$: $\hat{s}_{t+1} = G(s_t, a_t)$. In MuZero these networks are trained directly from rewards, yet this is a very noisy and sparse signal for them. The authors reason that they need to provide some extra supervision to train these better. What they do is add a loss term that tells the networks that $G(H(o_t), a_t)$ should be very close to $H(o_{t+1})$. In effect they want the predicted next state $\hat{s}_{t+1}$ to be very similar to the state $s_{t+1}$ that in fact occurs. This is a rather …obvious… addition: of course your prediction network should produce predictions that actually occur. This provides additional gradient signal to train the representation and dynamics networks, and is what the similarity loss refers to.
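In symbols, the added term is just something like the following, with $d$ standing for whatever distance/similarity measure they use (I believe the actual implementation is SimSiam-style, with projection heads and a stop-gradient on the target branch, but that’s a detail):

$$
L_{\text{similarity}} \;=\; d\Big( G\big(H(o_t), a_t\big),\; H(o_{t+1}) \Big)
$$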
Thank you for the explanation!
> What they do is add a loss term that tells the networks that $G(H(o_t), a_t)$ should be very close to $H(o_{t+1})$.

I am confused. Training your “dynamics network” as a predictor is precisely training that $G(H(o_t), a_t)$ should be very close to $H(o_{t+1})$. (Or rather, as you mention, you’re minimizing the difference between $s_{t+1} = H(o_{t+1})$ and $\hat{s}_{t+1} = G(s_t, a_t) = G(H(o_t), a_t)$…) How can you add a loss term that’s already present? (Or, if you’re not training your predictor by comparing predicted with actual and back-propagating error, how are you training it?)
Or are you saying that this is training the combined $G(H(o_t), a_t)$ network, including back-propagation of (part of) the error into updating the weights of $H$ as well? If so, that makes sense. Makes you wonder where else people feed the output of one NN into the input of another NN without back-propagation…
> Or, if you’re not training your predictor by comparing predicted with actual and back-propagating error, how are you training it?

MuZero trains the predictor by putting a neural network in the “place in the algorithm where a predictor function would go”, and then training the parameters of that network by backpropagating rewards through it. So in MuZero the “predictor network” is not explicitly trained the way you would expect a predictor to be; it’s only guided by rewards. The predictor network only knows that it should produce stuff that gives more reward; there’s no sense of being a good predictor for its own sake. The advance of this paper is to ask: “what if the place in our algorithm where a predictor function should go was actually trained like a predictor?” Check out equation 6 on page 15 of the paper.
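For orientation, that objective has roughly the following shape. This is my paraphrase of a MuZero-style loss plus the added similarity term, not a transcription of the paper’s equation (6): $u_{t+k}$, $\pi_{t+k}$, $z_{t+k}$ are the reward/policy/value targets, the hatted quantities are the model’s unrolled predictions, $\lambda$ is whatever weight they put on the similarity term, and in this paper the reward term is the value-prefix version.

$$
L_t(\theta) \;\approx\; \sum_{k=0}^{K} \Big[\, \ell^{r}\big(u_{t+k}, \hat{r}_{t+k}\big) + \ell^{p}\big(\pi_{t+k}, \hat{p}_{t+k}\big) + \ell^{v}\big(z_{t+k}, \hat{v}_{t+k}\big) \Big] + \lambda\, L_{\text{similarity}}
$$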
I wonder how they prevent the latent state representation of observations from collapsing into a zero-vector, thus becoming completely uninformative and trivially predictable. And if this was the reason MuZero did things its way.
There is a term in the loss function reflecting the disparity between observed rewards and rewards predicted from the state sequence (first term of $L_t(\theta)$ in equation (6)). If the state representation collapsed, it would be impossible to predict rewards from it. The third term in the loss function would also punish you: it compares the value computed from the state to a linear combination of rewards and the value computed from the state at a different step (see equation (4) for the definition of $z_t$).
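That is, if this follows the usual MuZero convention, $z_t$ is an $n$-step bootstrapped return, something like (a paraphrase, not a transcription of equation (4)):

$$
z_t \;=\; u_{t+1} + \gamma\, u_{t+2} + \cdots + \gamma^{\,n-1} u_{t+n} + \gamma^{\,n}\, v_{t+n}
$$

so a collapsed, constant state would leave both the reward and the value predictions unable to track targets that actually vary.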
Oh I see, did I misunderstand point 1 from Razied then, or was it mistaken? I thought $H$ and $G$ were trained separately with $L_{\text{similarity}}$.
No, they are training all the networks together. The original MuZero didn’t have $L_{\text{similarity}}$; it learned the dynamics only via the reward-prediction terms.
> 5. What is the update / implication of this, in your opinion?

Personal opinion:
Progress in model-based RL is far more relevant to getting us closer to AGI than other fields like NLP or image recognition or neuroscience or ML hardware. I worry that once the research community shifts its focus towards RL, the AGI timeline will collapse—not necessarily because there are no more critical insights left to be discovered, but because it’s fundamentally the right path to work on and whatever obstacles remain will buckle quickly once we throw enough warm bodies at them. I think—and this is highly controversial—that the focus on NLP and Vision Transformer has served as a distraction for a couple of years and actually delayed progress towards AGI.
If curiosity-driven exploration gets thrown into the mix and Starcraft/Dota gets solved (for real this time) with data efficiency comparable to humans, that would be a shrieking fire alarm to me (but not to many other people, I imagine, as “this has all been done before”).
Isn’t this paper already a shrieking fire alarm?
(1) Same architecture and hyperparameters, trained separately on every game.
(4) It might work. In fact they also tested it on a benchmark that involves controlling a robot in a simulation, and showed it beats state-of-the-art on the same amount of training data (but there is no “human performance” to compare to).
(5) The poor sample complexity was one of the strongest arguments for why deep learning is not enough for AGI. So, this is a significant update in the direction of “we don’t need that many more new ideas to reach AGI”. Another implication is that model-based RL seems to be pulling way ahead of model-free RL.
It is a different net for each game. That is why they compare with DQN, not Agent57.
Training an Atari agent for 100k steps takes only 7 hours on 4 GPUs.
The entire architecture is described in Appendix A.1, Models and Hyper-parameters.
Yes.
This algorithm is more sample-efficient than humans, i.e. it learns a specific game from less gameplay experience than a human needs. This is definitely a huge breakthrough.
Do you have a source for Agent57 using the same network weights for all games?
I don’t think it does, and reskimming the paper I don’t see any claim it does (using a single network seems to have been largely neglected since Popart). Prabhu might be thinking of how it uses a single fixed network architecture & set of hyperparameters across all games (which while showing generality, doesn’t give any transfer learning or anything).