Ankesh Anand comments on EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised

Ankesh Anand 14 Nov 2021 3:45 UTC
LW: 6 AF: 5
AF
The Q-Learning baseline is a model-free control for MuZero: it shares MuZero's implementation details (network architecture, replay ratio, training details, etc.) while removing MuZero's model-based components (details in Sec. A.2). Some key differences you'd find vs. a typical Q-learning implementation:
- Larger network architectures: a 10-block ResNet, compared to a few conv layers in typical implementations.
- Higher sample reuse: When using a reanalyse ratio of 0.95, both MuZero and Q-Learning use each replay buffer sample an average of 20 times. The target network is updated every 100 training steps.
- A batch size of 1024, plus some smaller details such as categorical reward and value predictions, as in MuZero.
- We also have a small model-based component that predicts the reward at the next time step, which lets us decompose Q(s,a) into reward and value predictions, just like MuZero.
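As a rough illustration of the last two bullets, the sketch below reads a scalar out of a categorical prediction as an expectation over a support, and decomposes Q(s,a) into a predicted reward plus a discounted value of the next state. The function names, the discount of 0.99, and the plain expectation (with no value transform) are assumptions made for illustration, not details taken from the paper.

```python
from typing import Sequence


def expected_scalar(probs: Sequence[float], support: Sequence[float]) -> float:
    """Turn a categorical prediction into a scalar by taking its expectation."""
    if len(probs) != len(support):
        raise ValueError("probs and support must have the same length")
    return sum(p * s for p, s in zip(probs, support))


def q_value(reward: float, next_value: float, discount: float = 0.99) -> float:
    """One-step decomposition: Q(s, a) = r(s, a) + discount * V(s')."""
    return reward + discount * next_value


def greedy_action(rewards: Sequence[float],
                  next_values: Sequence[float],
                  discount: float = 0.99) -> int:
    """Pick the action index with the largest decomposed Q-value."""
    qs = [q_value(r, v, discount) for r, v in zip(rewards, next_values)]
    return max(range(len(qs)), key=qs.__getitem__)
```

Here `rewards[i]` and `next_values[i]` stand for the predicted reward and next-state value of action `i` (hypothetical inputs); in a real agent both would come from network heads.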
I would guess larger networks + higher sample reuse have the biggest effect size compared to standard Q-learning implementations.
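To make the sample-reuse figure concrete: if a fraction ρ of the training data comes from reanalysed (already stored) samples and the remaining 1 - ρ is fresh, then in steady state each sample is used about 1/(1 - ρ) times on average. That steady-state reading of the reanalyse ratio, and the helper name below, are assumptions for illustration.

```python
from fractions import Fraction


def average_sample_reuse(reanalyse_ratio: Fraction) -> Fraction:
    """Average number of times each sample is trained on.

    Assumes a fraction `reanalyse_ratio` of training data is replayed
    (reanalysed) and the rest is freshly collected.
    """
    if not 0 <= reanalyse_ratio < 1:
        raise ValueError("reanalyse_ratio must be in [0, 1)")
    return 1 / (1 - reanalyse_ratio)


# Exact rational arithmetic avoids floating-point noise in 1 / (1 - 0.95).
reuse = average_sample_reuse(Fraction(95, 100))
```

With a ratio of 0.95 this gives 20, matching the figure quoted above.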
The ProcGen competition may also have used the easy difficulty mode, whereas our paper uses the hard difficulty mode.
- John Schulman 14 Nov 2021 18:02 UTC
  LW: 3 AF: 3
  AF Parent
  Thanks, this is very insightful. BTW, I think your paper is excellent!
  - Ankesh Anand 14 Nov 2021 21:26 UTC
    1 point
    AF Parent
    Thanks, glad you liked it. I really like the recent RL directions from OpenAI too! It would be interesting to see model-based RL used for the “RL as fine-tuning” paradigm: efficiently making large pre-trained models more aligned/goal-directed by simply searching against a reward function learned from humans.
    - John Schulman 19 Nov 2021 8:59 UTC
      LW: 3 AF: 2
      AF Parent
      Would you say Learning to Summarize is an example of this? https://arxiv.org/abs/2009.01325
      It’s model-based RL because you’re optimizing against the model of the human (i.e., the reward model). And there are some results at the end on test-time search.
      Or do you have something else in mind?
