I’m still not sure how to reconcile your results with the fact that the participants in the procgen contest ended up winning with modifications of our PPO/PPG baselines, rather than Q-learning and other value-based algorithms, whereas your paper suggests that Q-learning performs much better. The contest used 8M timesteps + 200 levels. I assume that your “QL” baseline is pretty similar to widespread DQN implementations.
https://arxiv.org/pdf/2103.15332.pdf
https://www.aicrowd.com/challenges/neurips-2020-procgen-competition/leaderboards?challenge_leaderboard_extra_id=470&challenge_round_id=662
Are there implementation-level changes that dramatically improve the performance of your QL implementation?
(I’m currently on vacation and only read your paper briefly while traveling, so I may very well have missed something.)
The Q-Learning baseline is a model-free control for MuZero: it shares MuZero’s implementation details (network architecture, replay ratio, training details, etc.) while removing the model-based components (details in Sec. A.2). Some key differences you’d find vs. a typical Q-learning implementation:
Larger network architecture: a 10-block ResNet compared to the few conv layers in typical implementations (a rough sketch of such a torso appears at the end of this reply).
Higher sample reuse: with a reanalyse ratio of 0.95 (i.e. 95% of training data is reanalysed rather than fresh), both MuZero and Q-Learning use each replay-buffer sample an average of 1/(1 - 0.95) = 20 times. The target network is updated every 100 training steps.
A batch size of 1024, plus some smaller details like using categorical reward and value predictions, similar to MuZero.
We also have a small model-based component that predicts the reward at the next time step, which lets us decompose Q(s, a) into reward and value predictions just like MuZero (see the sketch right after this list).
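To make that decomposition concrete, here is a minimal sketch (not our actual code) of how MuZero-style categorical reward and value heads can be turned back into scalars and combined into Q(s, a); the support range, discount, and names below are placeholder assumptions for illustration:

```python
import numpy as np

# Assumed support and discount, for illustration only.
SUPPORT = np.linspace(-10.0, 10.0, 21)
GAMMA = 0.997

def to_scalar(logits):
    """Expectation of a categorical prediction over the fixed support."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return (probs * SUPPORT).sum(axis=-1)

def q_value(reward_logits, next_value_logits):
    """Q(s, a) = r_hat(s, a) + gamma * v_hat(s'), each predicted categorically."""
    return to_scalar(reward_logits) + GAMMA * to_scalar(next_value_logits)

# Random logits standing in for network outputs on a single (s, a) pair.
rng = np.random.default_rng(0)
print(q_value(rng.normal(size=21), rng.normal(size=21)))
```

The point is just that the scalar Q-value is reconstructed from two categorical heads rather than regressed directly.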
I would guess larger networks + higher sample reuse have the biggest effect size compared to standard Q-learning implementations.
The ProcGen competition might also have used the easy difficulty mode, compared to the hard difficulty mode used in our paper.
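On the first point about network size, here is a rough sketch of what a 10-block residual torso looks like next to a typical DQN-style conv stack; the channel counts, 64x64x3 input, and lack of downsampling are assumptions for illustration, not our exact architecture:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simple residual block: conv-relu-conv plus a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        h = torch.relu(self.conv1(x))
        return torch.relu(x + self.conv2(h))

# 10-block ResNet torso vs. the few conv layers of a typical DQN-style net.
resnet_torso = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    *[ResidualBlock(64) for _ in range(10)],
)
dqn_torso = nn.Sequential(
    nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
    nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
)

x = torch.zeros(1, 3, 64, 64)  # Procgen-sized observation
print(resnet_torso(x).shape, dqn_torso(x).shape)
```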
Thanks, this is very insightful. BTW, I think your paper is excellent!
Thanks, glad you liked it! I really like the recent RL directions from OpenAI too. It would be interesting to see model-based RL used for the “RL as fine-tuning” paradigm: making large pre-trained models more aligned/goal-directed efficiently by simply searching over a reward function learned from humans.
Would you say Learning to Summarize is an example of this? https://arxiv.org/abs/2009.01325 Or do you have something else in mind?
It’s model-based RL because you’re optimizing against the model of the human (i.e. the reward model). And there are some results at the end on test-time search.
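To spell out the kind of test-time search I mean, here is a toy best-of-N sketch: sample several completions from the pre-trained model and keep the one the learned reward model scores highest. `sample_from_lm` and `reward_model` below are hypothetical placeholders, not real APIs from that paper or any library:

```python
import numpy as np

def sample_from_lm(prompt, rng):
    # Hypothetical stand-in for sampling a completion from a pre-trained LM.
    return f"{prompt} ... candidate {rng.integers(1_000_000)}"

def reward_model(prompt, completion, rng):
    # Hypothetical stand-in for a reward model trained on human preferences.
    return float(rng.normal())

def best_of_n(prompt, n=16, seed=0):
    """Sample n completions and return the one the reward model ranks highest."""
    rng = np.random.default_rng(seed)
    candidates = [sample_from_lm(prompt, rng) for _ in range(n)]
    scores = [reward_model(prompt, c, rng) for c in candidates]
    return candidates[int(np.argmax(scores))]

print(best_of_n("Summarize the article:"))
```

Best-of-N is the simplest form of this kind of search; RLHF-style fine-tuning instead amortizes it by optimizing the policy against the same reward model.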