It is a different net for each game. That is why they compare with DQN, not Agent57.
Training an Atari agent for 100k steps takes only 4 GPUs for 7 hours.
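For scale (assuming the standard Atari 100k setup, i.e. frame-skip 4 and 60 fps game frames): 100,000 agent steps × 4 frames/step = 400,000 frames, and 400,000 / 60 fps ≈ 6,700 s ≈ 1.9 hours of real-time gameplay.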
The entire architecture is described in Appendix A.1, Models and Hyper-parameters.
Yes.
This algorithm is more sample-efficient than humans, meaning it learns a given game from less game experience than a human needs. That is definitely a huge breakthrough.
Do you have a source for Agent57 using the same network weights for all games?
I don’t think it does, and reskimming the paper I don’t see any claim that it does (using a single network seems to have been largely neglected since PopArt). Prabhu might be thinking of how it uses a single fixed network architecture & set of hyperparameters across all games (which, while it shows generality, doesn’t give any transfer learning or anything).