John Schulman comments on EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised

John Schulman 19 Nov 2021 8:59 UTC
LW: 3 AF: 2
Would you say Learning to Summarize is an example of this? https://arxiv.org/abs/2009.01325
It’s model-based RL because you’re optimizing against a model of the human (i.e., the reward model). There are also some results at the end on test-time search.
Or do you have something else in mind?
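
To make the test-time search point concrete: one simple form of search against a reward model is best-of-N reranking, where you sample several candidates from the policy and keep the one the reward model scores highest. The sketch below is only an illustration under that assumption, not necessarily the exact procedure in the paper. `best_of_n`, `make_toy_policy`, and `toy_reward_model` are hypothetical names, and the toy policy and reward model are stand-ins for a learned summarizer and a learned model of human preferences.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(prompt: str,
              sample: Callable[[str], str],
              reward_model: Callable[[str, str], float],
              n: int = 16) -> Tuple[str, float]:
    """Draw n candidates from the policy and return the one the
    reward model scores highest, together with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [sample(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]


def make_toy_policy(seed: int = 0) -> Callable[[str], str]:
    """Hypothetical stand-in for a summarization policy:
    returns a random-length prefix of the prompt."""
    rng = random.Random(seed)

    def sample(prompt: str) -> str:
        words = prompt.split()
        k = rng.randint(1, len(words))
        return " ".join(words[:k])

    return sample


def toy_reward_model(prompt: str, summary: str) -> float:
    """Hypothetical stand-in for a learned reward model:
    prefers summaries close to five words long."""
    return -abs(len(summary.split()) - 5)


if __name__ == "__main__":
    prompt = "the quick brown fox jumps over the lazy dog again and again"
    summary, score = best_of_n(prompt, make_toy_policy(), toy_reward_model, n=32)
    print(summary, score)
```

The "model" here is the reward model rather than a dynamics model: the search never queries a human, only the learned proxy for human judgment, which is what makes this look like planning against a model.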