There is a term in the loss function reflecting the disparity between observed rewards and rewards predicted from the state sequence (first term of Lt(θ) in equation (6)). If the state representation collapsed, it would be impossible to predict rewards from it. The third term in the loss function would also penalize a collapsed representation: it compares the value computed from the state to a linear combination of rewards and the value computed from the state at a different step (see equation (4) for the definition of zt).
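To make this concrete, here is a minimal PyTorch-style sketch of how those terms share the same unrolled state. The toy linear networks, sizes, and the plain MSE stand-in for Lsimilarity are my own simplifications (the policy term and the paper's exact projection and weighting are omitted); it is not the paper's implementation.

```python
# Minimal sketch (PyTorch), assuming toy linear networks and made-up sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, state_dim, action_dim, unroll_steps = 8, 32, 4, 5

H = nn.Linear(obs_dim, state_dim)                     # representation: o_t -> s_t
G = nn.Linear(state_dim + action_dim, state_dim + 1)  # dynamics: (s, a) -> (s', predicted reward)
Fnet = nn.Linear(state_dim, 1)                        # prediction: s -> predicted value

def unrolled_loss(obs, next_obs, actions, rewards, z, sim_weight=1.0):
    """rewards[k]: observed reward at unroll step k; z[k]: the n-step return
    target z_t from eq. (4). sim_weight=0 drops the Lsimilarity term entirely."""
    s = H(obs)
    loss = torch.tensor(0.0)
    for k in range(unroll_steps):
        out = G(torch.cat([s, actions[k]], dim=-1))
        s, r_hat = out[:-1], out[-1]
        v_hat = Fnet(s).squeeze(-1)
        loss = loss + F.mse_loss(r_hat, rewards[k])    # reward term (1st term of Lt)
        loss = loss + F.mse_loss(v_hat, z[k])          # value term (3rd term of Lt)
        # simplified Lsimilarity: unrolled state vs. encoding of the real next observation
        loss = loss + sim_weight * F.mse_loss(s, H(next_obs[k]).detach())
    return loss

# toy call with random data, just to show the shapes
loss = unrolled_loss(torch.randn(obs_dim), torch.randn(unroll_steps, obs_dim),
                     torch.randn(unroll_steps, action_dim),
                     torch.randn(unroll_steps), torch.randn(unroll_steps))
print(loss)
```

Because the reward and value terms are computed from the states produced by unrolling G from H's output, a collapsed representation would drive both of those errors up, even without the similarity term.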
Oh I see, did I misunderstand point 1 from Razied then, or was it mistaken? I thought H and G were trained separately with Lsimilarity.
No, they are training all the networks together. The original MuZero didn't have Lsimilarity; it learned the dynamics only via the reward-prediction terms.
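In the sketch above that corresponds to calling `unrolled_loss(obs, next_obs, actions, rewards, z, sim_weight=0.0)`: the gradients reaching H and G then come only from the reward and value prediction errors.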