Learn the environment dynamics by self-supervision instead of relying only on reward signals. In other words, they don't learn the dynamics end-to-end the way MuZero does: for them the loss function for the environment dynamics is completely separate from the RL loss function.
I wonder how they prevent the latent state representation of observations from collapsing into a zero vector and thus becoming completely uninformative and trivially predictable, and whether that was the reason MuZero did things its way.
There is a term in the loss function reflecting the disparity between observed rewards and rewards predicted from the state sequence (the first term of L_t(θ) in equation (6)). If the state representation collapsed, it would be impossible to predict rewards from it. The third term in the loss function would also punish you: it compares the value computed from the state to a linear combination of observed rewards and the value computed from the state at a later step (see equation (4) for the definition of z_t).
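To make those two terms concrete, here is a minimal sketch (not the authors' code; the plain-MSE losses, discount value, and tensor shapes are my assumptions, and the paper actually uses categorical reward/value heads) of the value target z_t and of the reward and value terms summed over the unrolled steps:

```python
import torch

def n_step_return(rewards: torch.Tensor, bootstrap_value: torch.Tensor,
                  gamma: float = 0.997) -> torch.Tensor:
    """Value target z_t (cf. eq. (4)): discounted observed rewards plus a
    bootstrapped value computed from the state n steps later."""
    n = rewards.shape[0]
    discounts = gamma ** torch.arange(n, dtype=rewards.dtype)
    return (discounts * rewards).sum() + (gamma ** n) * bootstrap_value

def rl_loss_terms(pred_rewards, true_rewards, pred_values, value_targets):
    """Reward term (predicted vs. observed reward) and value term (predicted
    value vs. z_t) of L_t(theta), summed over the unrolled steps."""
    loss = torch.zeros(())
    for r_hat, r, v_hat, z in zip(pred_rewards, true_rewards,
                                  pred_values, value_targets):
        loss = loss + (r_hat - r) ** 2   # a collapsed latent cannot fit this
        loss = loss + (v_hat - z) ** 2   # nor this
    return loss

# toy usage with made-up numbers
rewards = torch.tensor([1.0, 0.0, 2.0])
z_t = n_step_return(rewards, bootstrap_value=torch.tensor(5.0))
```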
Oh I see, did I misunderstand point 1 from Razied then, or was it mistaken? I thought H and G were trained separately with Lsimilarity.
No, they are training all the networks together. The original MuZero didn’t have Lsimilarity; it learned the dynamics only via the reward-prediction terms.
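And joint training is also what answers the collapse question: Lsimilarity is a SimSiam-style consistency loss with a stop-gradient on the target branch, optimised alongside the reward/value/policy terms. Rough sketch below (the projector/predictor heads, shapes, and weighting are placeholders I made up, not the paper's exact architecture):

```python
import torch
import torch.nn.functional as F

def similarity_loss(predicted_next_state: torch.Tensor,   # g(s_t, a_t)
                    encoded_next_obs: torch.Tensor,        # h(o_{t+1})
                    projector: torch.nn.Module,
                    predictor: torch.nn.Module) -> torch.Tensor:
    """SimSiam-style consistency term: push the dynamics prediction towards
    the encoding of the real next observation. The target branch is detached
    (stop-gradient); that asymmetry, plus the jointly trained reward/value
    losses, is what keeps the latent from collapsing to a constant vector."""
    online = predictor(projector(predicted_next_state))
    with torch.no_grad():
        target = projector(encoded_next_obs)
    return -F.cosine_similarity(online, target, dim=-1).mean()

# Trivial heads just to show it runs; the real ones are small MLPs.
proj, pred = torch.nn.Identity(), torch.nn.Identity()
s_hat, s_next = torch.randn(8, 64), torch.randn(8, 64)
sim = similarity_loss(s_hat, s_next, proj, pred)
# total_loss = rl_loss + lambda_sim * sim   # everything optimised together
```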