Learn the environment dynamics by self-supervision instead of relying only on reward signals. In other words, they don't learn the dynamics end-to-end the way MuZero does: for them the loss function for the environment dynamics is completely separate from the RL loss function.
I wonder how they prevent the latent state representation of observations from collapsing into a zero vector and thus becoming completely uninformative and trivially predictable, and whether that was the reason MuZero did things its way.
There is a term in the loss function reflecting the disparity between observed rewards and rewards predicted from the state sequence (the first term of L_t(θ) in equation (6)). If the state representation collapsed, it would be impossible to predict rewards from it. The third term in the loss function would also punish you: it compares the value computed from the state to a linear combination of observed rewards and the value computed from the state at a later step (see equation (4) for the definition of z_t).
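To make those two terms concrete, here is a minimal sketch (not the authors' code; the plain-MSE losses, discount value, and tensor shapes are my assumptions, and the paper actually uses categorical reward/value heads) of the value target z_t and of the reward and value terms summed over the unrolled steps:

```python
import torch

def n_step_return(rewards: torch.Tensor, bootstrap_value: torch.Tensor,
                  gamma: float = 0.997) -> torch.Tensor:
    """Value target z_t (cf. eq. (4)): discounted observed rewards plus a
    bootstrapped value computed from the state n steps later."""
    n = rewards.shape[0]
    discounts = gamma ** torch.arange(n, dtype=rewards.dtype)
    return (discounts * rewards).sum() + (gamma ** n) * bootstrap_value

def rl_loss_terms(pred_rewards, true_rewards, pred_values, value_targets):
    """Reward term (predicted vs. observed reward) and value term (predicted
    value vs. z_t) of L_t(theta), summed over the unrolled steps."""
    loss = torch.zeros(())
    for r_hat, r, v_hat, z in zip(pred_rewards, true_rewards,
                                  pred_values, value_targets):
        loss = loss + (r_hat - r) ** 2   # a collapsed latent cannot fit this
        loss = loss + (v_hat - z) ** 2   # nor this
    return loss

# toy usage with made-up numbers
rewards = torch.tensor([1.0, 0.0, 2.0])
z_t = n_step_return(rewards, bootstrap_value=torch.tensor(5.0))
```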
Oh I see, did I misunderstand point 1 from Razied then, or was it mistaken? I thought H and G were trained separately with Lsimilarity.
No, they are training all the networks together. The original MuZero didn’t have Lsimilarity; it learned the dynamics only via the reward-prediction terms.
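And joint training is also what answers the collapse question: Lsimilarity is a SimSiam-style consistency loss with a stop-gradient on the target branch, optimised alongside the reward/value/policy terms. Rough sketch below (the projector/predictor heads, shapes, and weighting are placeholders I made up, not the paper's exact architecture):

```python
import torch
import torch.nn.functional as F

def similarity_loss(predicted_next_state: torch.Tensor,   # g(s_t, a_t)
                    encoded_next_obs: torch.Tensor,        # h(o_{t+1})
                    projector: torch.nn.Module,
                    predictor: torch.nn.Module) -> torch.Tensor:
    """SimSiam-style consistency term: push the dynamics prediction towards
    the encoding of the real next observation. The target branch is detached
    (stop-gradient); that asymmetry, plus the jointly trained reward/value
    losses, is what keeps the latent from collapsing to a constant vector."""
    online = predictor(projector(predicted_next_state))
    with torch.no_grad():
        target = projector(encoded_next_obs)
    return -F.cosine_similarity(online, target, dim=-1).mean()

# Trivial heads just to show it runs; the real ones are small MLPs.
proj, pred = torch.nn.Identity(), torch.nn.Identity()
s_hat, s_next = torch.randn(8, 64), torch.randn(8, 64)
sim = similarity_loss(s_hat, s_next, proj, pred)
# total_loss = rl_loss + lambda_sim * sim   # everything optimised together
```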