There is a term in the loss function reflecting the disparity between observed rewards and rewards predicted from the state sequence (first term of Lt(θ) in equation (6)). If the state representation collapsed, it would be impossible to predict rewards from it. The third term in the loss function would also penalize a collapsed representation: it compares the value computed from the state to a linear combination of rewards and the value computed from the state at a different step (see equation (4) for the definition of zt).
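To make this concrete, here is a minimal PyTorch-style sketch of how those terms share the same unrolled state. The toy linear networks, sizes, and the plain MSE stand-in for Lsimilarity are my own simplifications (the policy term and the paper's exact projection and weighting are omitted); it is not the paper's implementation.

```python
# Minimal sketch (PyTorch), assuming toy linear networks and made-up sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, state_dim, action_dim, unroll_steps = 8, 32, 4, 5

H = nn.Linear(obs_dim, state_dim)                     # representation: o_t -> s_t
G = nn.Linear(state_dim + action_dim, state_dim + 1)  # dynamics: (s, a) -> (s', predicted reward)
Fnet = nn.Linear(state_dim, 1)                        # prediction: s -> predicted value

def unrolled_loss(obs, next_obs, actions, rewards, z, sim_weight=1.0):
    """rewards[k]: observed reward at unroll step k; z[k]: the n-step return
    target z_t from eq. (4). sim_weight=0 drops the Lsimilarity term entirely."""
    s = H(obs)
    loss = torch.tensor(0.0)
    for k in range(unroll_steps):
        out = G(torch.cat([s, actions[k]], dim=-1))
        s, r_hat = out[:-1], out[-1]
        v_hat = Fnet(s).squeeze(-1)
        loss = loss + F.mse_loss(r_hat, rewards[k])    # reward term (1st term of Lt)
        loss = loss + F.mse_loss(v_hat, z[k])          # value term (3rd term of Lt)
        # simplified Lsimilarity: unrolled state vs. encoding of the real next observation
        loss = loss + sim_weight * F.mse_loss(s, H(next_obs[k]).detach())
    return loss

# toy call with random data, just to show the shapes
loss = unrolled_loss(torch.randn(obs_dim), torch.randn(unroll_steps, obs_dim),
                     torch.randn(unroll_steps, action_dim),
                     torch.randn(unroll_steps), torch.randn(unroll_steps))
print(loss)
```

Because the reward and value terms are computed from the states produced by unrolling G from H's output, a collapsed representation would drive both of those errors up, even without the similarity term.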
Oh I see, did I misunderstand point 1 from Razied then, or was it mistaken? I thought H and G were trained separately with Lsimilarity.
No, they are training all the networks together. The original MuZero didn't have Lsimilarity; it learned the dynamics only via the reward-prediction terms.
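In the sketch above that corresponds to calling `unrolled_loss(obs, next_obs, actions, rewards, z, sim_weight=0.0)`: the gradients reaching H and G then come only from the reward and value prediction errors.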