No, they are training all the networks together. The original MuZero didn’t have Lsimilarity, it learned the dynamics only via the reward-prediction terms.
No, they are training all the networks together. The original MuZero didn’t have Lsimilarity, it learned the dynamics only via the reward-prediction terms.