Oh I see. Did I misunderstand point 1 from Razied, then, or was it mistaken? I thought H and G were trained separately with L_similarity.
No, they are training all the networks together. The original MuZero didn’t have L_similarity; it learned the dynamics only via the reward-prediction terms.
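To make "training all the networks together" concrete, here is a minimal sketch of a single joint objective. It is an illustration only, not the paper's code: the function names, the `similarity_weight` parameter, and the choice of negative cosine similarity for L_similarity are assumptions, and the per-term loss values are made-up numbers. No gradients are computed here. The point is only that one scalar is formed, so every network that contributes to any term would be updated from the same backward pass.

```python
import math


def negative_cosine_similarity(predicted, target):
    """Negative cosine similarity between two latent vectors (lower is better).

    Assumed form of L_similarity; the thread does not specify it.
    """
    dot = sum(p * t for p, t in zip(predicted, target))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_t = math.sqrt(sum(t * t for t in target))
    return -dot / (norm_p * norm_t)


def joint_loss(reward_loss, value_loss, policy_loss,
               predicted_latent=None, target_latent=None,
               similarity_weight=0.0):
    """Combine all terms into one scalar that is optimized end to end.

    similarity_weight is a hypothetical coefficient; 0.0 drops the
    similarity term, leaving only the prediction terms.
    """
    total = reward_loss + value_loss + policy_loss
    if similarity_weight > 0.0:
        total += similarity_weight * negative_cosine_similarity(
            predicted_latent, target_latent)
    return total


# Prediction terms only: the dynamics get no direct latent-matching signal.
prediction_only = joint_loss(0.5, 0.25, 0.25)

# Same terms plus a similarity term comparing the latent predicted by the
# dynamics network with the latent encoded from the real next observation.
with_similarity = joint_loss(
    0.5, 0.25, 0.25,
    predicted_latent=[1.0, 0.0],
    target_latent=[0.0, 1.0],
    similarity_weight=2.0,
)
```

In a real implementation the target latent would typically be treated as a constant (no gradient flows through it), but that detail is outside what this toy shows.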