What they do is add a loss term that tells the networks that $G(H(o_t), a_t)$ should be very close to $H(o_{t+1})$.
I am confused. Training your “dynamics network” as a predictor is precisely training that $G(H(o_t), a_t)$ should be very close to $H(o_{t+1})$. (Or rather, as you mention, you’re minimizing the difference between $s_{t+1} = H(o_{t+1})$ and $\hat{s}_{t+1} = G(s_t, a_t) = G(H(o_t), a_t)$…) How can you add a loss term that’s already present? (Or, if you’re not training your predictor by comparing predicted with actual and back-propagating error, how are you training it?)
Or are you saying that this is training the combined $G(H(o_t), a_t)$ network, including back-propagation of (part of) the error into updating the weights of $H$ as well? If so, that makes sense. Makes you wonder where else people feed the output of one NN into the input of another NN without back-propagation…
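For what it’s worth, here is a minimal sketch of that combined setup in PyTorch, assuming toy dimensions and plain MLPs for $H$ and $G$ (all names and shapes are illustrative, not the paper’s actual architecture). The point is just that the MSE between $\hat{s}_{t+1}$ and $s_{t+1}$ back-propagates into both networks:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, state_dim = 8, 2, 16  # toy sizes, purely illustrative

# H: observation -> latent state; G: (latent state, action) -> next latent state
H = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, state_dim))
G = nn.Sequential(nn.Linear(state_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, state_dim))

def consistency_loss(o_t, a_t, o_tp1):
    s_t = H(o_t)                                  # s_t = H(o_t)
    s_hat_tp1 = G(torch.cat([s_t, a_t], dim=-1))  # predicted next state G(H(o_t), a_t)
    s_tp1 = H(o_tp1)                              # target next state H(o_{t+1})
    # Gradients flow into G *and* (through both branches) into H,
    # so this one loss term trains the encoder and the dynamics jointly.
    return ((s_hat_tp1 - s_tp1) ** 2).mean()

loss = consistency_loss(torch.randn(32, obs_dim), torch.randn(32, act_dim),
                        torch.randn(32, obs_dim))
loss.backward()  # populates gradients of both H and G
```

(If I’m reading the paper right, it actually uses a SimSiam-style setup with a stop-gradient on the target branch rather than a plain MSE, but the flow of gradients into both networks is the same idea.)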
Or, if you’re not training your predictor by comparing predicted with actual and back-propagating error, how are you training it?
MuZero is training the predictor by using a neural network in the “place in the algorithm where a predictor function would go”, and then training the parameters of that network by backpropagating rewards through it. So in MuZero the “predictor network” is not explicitly trained the way you would expect a predictor to be; it is only guided by rewards. The network only knows that it should produce outputs that lead to more reward; there is no sense of being a good predictor for its own sake. And the advance of this paper is to ask: “what if the place in our algorithm where a predictor function should go was actually trained like a predictor?” Check out equation 6 on page 15 of the paper.
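To put the contrast in code: a rough sketch of how the added term slots in next to the MuZero-style losses (the names and the weighting constant are assumptions on my part; equation 6 in the paper has the real form):

```python
import torch

def total_loss(pred_value, target_value, pred_reward, target_reward,
               pred_policy_logits, target_policy,
               s_hat_tp1, s_tp1, consistency_weight=1.0):
    # Original MuZero terms: the dynamics network is trained *only* through
    # these, i.e. only insofar as its latent states lead to good value,
    # reward, and policy predictions.
    value_loss = (pred_value - target_value) ** 2
    reward_loss = (pred_reward - target_reward) ** 2
    policy_loss = -(target_policy *
                    torch.log_softmax(pred_policy_logits, dim=-1)).sum(-1)
    # The paper's added term: train G(H(o_t), a_t) directly as a predictor
    # of H(o_{t+1}) (here a plain MSE with a detached target, as a stand-in
    # for the paper's SimSiam-style loss).
    consistency = ((s_hat_tp1 - s_tp1.detach()) ** 2).mean(-1)
    return (value_loss + reward_loss + policy_loss
            + consistency_weight * consistency).mean()
```

With `consistency_weight=0` this reduces to the purely reward-guided MuZero setup; the paper’s change is to make that extra term nonzero, so the network in the “predictor slot” is also trained like a predictor.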
Thank you for the explanation!