(Fair warning: I’m definitely in the “amateur” category here. Usual caveats apply—using incorrect terminology, etc, etc. Feel free to correct me.)
> they in fact add a similarity loss to the loss function of MuZero that provides extra supervision
How do they prevent noise traps? That is, picture a maze with featureless grey walls except for a wall that displays TV static. Black and white noise. Most of the rest of the maze looks very similar—but said wall is always very different. The agent ends up penalized for not sitting and watching the TV forever (or encouraged to sit and watch the TV forever. Same difference up to a constant...)
Ah, I understand your confusion: the similarity loss they add to the RL loss function has nothing to do with exploration. It’s not meant to encourage the network to explore less “similar” states, and so it is not affected by the noisy-TV problem.
The similarity loss refers to the fact that they train a “representation network” that takes the raw pixels $o_t$ and produces a vector that they take as their state, $s_t = H(o_t)$. They also train a “dynamics network” that predicts $s_{t+1}$ from $s_t$ and $a_t$: $\hat{s}_{t+1} = G(s_t, a_t)$. In MuZero these networks are trained directly from rewards, yet that is a very noisy and sparse signal for them, so the authors reason that they need some extra supervision to train them better. What they do is add a loss term that tells the networks that $G(H(o_t), a_t)$ should be very close to $H(o_{t+1})$. In effect they want the predicted next state $\hat{s}_{t+1}$ to be very similar to the state $s_{t+1}$ that in fact occurs. This is a rather …obvious… addition: of course your prediction network should produce predictions that actually occur. This provides additional gradient signal to train the representation and dynamics networks, and is what the similarity loss refers to.
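Roughly, in toy PyTorch (the MLP shapes, the cosine-similarity form, and the detach on the target branch are all simplifications of mine for illustration, not the paper’s actual convolutional architecture or exact loss):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the networks described above; all dimensions are made up.
OBS_DIM, ACT_DIM, STATE_DIM = 64, 4, 32

H = nn.Sequential(nn.Linear(OBS_DIM, STATE_DIM), nn.ReLU(),
                  nn.Linear(STATE_DIM, STATE_DIM))            # representation: o_t -> s_t
G = nn.Sequential(nn.Linear(STATE_DIM + ACT_DIM, STATE_DIM), nn.ReLU(),
                  nn.Linear(STATE_DIM, STATE_DIM))            # dynamics: (s_t, a_t) -> s_{t+1}

def similarity_loss(o_t, a_t, o_t1):
    """Push the predicted next state G(H(o_t), a_t) towards the encoding H(o_{t+1})
    of the next observation that actually occurred."""
    s_t1_pred = G(torch.cat([H(o_t), a_t], dim=-1))
    s_t1_target = H(o_t1).detach()   # treating the target branch as fixed is my choice here
    # Negative cosine similarity as the "similarity" term (plain MSE would also illustrate the idea).
    return -F.cosine_similarity(s_t1_pred, s_t1_target, dim=-1).mean()

# This extra term is simply added to MuZero's usual reward/value/policy losses, e.g.
#   total_loss = rl_loss + some_weight * similarity_loss(o_t, a_t, o_t1)

o_t  = torch.randn(8, OBS_DIM)                                   # batch of observations
a_t  = F.one_hot(torch.randint(ACT_DIM, (8,)), ACT_DIM).float()  # batch of one-hot actions
o_t1 = torch.randn(8, OBS_DIM)                                   # batch of next observations
print(similarity_loss(o_t, a_t, o_t1))
```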
> What they do is add a loss term that tells the networks that $G(H(o_t), a_t)$ should be very close to $H(o_{t+1})$.
I am confused. Training your “dynamics network” as a predictor is precisely training that $G(H(o_t), a_t)$ should be very close to $H(o_{t+1})$. (Or rather, as you mention, you’re minimizing the difference between $s_{t+1} = H(o_{t+1})$ and $\hat{s}_{t+1} = G(s_t, a_t) = G(H(o_t), a_t)$…) How can you add a loss term that’s already present? (Or, if you’re not training your predictor by comparing predicted with actual and back-propagating error, how are you training it?)
Or are you saying that this is training the combined $G(H(o_t), a_t)$ network, including back-propagation of (part of) the error into updating the weights of $H(o_t)$ also? If so, that makes sense. Makes you wonder where else people feed the output of one NN into the input of another NN without back-propagation…
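To be concrete, here is a toy illustration of what I mean by the error reaching $H$’s weights (shapes, names and the MSE loss are entirely made up, nothing to do with the paper’s actual code):

```python
import torch
import torch.nn as nn

# If the loss is computed on G(H(o_t), a_t), then backward() sends gradients
# through G *and* H, so the representation network gets updated too.
H = nn.Linear(8, 4)          # stand-in representation network
G = nn.Linear(4 + 1, 4)      # stand-in dynamics network (state + action -> next state)

o_t, a_t, o_t1 = torch.randn(1, 8), torch.randn(1, 1), torch.randn(1, 8)

pred = G(torch.cat([H(o_t), a_t], dim=-1))   # predicted next state
target = H(o_t1).detach()                    # gradients flow only through the prediction branch
loss = ((pred - target) ** 2).mean()
loss.backward()

print(H.weight.grad is not None)  # True: the similarity-style loss reached H's weights
print(G.weight.grad is not None)  # True: and G's
```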
> Or, if you’re not training your predictor by comparing predicted with actual and back-propagating error, how are you training it?
MuZero is training the predictor by using a neural network in the “place in the algorithm where a predictor function would go”, and then training the parameters of that network by backpropagating rewards through it. So in MuZero the “predictor network” is not explicitly trained as you would think a predictor would be; it is only guided by rewards. The predictor network only knows that it should produce stuff that gives more reward; there’s no sense of being a good predictor for its own sake. And the advance of this paper is to say “what if the place in our algorithm where a predictor function should go was actually trained like a predictor?” Check out equation 6 on page 15 of the paper.
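Schematically (my own shorthand, not the paper’s notation or coefficients; their equation 6 is the real statement):

```python
# Schematic only: each *_loss argument stands in for the corresponding term in the paper.
def muzero_total_loss(reward_loss, value_loss, policy_loss):
    # In plain MuZero, H and G only ever see gradients flowing back from these terms,
    # i.e. from how useful their outputs were for predicting reward/value/policy.
    return reward_loss + value_loss + policy_loss

def with_similarity_term(reward_loss, value_loss, policy_loss, similarity_loss, weight=1.0):
    # The paper's addition: also train G(H(o_t), a_t) to match H(o_{t+1}) directly,
    # so the dynamics network is trained like a predictor in its own right.
    # "weight" is a hypothetical coefficient, not the paper's value.
    return reward_loss + value_loss + policy_loss + weight * similarity_loss

print(with_similarity_term(0.3, 0.5, 0.2, 0.4))  # arbitrary numbers, just to run it
```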
Thank you for the explanation!