Ah, I understand your confusion: the similarity loss they add to the RL loss function has nothing to do with exploration. It’s not meant to encourage the network to explore less “similar” states, and so it is not affected by the noisy-TV problem.
The similarity loss refers to the fact that they are training a “representation network” that takes the raw pixels $o_t$ and produces a vector that they take as their state, $s_t = H(o_t)$. They also train a “dynamics network” that predicts $s_{t+1}$ from $s_t$ and $a_t$, i.e. $\hat{s}_{t+1} = G(s_t, a_t)$. In MuZero these networks are trained directly from rewards, yet rewards are a very noisy and sparse signal for them. The authors reason that they need to provide some extra supervision to train these networks better. What they do is add a loss term that tells the networks that $G(H(o_t), a_t)$ should be very close to $H(o_{t+1})$. In effect they want the predicted next state $\hat{s}_{t+1}$ to be very similar to the state $s_{t+1}$ that actually occurs. This is a rather …obvious… addition: of course your prediction network should produce predictions that actually occur. It provides additional gradient signal to train the representation and dynamics networks, and that is what the similarity loss refers to.
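If it helps, here is a minimal sketch of what that extra term might look like (the function signature and the cosine-similarity form are my own illustration, not the paper’s exact architecture, and I’m glossing over details like whether to stop gradients on the target branch):

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: H (representation network) and G (dynamics network)
# are assumed to be torch.nn.Module instances defined elsewhere.
def similarity_loss(H, G, o_t, a_t, o_tp1):
    s_t = H(o_t)                 # s_t = H(o_t)
    s_hat_tp1 = G(s_t, a_t)      # predicted next state, s_hat_{t+1}
    s_tp1 = H(o_tp1)             # state that actually occurs, s_{t+1}
    # Negative cosine similarity: small when the prediction matches reality.
    return -F.cosine_similarity(s_hat_tp1, s_tp1, dim=-1).mean()
```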
What they do is add a loss term that tells the networks that $G(H(o_t), a_t)$ should be very close to $H(o_{t+1})$.
I am confused. Training your “dynamics network” as a predictor is precisely training that $G(H(o_t), a_t)$ should be very close to $H(o_{t+1})$. (Or rather, as you mention, you’re minimizing the difference between $s_{t+1} = H(o_{t+1})$ and $\hat{s}_{t+1} = G(s_t, a_t) = G(H(o_t), a_t)$…) How can you add a loss term that’s already present? (Or, if you’re not training your predictor by comparing predicted with actual and back-propagating error, how are you training it?)
Or are you saying that this is training the combined $G(H(o_t), a_t)$ network, including back-propagation of (part of) the error into updating the weights of $H(o_t)$ also? If so, that makes sense. Makes you wonder where else people feed the output of one NN into the input of another NN without back-propagation…
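Concretely, I’m picturing something like this toy sketch (made-up shapes and stand-in modules), where a loss computed on the composition pushes gradient into H’s weights as well as G’s:

```python
import torch

# Toy stand-ins for the representation network H and dynamics network G.
H = torch.nn.Linear(16, 8)
G = torch.nn.Linear(8 + 4, 8)

o_t, a_t, o_tp1 = torch.randn(1, 16), torch.randn(1, 4), torch.randn(1, 16)

s_hat_tp1 = G(torch.cat([H(o_t), a_t], dim=-1))  # predicted next state
s_tp1 = H(o_tp1)                                 # state that actually occurs
loss = (s_hat_tp1 - s_tp1).pow(2).mean()
loss.backward()

print(H.weight.grad is not None)  # True: the error also updates H
print(G.weight.grad is not None)  # True
```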
Or, if you’re not training your predictor by comparing predicted with actual and back-propagating error, how are you training it?
MuZero trains the predictor by putting a neural network in the “place in the algorithm where a predictor function would go”, and then training that network’s parameters by backpropagating rewards through it. So in MuZero the “predictor network” is not explicitly trained the way you would expect a predictor to be; it is only guided by rewards. The predictor network only knows that it should produce things that lead to more reward; there is no sense of being a good predictor for its own sake. The advance of this paper is to ask, “what if the place in our algorithm where a predictor function should go was actually trained like a predictor?” Check out equation 6 on page 15 of the paper.
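Roughly, the shape of the idea is something like this (the specific loss forms, the `.detach()` choice, and the weight `c` are illustrative, not the paper’s exact equation 6):

```python
import torch.nn.functional as F

def reward_only_loss(pred_reward, true_reward):
    # MuZero-style supervision: the representation/dynamics networks only
    # ever see gradient that comes from predicting rewards well.
    return F.mse_loss(pred_reward, true_reward)

def reward_plus_similarity_loss(pred_reward, true_reward, s_hat_next, s_next, c=1.0):
    # The paper's addition in spirit: an explicit "be a good predictor" term,
    # pulling the predicted next state toward the state that actually occurred.
    consistency = -F.cosine_similarity(s_hat_next, s_next.detach(), dim=-1).mean()
    return reward_only_loss(pred_reward, true_reward) + c * consistency
```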
Thank you for the explanation!