(Fair warning: I’m definitely in the “amateur” category here. Usual caveats apply—using incorrect terminology, etc, etc. Feel free to correct me.)
> they in fact add a similarity loss to the loss function of MuZero that provides extra supervision
How do they prevent noise traps? That is, picture a maze with featureless grey walls except for a wall that displays TV static. Black and white noise. Most of the rest of the maze looks very similar—but said wall is always very different. The agent ends up penalized for not sitting and watching the TV forever (or encouraged to sit and watch the TV forever. Same difference up to a constant...)
Ah, I understand your confusion: the similarity loss they add to the RL loss function has nothing to do with exploration. It’s not meant to encourage the network to explore less “similar” states, and so it is not affected by the noisy-TV problem.
The similarity loss refers to the fact that they train a “representation network” that takes the raw pixels $o_t$ and produces a vector that they take as their state, $s_t = H(o_t)$. They also train a “dynamics network” that predicts $s_{t+1}$ from $s_t$ and $a_t$: $\hat{s}_{t+1} = G(s_t, a_t)$. In MuZero these networks are trained directly from rewards, yet that is a very noisy and sparse signal for them, so the authors reason that they need some extra supervision to train them better. What they do is add a loss term that tells the networks that $G(H(o_t), a_t)$ should be very close to $H(o_{t+1})$. In effect they want the predicted next state $\hat{s}_{t+1}$ to be very similar to the state $s_{t+1}$ that in fact occurs. This is a rather …obvious… addition: of course your prediction network should produce predictions that actually occur. This provides additional gradient signal to train the representation and dynamics networks, and is what the similarity loss refers to.
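Roughly, in toy PyTorch (the MLP shapes, the cosine-similarity form, and the detach on the target branch are all simplifications of mine for illustration, not the paper’s actual convolutional architecture or exact loss):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the networks described above; all dimensions are made up.
OBS_DIM, ACT_DIM, STATE_DIM = 64, 4, 32

H = nn.Sequential(nn.Linear(OBS_DIM, STATE_DIM), nn.ReLU(),
                  nn.Linear(STATE_DIM, STATE_DIM))            # representation: o_t -> s_t
G = nn.Sequential(nn.Linear(STATE_DIM + ACT_DIM, STATE_DIM), nn.ReLU(),
                  nn.Linear(STATE_DIM, STATE_DIM))            # dynamics: (s_t, a_t) -> s_{t+1}

def similarity_loss(o_t, a_t, o_t1):
    """Push the predicted next state G(H(o_t), a_t) towards the encoding H(o_{t+1})
    of the next observation that actually occurred."""
    s_t1_pred = G(torch.cat([H(o_t), a_t], dim=-1))
    s_t1_target = H(o_t1).detach()   # treating the target branch as fixed is my choice here
    # Negative cosine similarity as the "similarity" term (plain MSE would also illustrate the idea).
    return -F.cosine_similarity(s_t1_pred, s_t1_target, dim=-1).mean()

# This extra term is simply added to MuZero's usual reward/value/policy losses, e.g.
#   total_loss = rl_loss + some_weight * similarity_loss(o_t, a_t, o_t1)

o_t  = torch.randn(8, OBS_DIM)                                   # batch of observations
a_t  = F.one_hot(torch.randint(ACT_DIM, (8,)), ACT_DIM).float()  # batch of one-hot actions
o_t1 = torch.randn(8, OBS_DIM)                                   # batch of next observations
print(similarity_loss(o_t, a_t, o_t1))
```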
> What they do is add a loss term that tells the networks that $G(H(o_t), a_t)$ should be very close to $H(o_{t+1})$.
I am confused. Training your “dynamics network” as a predictor is precisely training that $G(H(o_t), a_t)$ should be very close to $H(o_{t+1})$. (Or rather, as you mention, you’re minimizing the difference between $s_{t+1} = H(o_{t+1})$ and $\hat{s}_{t+1} = G(s_t, a_t) = G(H(o_t), a_t)$…) How can you add a loss term that’s already present? (Or, if you’re not training your predictor by comparing predicted with actual and back-propagating error, how are you training it?)
Or are you saying that this is training the combined $G(H(o_t), a_t)$ network, including back-propagation of (part of) the error into updating the weights of $H(o_t)$ also? If so, that makes sense. Makes you wonder where else people feed the output of one NN into the input of another NN without back-propagation…
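To be concrete, here is a toy illustration of what I mean by the error reaching $H$’s weights (shapes, names and the MSE loss are entirely made up, nothing to do with the paper’s actual code):

```python
import torch
import torch.nn as nn

# If the loss is computed on G(H(o_t), a_t), then backward() sends gradients
# through G *and* H, so the representation network gets updated too.
H = nn.Linear(8, 4)          # stand-in representation network
G = nn.Linear(4 + 1, 4)      # stand-in dynamics network (state + action -> next state)

o_t, a_t, o_t1 = torch.randn(1, 8), torch.randn(1, 1), torch.randn(1, 8)

pred = G(torch.cat([H(o_t), a_t], dim=-1))   # predicted next state
target = H(o_t1).detach()                    # gradients flow only through the prediction branch
loss = ((pred - target) ** 2).mean()
loss.backward()

print(H.weight.grad is not None)  # True: the similarity-style loss reached H's weights
print(G.weight.grad is not None)  # True: and G's
```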
> Or, if you’re not training your predictor by comparing predicted with actual and back-propagating error, how are you training it?
MuZero is training the predictor by using a neural network in the “place in the algorithm where a predictor function would go”, and then training the parameters of that network by backpropagating rewards through it. So in MuZero the “predictor network” is not explicitly trained as you would think a predictor would be; it is only guided by rewards. The predictor network only knows that it should produce stuff that gives more reward; there’s no sense of being a good predictor for its own sake. And the advance of this paper is to say “what if the place in our algorithm where a predictor function should go was actually trained like a predictor?” Check out equation 6 on page 15 of the paper.
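Schematically (my own shorthand, not the paper’s notation or coefficients; their equation 6 is the real statement):

```python
# Schematic only: each *_loss argument stands in for the corresponding term in the paper.
def muzero_total_loss(reward_loss, value_loss, policy_loss):
    # In plain MuZero, H and G only ever see gradients flowing back from these terms,
    # i.e. from how useful their outputs were for predicting reward/value/policy.
    return reward_loss + value_loss + policy_loss

def with_similarity_term(reward_loss, value_loss, policy_loss, similarity_loss, weight=1.0):
    # The paper's addition: also train G(H(o_t), a_t) to match H(o_{t+1}) directly,
    # so the dynamics network is trained like a predictor in its own right.
    # "weight" is a hypothetical coefficient, not the paper's value.
    return reward_loss + value_loss + policy_loss + weight * similarity_loss

print(with_similarity_term(0.3, 0.5, 0.2, 0.4))  # arbitrary numbers, just to run it
```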
Thank you for the explanation!