I found this bit slightly confusing. As far as I understand from the AGZ Nature paper, AGZ does not have a separate policy network p, but uses a single network fθ which outputs both the learned policy p and the estimated probability v that the current player will win the game. Is this what the sentence is referring to?
Yes, that's right: AGZ uses a single network fθ with a shared trunk and two heads, one producing the policy p and the other the value v.
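To make the two-headed design concrete, here is a minimal PyTorch sketch of a single network that outputs both p and v. The class name, layer sizes, and depth are illustrative placeholders; the actual network in the paper is a much deeper residual tower.

```python
import torch
import torch.nn as nn

class TwoHeadedNet(nn.Module):
    """Sketch of a single network f_theta(s) -> (p, v).
    Hypothetical sizes for illustration, not the paper's architecture."""

    def __init__(self, in_channels=17, board_size=19, hidden=64):
        super().__init__()
        # Shared convolutional trunk (AGZ uses a deep residual tower here)
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        flat = hidden * board_size * board_size
        # Policy head: logits over board_size^2 moves plus pass
        self.policy_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, board_size * board_size + 1),
        )
        # Value head: scalar in [-1, 1], the current player's expected outcome
        self.value_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 1),
            nn.Tanh(),
        )

    def forward(self, s):
        h = self.trunk(s)           # shared features for both heads
        p_logits = self.policy_head(h)
        v = self.value_head(h)
        return p_logits, v

# Usage: one forward pass yields both outputs
net = TwoHeadedNet()
p_logits, v = net(torch.zeros(1, 17, 19, 19))
```

Sharing the trunk means the policy and value losses are trained jointly against the same representation, which is part of what distinguishes AGZ from AlphaGo's separate policy and value networks.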