The second stage of the training pipeline aims at improving the policy network by policy gradient reinforcement learning (RL).
Except that they don’t seem to use the resulting network in actual play; the only use is for deriving their state-evaluation network.
Except that they don’t seem to use the resulting network in actual play; the only use is for deriving their state-evaluation network.