Edit: Actually, I'm still confused. If I'm reading the paper correctly, the SL policy network is trained to predict what the RL network would do, not to do the thing that maximizes the value of information. I'd be pretty surprised if those ended up being the same thing.
The SL policy network isn’t trained on any data from the RL policy network, just on predicting the next move in expert games.
The value network is what is trained on data from the RL policy network. It predicts whether the RL policy network would win or lose from a given position.
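To make the distinction concrete, here's a toy sketch (my own illustration, not AlphaGo's actual code) of the two training objectives: the SL policy gets a cross-entropy loss against the expert's move, while the value net gets a regression loss against the RL policy's self-play outcome. The feature sizes and linear models are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: positions are feature vectors, moves are class indices.
N_FEATURES, N_MOVES = 8, 4

def sl_policy_loss(W, positions, expert_moves):
    """Cross-entropy: train the SL policy to predict the expert's next move."""
    logits = positions @ W                       # (batch, N_MOVES)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(expert_moves)), expert_moves].mean()

def value_loss(w, positions, outcomes):
    """MSE: train the value net to predict whether the RL policy wins (+1/-1)."""
    v = np.tanh(positions @ w)                   # predicted outcome in (-1, 1)
    return ((v - outcomes) ** 2).mean()

# SL policy data: positions labelled with expert moves from human games.
positions = rng.normal(size=(16, N_FEATURES))
expert_moves = rng.integers(0, N_MOVES, size=16)
print(sl_policy_loss(rng.normal(size=(N_FEATURES, N_MOVES)), positions, expert_moves))

# Value-net data: the *same kind* of positions, but labelled with the outcome
# of RL-policy self-play games, not with any move.
outcomes = rng.choice([-1.0, 1.0], size=16)
print(value_loss(rng.normal(size=N_FEATURES), positions, outcomes))
```

The point is just that the RL policy only enters through the labels of the second dataset; the SL policy never sees it.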