Edit: Actually I’m still confused. If I’m reading the paper correctly, the SL policy network is trained to predict what the RL network would do, not to do the thing which maximizes value of information. I’d be pretty surprised if those ended up being the same thing.
The SL policy network isn’t trained on any data from the RL policy network, just on predicting the next move in expert games.
The value network is the one trained on data from the RL policy network: it predicts whether the RL policy network would win or lose from a given position.
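To make the distinction concrete, here’s a rough sketch of the two training objectives (in PyTorch, with toy stand-in networks and made-up shapes; the paper’s actual architectures are much deeper conv nets): cross-entropy against the expert’s next move for the SL policy network, and a regression against the self-play game outcome for the value network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19 * 19  # 361 points

# Toy stand-ins for the real networks (the paper uses deep conv nets).
policy_net = nn.Sequential(nn.Flatten(), nn.Linear(BOARD, BOARD))        # logits over 361 moves
value_net = nn.Sequential(nn.Flatten(), nn.Linear(BOARD, 1), nn.Tanh())  # scalar in [-1, 1]

# SL policy network: supervised on (position, expert_move) pairs from human games.
def sl_policy_loss(position, expert_move):
    logits = policy_net(position)                # (batch, 361)
    return F.cross_entropy(logits, expert_move)  # predict the expert's next move

# Value network: regression on (position, outcome) pairs from RL-policy self-play.
def value_loss(position, outcome):
    pred = value_net(position).squeeze(-1)  # (batch,)
    return F.mse_loss(pred, outcome)        # outcome is +1 (win) / -1 (loss)

# Toy batch just to show the shapes.
pos = torch.randn(8, 1, 19, 19)
moves = torch.randint(0, BOARD, (8,))
outcomes = torch.randint(0, 2, (8,)).float() * 2 - 1
print(sl_policy_loss(pos, moves).item(), value_loss(pos, outcomes).item())
```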
I agree with this. If you look at Figure 5 in the paper, 5d specifically, you can get an idea of what the policy network is doing. The policy network’s prior on what ends up being the best move is 35% (~1/3), which is a lot higher than the 1/361 a uniform prior would give it. If you assume this is about average, the policy network would give a ~120x linear speed-up in search. And this is assuming no exploration (i.e. the value network is perfect). Including exploration, I think the policy network would give exponential increases in speed, since a better prior effectively cuts the branching factor at every level of the tree, and that saving compounds with depth.
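The arithmetic behind that ~120x figure, under the (assumed) simplification that 35% is typical and that search cost scales inversely with the prior mass on the best move:

```python
# Linear speed-up estimate: prior mass the policy network puts on the
# eventual best move, relative to a uniform prior over all 361 points.
p_policy = 0.35      # prior on the best move from Figure 5d (assumed typical)
p_uniform = 1 / 361  # uniform prior over a 19x19 board

speedup = p_policy / p_uniform
print(f"~{speedup:.0f}x")  # ~126x, i.e. roughly the ~120x claimed above
```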
Edit: Looking through the paper a bit more, they actually use a “tree policy”, not a uniform prior, to calculate priors when “not using” the policy network, so what I said above isn’t entirely correct. Using 40x the compute with this tree policy would probably (?) outperform the SL policy network, but I think the extra compute spent on 40x the search would massively outweigh the compute saved by using the tree policy instead of the SL policy network. The value network uses a similar architecture to the policy network, and forward passes of both are run on every MCTS expansion, so you would only be saving ~half the compute per expansion in order to do ~40x the expansions.
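A toy cost model of that trade-off (all numbers are assumptions, just to make the bookkeeping explicit): if the policy and value forward passes cost about the same and the tree policy is roughly free, dropping the SL policy network halves the per-expansion cost, but the 40x extra expansions dominate.

```python
# Toy cost model for MCTS expansions (units are arbitrary; all values assumed).
c_value  = 1.0  # value network forward pass per expansion
c_policy = 1.0  # SL policy network forward pass (similar architecture, so ~same cost)
c_tree   = 0.0  # fast tree policy, treated as roughly free here

cost_with_sl_policy   = 1  * (c_value + c_policy)  # N expansions
cost_with_tree_policy = 40 * (c_value + c_tree)    # ~40x the expansions

print(cost_with_tree_policy / cost_with_sl_policy)  # ~20x more total compute
```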