I would guess that the policy network still outperforms.
I agree with this. If you look at Figure 5 in the paper, 5d specifically, you can get an idea of what the policy network is doing. The policy network's prior on what ends up being the best move is 35% (~1/3), which is a lot higher than the 1/361 a uniform prior would give it. If you take that as the average case, the policy network concentrates ~126x more probability on the best move than a uniform prior does (0.35 × 361 ≈ 126), so call it a ~120x linear speed-up in search. And this is assuming no exploration (i.e. that the value network is perfect). Including exploration, I think the policy network would give an exponential speed-up.
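To make that arithmetic explicit, here's a quick back-of-the-envelope sketch. The 0.35 is read off Figure 5d; everything else is just the 19×19 board size, and the whole thing is a rough average-case estimate, not a measurement:

```python
# Rough estimate of the linear speed-up from the policy network's prior,
# treating speed-up as the ratio of (policy prior on the eventual best move)
# to (uniform prior over all board points).

board_points = 19 * 19           # 361 points on a 19x19 board
uniform_prior = 1 / board_points
policy_prior = 0.35              # read off Figure 5d in the paper

speedup = policy_prior / uniform_prior
print(f"~{speedup:.0f}x linear speed-up")  # ~126x
```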
Edit: Looking through the paper a bit more, they actually use a "tree policy", not a uniform prior, to calculate priors when "not using" the policy network, so what I said above isn't entirely correct. Using 40x the compute with this tree policy would probably (?) outperform the SL policy network, but I think the extra compute spent on 40x the search would massively outweigh the compute saved by swapping the SL policy network for the tree policy. The value network uses a similar architecture to the policy network, and a forward pass of each is run on every MCTS expansion, so you would be saving only ~half the compute per expansion in order to do ~40x the expansions.
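To spell out that compute comparison, here's a rough sketch under the simplifying assumption that a policy-net forward pass and a value-net forward pass cost about the same (they share a similar architecture, per the paper). The unit costs are illustrative assumptions, not measured numbers:

```python
# Rough comparison of total network compute per unit of search, assuming
# (illustratively) that one policy-net and one value-net forward pass each
# cost 1 unit, since the architectures are similar.

cost_policy_fwd = 1.0   # assumed cost of one policy-network forward pass
cost_value_fwd = 1.0    # assumed cost of one value-network forward pass

# With the SL policy network: both networks run once per expansion.
with_policy_net = cost_policy_fwd + cost_value_fwd

# With the cheap tree policy: only the value network runs per expansion,
# but you do ~40x as many expansions to compensate.
with_tree_policy = 40 * cost_value_fwd

print(f"~{with_tree_policy / with_policy_net:.0f}x more total compute")  # ~20x
```

So even granting the tree policy's per-expansion savings, the 40x search budget costs roughly 20x more network compute overall under these assumptions.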