I would guess that eliminating the fancy policy network (and spending ~40x more compute on search—not 361x, because presumably you search over several branches suggested by the policy) would in fact improve performance.
I would guess that the policy network still outperforms. Not based on any deep theoretical knowledge, just based on “I expect someone at DeepMind tried that, and if it had worked I would expect to see something about it in one of the appendices”.
Probably worth actually trying out though, since KataGo exists.
I would guess that the policy network still outperforms.
I agree with this. If you look at Figure 5 in the paper, 5d specifically, you can get an idea of what the policy network is doing. The policy network’s prior on the move that ends up being best is 35% (~1/3), far higher than the 1/361 a uniform prior would assign. If you assume that figure is typical, the policy network gives a ~120x linear speed-up in search. And that assumes no exploration (i.e. a perfect value network); with exploration, I think the policy network’s advantage compounds at every level of the tree, giving an exponential speed-up.
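The ~120x figure above can be sanity-checked with quick arithmetic (a sketch, assuming the 35% prior from Figure 5d is typical across positions, which is a simplification):

```python
# Back-of-the-envelope check of the ~120x linear speed-up claim.
board_points = 19 * 19           # 361 points on an empty Go board
uniform_prior = 1 / board_points # uniform prior on the best move
policy_prior = 0.35              # policy network's prior on the eventual best move (Fig. 5d)

# Ratio of the two priors ~ how much the policy network concentrates search
speedup = policy_prior / uniform_prior
print(f"~{speedup:.0f}x")        # ~126x, i.e. the ~120x figure above
```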
Edit: Looking through the paper a bit more, they actually use a “tree policy”, not a uniform prior, to calculate priors when “not using” the policy network, so what I said above isn’t entirely correct. Using 40x the compute with this tree policy would probably (?) outperform the SL policy network, but I think the extra compute spent on 40x the search would massively outweigh the compute saved by swapping the SL policy network for the tree policy. The value network uses a similar architecture to the policy network, and a forward pass of each is run on every MCTS expansion, so you would save only ~half the compute per expansion while doing ~40x as many expansions.
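The compute bookkeeping in that last point can be made explicit (a back-of-the-envelope sketch, assuming the policy and value networks cost about the same per forward pass, as their similar architectures suggest):

```python
# Dropping the policy network saves ~half the cost of each MCTS expansion
# (value net + policy net -> value net only), but doing 40x as many
# expansions still costs far more in total.
policy_cost = 1.0   # relative cost of one policy-network forward pass
value_cost = 1.0    # assumed similar, given the similar architecture

cost_with_policy = policy_cost + value_cost  # per-expansion cost, both nets
cost_without_policy = value_cost             # per-expansion cost, value net only
expansions_multiplier = 40                   # 40x the search

relative_total = (expansions_multiplier * cost_without_policy) / cost_with_policy
print(f"~{relative_total:.0f}x the total compute")  # ~20x, not a saving
```

So even with the per-expansion saving, the no-policy-network variant spends roughly 20x the compute overall.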