I think we shouldn’t read all that much into AlphaGo given that it’s outperformed by AlphaZero/MuZero.
Also, I think the most likely explanation is that you misread the paper. I'd bet the ablation analysis takes out the policy network used to pick promising moves during search (not just the surface-level policy network used at the top of the tree), and that they used the same amount of compute in the ablations (i.e. they reduced the search depth rather than doing brute-force search to the same depth).
I would guess that eliminating the fancy policy network (and spending ~40x more compute on search—not 361x, because presumably you search over several branches suggested by the policy) would in fact improve performance.
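To make concrete where the policy network enters the search (i.e. what the ablation would actually remove), here's a minimal sketch of a PUCT-style selection step of the kind AlphaGo's MCTS uses. The class, the numbers, and the uniform fallback are mine for illustration; the fallback stands in for "search without a policy prior":

```python
import math
from dataclasses import dataclass

@dataclass
class Child:
    visit_count: int = 0
    mean_value: float = 0.0  # average value estimate for this move so far

def select_move(children, policy_priors=None, c_puct=5.0):
    """PUCT-style selection over one node's children (illustrative sketch).

    children: dict mapping move -> Child statistics.
    policy_priors: dict mapping move -> prior from the policy network; if
    None, fall back to a uniform prior, which is roughly what "removing
    the policy network from search" would mean.
    """
    total_visits = sum(c.visit_count for c in children.values())

    def score(move):
        c = children[move]
        prior = policy_priors[move] if policy_priors else 1.0 / len(children)
        # The exploration bonus is proportional to the prior, so a strong
        # prior concentrates visits on a handful of candidate moves.
        u = c_puct * prior * math.sqrt(total_visits + 1) / (1 + c.visit_count)
        return c.mean_value + u

    return max(children, key=score)

# Toy example: three candidate moves, policy puts 35% of its mass on one.
children = {"D4": Child(10, 0.52), "Q16": Child(4, 0.49), "K10": Child(1, 0.45)}
priors = {"D4": 0.35, "Q16": 0.05, "K10": 0.02}
print(select_move(children, priors))              # policy-guided search
print(select_move(children, policy_priors=None))  # "ablated": uniform prior
```

Without the prior, the exploration term pushes visits toward rarely-visited moves regardless of how plausible they are, which is where the extra search cost comes from.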
I would guess that the policy network still outperforms. That's not based on any deep theoretical knowledge, just on "I expect someone at DeepMind tried that, and if it had worked I would expect to see something about it in one of the appendices".
Probably worth actually trying out though, since KataGo exists.
I agree with this. If you look at Figure 5 in the paper, 5d specifically, you can get an idea of what the policy network is doing. The policy network's prior on what ends up being the best move is 35% (~1/3), which is a lot higher than the 1/361 a uniform prior would give it. If you assume that figure is typical, the policy network gives a ~120x linear speed-up in search, and that's assuming no exploration (i.e. a perfect value network). Once you include exploration, I think the policy network gives speed-ups that are exponential in search depth.
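For concreteness, here's the back-of-the-envelope version of that estimate (the 35% figure is the one read off Figure 5d; the candidates-per-node and depth numbers are just illustrative assumptions):

```python
# Rough speed-up from a concentrated prior, using the numbers above.
uniform_prior = 1 / 361   # empty 19x19 board: every move equally likely
policy_prior = 0.35       # prior mass the policy net puts on the best move (Fig. 5d)

print(f"linear speed-up at a single node: ~{policy_prior / uniform_prior:.0f}x")  # ~126x

# With exploration, the gain compounds with depth: if the policy effectively
# narrows each node from ~361 moves to ~k candidates, the searched tree
# shrinks like (361 / k) ** depth rather than linearly.
candidates_per_node = 40  # assumed effective branching under the policy
for depth in (1, 2, 4):
    print(f"depth {depth}: ~{(361 / candidates_per_node) ** depth:,.0f}x fewer nodes")
```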
Edit: Looking through the paper a bit more, they actually use a "tree policy", not a uniform prior, to calculate priors when "not using" the policy network, so what I said above isn't entirely correct. Using 40x the compute with this tree policy would probably (?) outperform the SL policy network, but I think the extra compute spent on 40x the search would massively outweigh the compute saved by using the tree policy instead of the SL policy network. The value network uses a similar architecture to the policy network, and a forward pass of each is run on every MCTS expansion, so you would be saving only ~half the compute per expansion in order to do ~40x the expansions.
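Rough arithmetic for that last point, under the assumption (mine) that a policy-net forward pass and a value-net forward pass cost about the same:

```python
# Compute per MCTS expansion, in arbitrary "forward pass" units. Assumes the
# policy and value networks cost about the same to evaluate, and that the
# hand-crafted tree policy is roughly free by comparison.
cost_with_policy_net = 1.0 + 1.0  # value pass + policy pass per expansion
cost_without = 1.0                # value pass only

expansions_without = 40           # the hypothetical 40x-more-search ablation

savings = 1 - cost_without / cost_with_policy_net
total_ratio = (expansions_without * cost_without) / (1 * cost_with_policy_net)
print(f"dropping the policy net saves ~{savings:.0%} per expansion")         # ~50%
print(f"but 40x the expansions costs ~{total_ratio:.0f}x more compute overall")  # ~20x
```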