I think it’s a combination of (1) the policy network helps to narrow the search tree efficiently and (6) it’s not quite true that the policy network is just doing a cheaper version of “try all the possible moves and evaluate them all” because “what are the best moves to search?” isn’t necessarily quite the same question as “what moves look like they have the best win probability?”. (E.g., suppose there’s some class of moves that usually don’t work but occasionally turn out to be very good, and suppose the network can identify such moves but not reliably tell which they are. You probably want to search those moves even though their win probability isn’t great.)
On the latter point: In the training of AlphaGo and successors like KataGo, the training target for the policy network is something like “relative number of explorations below this node in the search tree”. (KataGo does a slightly fancy thing where during training it deliberately adds noise at the root of the tree, leading to unnaturally many visits to less-good nodes, but it also does a slightly fancy thing that tries to correct for this.) In particular, the policy network isn’t trained on estimated win rates or scores or anything like that, nor on what moves actually get played; it is trying to identify moves that when a proper search is done will turn out to be worth looking at deeply, rather than good or likely moves as such.
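For concreteness, here is a minimal sketch of what that kind of training target looks like in an AlphaZero/KataGo-style setup (illustrative names and numbers, not actual KataGo code): after a search from a position, the policy target is just the visit counts of the root's children, normalized to a distribution.

```python
import numpy as np

def policy_training_target(root_child_visits):
    """Illustrative AlphaZero/KataGo-style policy target: the relative number
    of search visits each root move received, normalized to a distribution."""
    moves = list(root_child_visits.keys())
    visits = np.array([root_child_visits[m] for m in moves], dtype=float)
    return dict(zip(moves, visits / visits.sum()))

# e.g. a search that spent most of its budget under one move:
print(policy_training_target({"D4": 620, "Q16": 250, "C3": 30}))
# {'D4': 0.688..., 'Q16': 0.277..., 'C3': 0.033...}
```

Note that the network is being pushed toward "moves the search spent time on", not directly toward "moves with the best win rate".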
So if I’m understanding correctly:

The expected value of exploring a node in the game tree is different from the expected value of playing that node, and the policy network can be viewed as trying to choose the move distribution with maximal value of information.
And so one could view the outputs of the SL policy network as approximating the expected utility of exploring this region of the game tree, though you couldn’t directly trade off between winning-the-game utility and exploring-this-part-of-the-game-tree utility due to the particular way AlphaGo is used (flat 5s of search per turn).
Thanks!
Edit: Actually I’m still confused. If I’m reading the paper correctly, the SL policy network is trained to predict what the RL network would do, not to do the thing which maximizes value of information. I’d be pretty surprised if those ended up being the same thing as each other.
Yes, the expected value of playing a move is not the same as the expected value of exploring it while searching. “Maximal value of information” would be nice, but the training isn’t explicitly aiming for that (I suspect that would be difficult); it aims for the simpler approximation of “how much was the subtree under this node looked at in a search?”.
So the idea is: suppose that when looking at positions like this one, it turned out on average that the search spent a lot of time exploring moves like this one. Then this is probably a good move to explore.
This isn’t the same thing as value of information. (At least, I don’t think it is. Though for all I know there might be some theorems saying that for large searches it tends to be, or something?) But it’s a bit like value of information, because the searching algorithm tries to spend its time looking at moves that are useful to look at, and if it’s doing a good job of this then more-explored moves are something like higher-expected-value-of-information ones.
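To make “spend its time looking at moves that are useful to look at” concrete, here is a sketch along the lines of the PUCT selection rule used in the AlphaGo family (illustrative constants and numbers, not the engines’ actual code). The policy prior enters as an exploration bonus, so a move with a mediocre current value estimate but a high prior still gets searched:

```python
import math

def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.5):
    # Exploration term: proportional to the policy prior, shrinking as the
    # child accumulates visits of its own.
    return q_value + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

def select_child(children, parent_visits):
    """children: list of (move, q_value, prior, visit_count) tuples."""
    return max(children, key=lambda c: puct_score(c[1], c[2], parent_visits, c[3]))

# Move B has a worse value estimate than A but a much higher prior, so the
# search explores it anyway.
children = [("A", 0.55, 0.05, 40), ("B", 0.48, 0.60, 5)]
print(select_child(children, parent_visits=45))  # -> the "B" tuple
```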
I’d be pretty surprised if those ended up being the same thing as each other.
Yes, there are several levels here, and it can get confusing which one you’re fiddling with or ‘exploring’ in. This can make the (pre-DL) MCTS literature hard to read, because it can get fairly baroque as you use different heuristics at each level. What is useful for making choices about how to explore the tree, so as to eventually make the optimal choice, is not the same thing as just bandit sampling. (It’s like the difference between best-arm finding and simple regret-minimizing bandits: you don’t want to ‘minimize long-run regret’ when you explore a MCTS tree, because there is no long run: you want to make the right choice after you’ve done a small number of rollouts, because you need to take an action, and then you will start planning again.) So while you can explore a MCTS tree with something simple & convenient like Bayesian Thompson sampling (if you’re predicting win rates as a binomial, that’s conjugate and so is very fast), it won’t work as well as something deliberately targeting exploration, or something budget-aware that tries to maximize the probability of finding the best action within n rollouts.
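Here is roughly what that “simple & convenient” Beta-binomial Thompson sampling option looks like at a single node (a toy sketch, not anything an actual engine uses): the posterior update is just counting wins and losses, which is why it is so cheap, but nothing in it is aware of the rollout budget or of the fact that you only get to commit to one move at the end.

```python
import random

class ArmStats:
    """Beta-Bernoulli posterior over a move's win rate: just win/loss counts."""
    def __init__(self):
        self.wins = 0
        self.losses = 0

def thompson_pick(arms):
    # Sample a plausible win rate for each move from its Beta posterior and
    # explore the move whose sample is highest.
    samples = {m: random.betavariate(s.wins + 1, s.losses + 1) for m, s in arms.items()}
    return max(samples, key=samples.get)

def run_search(arms, rollout, budget):
    for _ in range(budget):
        move = thompson_pick(arms)
        if rollout(move):
            arms[move].wins += 1
        else:
            arms[move].losses += 1
    # After the budget is spent you must commit to a single action; this final
    # step is exactly where regret minimization and best-arm finding come apart.
    return max(arms, key=lambda m: arms[m].wins + arms[m].losses)

true_win_rates = {"A": 0.48, "B": 0.52, "C": 0.30}
arms = {m: ArmStats() for m in true_win_rates}
print(run_search(arms, lambda m: random.random() < true_win_rates[m], budget=500))
```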
Edit: Actually I’m still confused. If I’m reading the paper correctly, the SL policy network is trained to predict what the RL network would do, not to do the thing which maximizes value of information. I’d be pretty surprised if those ended up being the same thing as each other.
The SL policy network isn’t trained on any data from the RL policy network, just on predicting the next move in expert games.
The value network is what is trained on data from the RL policy network. It predicts whether the RL policy network would win or lose from a given position.
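For clarity, a toy sketch of the two different training targets described in the AlphaGo paper (plain Python with made-up numbers, not the actual training code): the SL policy network gets a cross-entropy loss against the human expert’s move, while the value network gets a regression loss against the outcome of a game the RL policy played out from the position.

```python
import numpy as np

def sl_policy_loss(policy_logits, expert_move_index):
    """Cross-entropy against the move a human expert actually played here."""
    logits = policy_logits - policy_logits.max()          # stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[expert_move_index]

def value_loss(predicted_value, game_outcome):
    """Squared error against the outcome (+1 win / -1 loss) of an RL-policy
    self-play game from this position."""
    return (predicted_value - game_outcome) ** 2

print(sl_policy_loss(np.array([0.2, 1.5, -0.3]), expert_move_index=1))
print(value_loss(predicted_value=0.4, game_outcome=1.0))
```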