So if I’m understanding correctly:

The expected value of exploring a node in the game tree is different from the expected value of playing that node, and the policy network can be viewed as trying to choose the move distribution with maximal value of information.
And so one could view the outputs of the SL policy network as approximating the expected utility of exploring this region of the game tree, though you couldn’t directly trade off winning-the-game utility against exploring-this-part-of-the-game-tree utility, because of the particular way AlphaGo is used (a flat 5 seconds of search per turn).
Thanks!
Edit: Actually I’m still confused. If I’m reading the paper correctly, the SL policy network is trained to predict what the RL network would do, not to do the thing which maximizes value of information. I’d be pretty surprised if those ended up being the same thing as each other.
Yes, the expected value of playing a move is not the same as the expected value of exploring it while searching. “Maximal value of information” would be nice, but the training isn’t explicitly going for that (I suspect that would be difficult); instead it targets the simpler approximation of “how much was the subtree under this node looked at in a search?”.
So the idea is: suppose that when looking at positions like this one, it turned out on average that the search spent a lot of time exploring moves like this one. Then this is probably a good move to explore.
This isn’t the same thing as value of information. (At least, I don’t think it is. Though for all I know there might be some theorems saying that for large searches it tends to be, or something?) But it’s a bit like value of information, because the searching algorithm tries to spend its time looking at moves that are useful to look at, and if it’s doing a good job of this then more-explored moves are something like higher-expected-value-of-information ones.
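For concreteness, a minimal sketch of how such a prior gets used during search, in the PUCT style AlphaGo’s tree search is based on: the policy network’s probability for a move is treated as a “worth exploring” bonus that decays as the move accumulates visits. (The `Node` fields and `c_puct` constant here are illustrative, not AlphaGo’s actual code.)

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                  # policy network's probability for the move into this node
    visit_count: int = 0          # rollouts that have passed through this node
    value_sum: float = 0.0        # sum of backed-up values from those rollouts
    children: list = field(default_factory=list)

def select_child(node, c_puct=1.0):
    """Pick which child to explore next: average value so far (exploitation)
    plus a prior-weighted bonus that shrinks as the child gets visited."""
    total_visits = sum(c.visit_count for c in node.children)
    def puct(child):
        q = child.value_sum / child.visit_count if child.visit_count else 0.0
        u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        return q + u
    return max(node.children, key=puct)
```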
> I’d be pretty surprised if those ended up being the same thing as each other.
Yes, there are several levels here, and it can get confusing which one you’re fiddling with or ‘exploring’ in. This can make the (pre-DL) MCTS literature hard to read, because it gets fairly baroque as you use different heuristics at each level. What is useful for making choices about how to explore the tree so as to eventually make the optimal choice is not the same thing as just bandit sampling. (It’s like the difference between best-arm identification and a simple bandit minimizing regret: you don’t want to ‘minimize long-run regret’ when you explore an MCTS tree, because there is no long run: you want to make the right choice after you’ve done a small number of rollouts, because you need to take an action, and then you will start planning again.) So while you can explore an MCTS tree with something simple & convenient like Bayesian Thompson sampling (if you’re predicting win rates as a binomial, that’s conjugate and so is very fast), it won’t work as well as something deliberately targeting exploration, or something budget-aware that tries to maximize the probability of finding the best action within n rollouts.
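For illustration, the “simple & convenient” version being waved at here might look something like this (the per-move wins/losses rollout counters are hypothetical bookkeeping, not anything from the AlphaGo paper):

```python
import random

def thompson_pick(moves, stats):
    """Choose the move to roll out next by sampling a plausible win rate
    for each move from its Beta posterior and taking the argmax.
    stats[move] = (wins, losses) holds that move's rollout results so far.
    With binomial outcomes and a Beta(1, 1) prior the posterior is just
    Beta(1 + wins, 1 + losses), which is why the update is so cheap."""
    def sampled_win_rate(move):
        wins, losses = stats.get(move, (0, 0))
        return random.betavariate(1 + wins, 1 + losses)
    return max(moves, key=sampled_win_rate)
```

Which illustrates the problem: this allocates rollouts as if minimizing regret during the search were the goal, rather than maximizing the probability of having identified the best move by the time the rollout budget runs out.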
> Edit: Actually I’m still confused. If I’m reading the paper correctly, the SL policy network is trained to predict what the RL network would do, not to do the thing which maximizes value of information. I’d be pretty surprised if those ended up being the same thing as each other.
The SL policy network isn’t trained on any data from the RL policy network, just on predicting the next move in expert games.
The value network is what is trained on data from the RL policy network. It predicts if the RL policy network would win or lose from a certain position.
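To make the two training targets concrete, a rough sketch (tiny stand-in networks and fake data; the real models are deep convolutional nets over board feature planes, and every name here is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-ins for the real convolutional networks.
policy_net = nn.Linear(19 * 19, 19 * 19)   # position features -> move logits
value_net = nn.Linear(19 * 19, 1)          # position features -> win prediction

# SL policy network: plain next-move classification on human expert games.
board_features = torch.randn(32, 19 * 19)          # batch of expert-game positions (fake)
expert_move = torch.randint(0, 19 * 19, (32,))     # index of the move the expert played
policy_loss = F.cross_entropy(policy_net(board_features), expert_move)

# Value network: regression toward the game outcome of RL-policy self-play.
selfplay_features = torch.randn(32, 19 * 19)       # positions from RL self-play games (fake)
game_outcome = torch.randint(0, 2, (32,)).float() * 2 - 1   # +1 if the RL player won, -1 if it lost
value_loss = F.mse_loss(torch.tanh(value_net(selfplay_features)).squeeze(-1), game_outcome)
```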