Yes, the expected value of playing a move is not the same as the expected value of exploring it while searching. "Maximal value of information" would be nice, but the training isn't explicitly aiming for that (I suspect that would be difficult); instead it goes for the simpler approximation of "how much was the subtree under this node looked at in a search?".
So the idea is: suppose that when looking at positions like this one, it turned out on average that the search spent a lot of time exploring moves like this one. Then this is probably a good move to explore.
This isn’t the same thing as value of information. (At least, I don’t think it is. Though for all I know there might be some theorems saying that for large searches it tends to be, or something?) But it’s a bit like value of information, because the searching algorithm tries to spend its time looking at moves that are useful to look at, and if it’s doing a good job of this then more-explored moves are something like higher-expected-value-of-information ones.
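To make the "how much was this subtree looked at" target concrete, here's a minimal sketch (not any particular engine's code, and all names are illustrative): the visit counts from a finished search get normalized into a distribution, and the policy network is trained to predict that distribution, so it learns to assign high prior probability to moves the search tended to spend a lot of time on in similar positions.

```python
import numpy as np

def policy_target_from_visits(visit_counts, temperature=1.0):
    """Normalize search visit counts into a probability distribution.

    `visit_counts` maps each legal move to how many times the search
    visited its subtree. The policy net is then trained (e.g. with
    cross-entropy) against this distribution.
    """
    moves = list(visit_counts.keys())
    counts = np.array([visit_counts[m] for m in moves], dtype=np.float64)
    counts = counts ** (1.0 / temperature)   # temperature sharpens or flattens the target
    probs = counts / counts.sum()
    return dict(zip(moves, probs))

# Hypothetical example: the search spent most of its time on one move,
# so that move gets most of the probability mass in the training target.
print(policy_target_from_visits({"e2e4": 620, "d2d4": 300, "g1f3": 80}))
```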