MCTS works as amplification because you can evaluate future board positions to get a convergent estimate of how well you’re doing, and then eventually someone actually wins the game, which keeps the learned policy p from departing reality entirely. Importantly, the single thing you’re learning can also play the role of the environment, by picking the opponent’s moves.
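(To make that concrete, here’s a minimal toy sketch, not AGZ itself: a Nim-like game and a tabular value “network” stand in for Go and the real model, and the search is collapsed to greedy one-step lookahead rather than full MCTS. The point is just that the same learned thing picks moves for both sides, and the terminal result is what the value estimates get pulled toward.)

```python
import random
from collections import defaultdict

# Toy game: players alternately remove 1-3 stones; taking the last stone wins.
def legal_moves(stones):
    return [m for m in (1, 2, 3) if m <= stones]

value = defaultdict(float)   # learned value per position (stand-in for a value head)
counts = defaultdict(int)

def self_play_episode(stones=10):
    trajectory = []          # (position, player-to-move) pairs
    player = 0
    while stones > 0:
        moves = legal_moves(stones)
        # The same learned model picks the move for whichever side is to act:
        # prefer the move that leaves the opponent in the worst position.
        move = max(moves, key=lambda m: -value[stones - m])
        if random.random() < 0.2:          # a little exploration noise
            move = random.choice(moves)
        trajectory.append((stones, player))
        stones -= move
        player = 1 - player
    winner = 1 - player                    # whoever took the last stone wins
    # The actual game outcome is the grounding signal: pull value estimates toward it.
    for pos, who in trajectory:
        z = 1.0 if who == winner else -1.0
        counts[pos] += 1
        value[pos] += (z - value[pos]) / counts[pos]
    return winner

for _ in range(2000):
    self_play_episode()
```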
In trying to train A to predict human actions given access to A, you’re almost doing something similar. You have a prediction that’s also supposed to be a prediction of the environment (the human), so you can use it for both sides of a tree search. But A isn’t actually searching through an interesting tree—it’s searching for cycles of length 1 in its own model of the environment, with no particular guarantee that any cycles of length 1 exist or are a good idea. “Tree search” in this context (I think) means spraying out a bunch of outputs and hoping at least one falls into a fixed point upon iteration.
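(Here’s roughly what I mean by that last sentence, as a hypothetical sketch: `model` is a stand-in for A’s prediction of the environment’s response, and we just sample candidates, iterate, and keep anything that maps approximately to itself. Nothing guarantees the returned list is non-empty.)

```python
import numpy as np

def find_fixed_points(model, n_candidates=1000, n_iters=50, tol=1e-3, dim=4, seed=0):
    rng = np.random.default_rng(seed)
    fixed_points = []
    for _ in range(n_candidates):
        x = rng.normal(size=dim)            # "spray out" a candidate output
        for _ in range(n_iters):
            x = model(x)                    # iterate A's model of the environment
        if np.linalg.norm(model(x) - x) < tol:
            fixed_points.append(x)          # an (approximate) cycle of length 1
    return fixed_points                     # may be empty: nothing says any exist

# Example with a contraction, which does have a fixed point (x = 2 in every coordinate).
example_model = lambda x: 0.5 * x + 1.0
print(find_fixed_points(example_model)[0])
```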
EDIT: Big oops, I didn’t actually understand what was being talked about here.
I agree there is a real sense in which AGZ is “better-grounded” (and more likely to be stable) than iterated amplification in general. (This was some of the motivation for the experiments here.)
Oh, I’ve just realized that the “tree” was always intended to be something like task decomposition. Sorry about that—that makes the analogy a lot tighter.
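(Spelled out as a hypothetical sketch, with `decompose`, `recombine`, and `A` as made-up stand-ins: the “tree” is a question split into subquestions, each answered by the same model, possibly after further decomposition, and then recombined.)

```python
def decompose(question):
    mid = len(question) // 2
    return [question[:mid], question[mid:]]           # toy split into two subquestions

def recombine(question, subanswers):
    return " + ".join(subanswers)                     # toy recombination of subanswers

def A(question):
    return f"<A's answer to {question!r}>"            # the distilled model answers leaves

def amplify(question, depth=2):
    if depth == 0:
        return A(question)
    return recombine(question, [amplify(q, depth - 1) for q in decompose(question)])

print(amplify("How should the agent respond to X?"))
```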
Isn’t A also grounded in reality, since the recursion eventually bottoms out with no A to consult?
This is true when getting training data, but I think it’s a difference between A (or HCH) and AlphaGo Zero when doing simulation / amplification. Someone wins a simulated game of Go even if both players are making bad moves (or even random moves), which gives you a signal that A doesn’t have access to.
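(A toy illustration of that, using the same made-up Nim-like rules as the earlier sketch: even under completely random play the game terminates and hands back a definite winner, so there is always some grounded signal.)

```python
import random

def random_rollout(stones=10):
    # Toy rules: remove 1-3 stones, taking the last stone wins; moves are uniform random.
    player = 0
    while stones > 0:
        stones -= random.choice([m for m in (1, 2, 3) if m <= stones])
        player = 1 - player
    return 1 - player            # the game always ends with a definite winner

results = [random_rollout() for _ in range(1000)]
print("player 0 win rate under random play:", results.count(0) / len(results))
```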