‘reward is not the optimization target!* *except when it is in these annoying exceptions like AlphaZero, but fortunately, we can ignore these, because after all, it’s not like humans or AGI or superintelligences would ever do crazy stuff like “plan” or “reason” or “search”’.
If you’re going to mock me, at least be correct when you do it!
I think that reward is still not the optimization target in AlphaZero (the way I’m using the term, at least). Learning a leaf node evaluator on a given reinforcement signal, and then bootstrapping the leaf node evaluator via MCTS on that leaf node evaluator, does not mean that the aggregate trained system:
1. directly optimizes for the reinforcement signal,
2. “cares” about that reinforcement signal, or
3. “does its best” to optimize the reinforcement signal (as opposed to some historical reinforcement correlate, like winning or capturing pieces or something stranger).
If most of the “optimization power” were coming from e.g. MCTS on direct reward signal, then yup, I’d agree that the reward signal is the primary optimization target of this system. That isn’t the case here.
You might use the phrase “reward as optimization target” differently than I do, but if we’re just using words differently, then it wouldn’t be appropriate to describe me as “ignoring planning.”
Learning a leaf node evaluator on a given reinforcement signal, and then bootstrapping the leaf node evaluator via MCTS on that leaf node evaluator, does not mean that the aggregate trained system:
1. directly optimizes for the reinforcement signal,
2. “cares” about that reinforcement signal, or
3. “does its best” to optimize the reinforcement signal (as opposed to some historical reinforcement correlate, like winning or capturing pieces or something stranger).
Yes, it does mean all of that, because MCTS is asymptotically optimal (unsurprisingly, given that it’s a tree search on the model), and will, e.g., happily optimize the reinforcement signal rather than proxies like capturing pieces, as it learns through search that capturing pieces in particular states is not as useful as usual. If you expand out the search tree long enough (whether or not you use the AlphaZero NN to make that expansion more efficient by evaluating intermediate nodes and then back-propagating that through the current tree), then it converges on the complete, true, ground-truth game tree, with all leaves evaluated with the true reward and any imperfections in the leaf evaluator’s value estimates washed out. It directly optimizes the reinforcement signal, cares about nothing else, and is very pleased to lose if that results in a higher reward, or to not capture pieces if that results in a higher reward.*
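To make the ‘washed out’ claim concrete, here is a minimal toy sketch of my own (plain depth-limited negamax on a tiny Nim game rather than MCTS on Go, with a deliberately bad ‘material’ heuristic standing in for a misgeneralized leaf evaluator): the shallow search trusts the proxy and picks the losing move, while the deep search reaches the terminal rewards and the proxy never even gets consulted.

```python
def moves(n):
    """Legal moves: remove 1 or 2 stones from a heap of n (taking the last stone wins)."""
    return [m for m in (1, 2) if m <= n]

def negamax(n, depth, leaf_eval):
    """Value of a heap of n for the player to move.

    Terminal positions return the true reward (-1: the player to move has
    already lost). Non-terminal positions at the depth cutoff fall back to
    the (possibly wrong) learned leaf evaluator."""
    if n == 0:
        return -1.0
    if depth == 0:
        return leaf_eval(n)
    return max(-negamax(n - m, depth - 1, leaf_eval) for m in moves(n))

def best_move(n, depth, leaf_eval):
    return max(moves(n), key=lambda m: -negamax(n - m, depth - 1, leaf_eval))

# A bad proxy heuristic: "the more stones still on the board, the better"
# (a stand-in for 'capturing pieces is good').
proxy_eval = lambda n: 0.1 * n

print(best_move(4, depth=1, leaf_eval=proxy_eval))  # 2 -- trusts the proxy, loses
print(best_move(4, depth=4, leaf_eval=proxy_eval))  # 1 -- search reaches the true reward, wins
```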
All the NN is, is a cache or an amortization of the search algorithm. Caches are important and life would be miserable without them, but it would be absurd to say that adding a cache to a function means “the function doesn’t really compute the function anymore” or “the range is not the target of the function”.
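In the same toy terms (again my own sketch, reusing negamax from above, nothing from AlphaZero itself): bolting a cache onto the exact search changes how cheaply the value is produced, not what is being computed, which is the sense in which the value network amortizes the search rather than redefining its target.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def cached_value(n):
    """Exact game value of a heap of n stones, with a cache bolted on.

    The cache (like the value net, in this analogy) only changes how fast we
    get the answer; the answer itself is still defined by the terminal reward."""
    if n == 0:
        return -1.0
    return max(-cached_value(n - m) for m in (1, 2) if m <= n)

# Same values as an uncached full-depth search that never consults a leaf evaluator:
assert all(cached_value(n) == negamax(n, depth=n, leaf_eval=lambda _: 0.0)
           for n in range(1, 10))
```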
I’m a little baffled by this argument that, because the NN is not already omniscient and might mis-estimate the value of a leaf node, it’s apparently not optimizing for the reward, the reward is not the goal of the system, and the system doesn’t care about reward, no matter how much it converges toward said reward as it plans/searches more, or gets better at acquiring said reward as it fixes those errors.
If most of the “optimization power” were coming from e.g. MCTS on direct reward signal, then yup, I’d agree that the reward signal is the primary optimization target of this system.
The reward signal is in fact the primary optimization target, because it is what the neural net’s value estimates derive from, and the ‘system’ eventually corrects them and converges. The dog wags the tail, sooner or later.
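As a concrete toy picture of that derivation, continuing the Nim sketch above and only loosely patterned on the AlphaZero loop (all names and numbers are my own choices): move selection bootstraps on a shallow search over the current value table, and the only quantity the table is ever regressed toward is the terminal win/loss reward, so whatever it converges to is determined by the reward signal rather than by the ‘material’ proxy it was initialized with.

```python
import random

def self_play_train(num_games=2000, start=6, depth=2, lr=0.3, eps=0.2):
    """Tabular stand-in for the value net on the Nim toy above.

    The table starts out as the bad 'material' proxy, but every training
    target is derived from the terminal win/loss reward of a finished
    self-play game; nothing ever regresses the estimates toward the proxy."""
    value = {n: 0.1 * n for n in range(start + 1)}      # proxy initialization
    leaf_eval = lambda n: value[n]                      # search bootstraps on current estimates
    for _ in range(num_games):
        n, trajectory = start, []
        while n > 0:
            trajectory.append(n)
            if random.random() < eps:                   # a little exploration
                m = random.choice(moves(n))
            else:
                m = best_move(n, depth, leaf_eval)      # shallow search on current values
            n -= m
        # The mover at the last recorded position took the last stone and won:
        z = 1.0
        for n in reversed(trajectory):
            value[n] += lr * (z - value[n])             # regress toward the reward-derived return
            z = -z                                      # flip perspective each ply
    return value

print(self_play_train())  # losing heap sizes (multiples of 3) typically end up negative
```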
* I think I’ve noted this elsewhere, and mentioned my Kelly coinflip trajectories as a nice visualization of how model-based RL will behave, but to repeat: MCTS algorithms in Go/chess were noted for that sort of behavior, especially for sacrificing pieces or territory while they were ahead, in order to ‘lock down’ the game and maximize the probability of victory rather than the margin of victory; and vice versa, taking big risks when they were behind. That is because the tree didn’t back-propagate any rewards on ‘margin’, just the 0/1 reward from victory, and didn’t care about proxy heuristics like ‘pieces captured’ if the tree search found otherwise.
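A toy calculation of that distinction, with numbers invented purely for illustration (not taken from any real engine): under a 0/1 win-loss reward the search prefers the move that locks down a small win, while a hypothetical margin-based reward would prefer the riskier, bigger win.

```python
# Two candidate moves, each a distribution over (probability, final score margin).
outcomes = {
    "safe":   [(0.95, +2), (0.05, -1)],    # almost always wins, but only by a little
    "greedy": [(0.70, +30), (0.30, -5)],   # wins big, but loses much more often
}

def expected(move, reward):
    return sum(p * reward(margin) for p, margin in outcomes[move])

win_loss = lambda margin: 1.0 if margin > 0 else 0.0   # 0/1 reward on victory
by_margin = lambda margin: float(margin)               # hypothetical margin reward

print(max(outcomes, key=lambda m: expected(m, win_loss)))   # safe   (0.95 vs 0.70)
print(max(outcomes, key=lambda m: expected(m, by_margin)))  # greedy (19.5 vs 1.85)
```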