Imagine our stamp collector is trained using meta-learning. 100 stamp collectors are trained in parallel; the inner loop, which uses gradient descent, updates their weights every 10 days. Every 50 days, the outer loop takes the 50 best-performing stamp collectors and copies their weights over to the 50 worst-performing stamp collectors. In doing so, the outer loop selects for non-myopic models that maximize stamps over all days.
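Concretely, something like the following minimal sketch is what I have in mind (the environment, the stand-in `stamps_collected`/`inner_gradient_step` functions, and all the specific numbers are illustrative assumptions, not a real implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

N_COLLECTORS, OBS_DIM = 100, 8
INNER_PERIOD, OUTER_PERIOD, TOTAL_DAYS = 10, 50, 500

# Each collector is just a linear policy here; its "stamps" come from a
# placeholder environment standing in for whatever actually pays out.
weights = rng.normal(size=(N_COLLECTORS, OBS_DIM))

def stamps_collected(w, days):
    """Placeholder environment: total stamps collected over `days` days."""
    obs = rng.normal(size=(days, OBS_DIM))
    return float(np.sum(obs @ w))

def inner_gradient_step(w, lr=0.01):
    """Placeholder 'gradient descent' update (a stand-in for a real policy gradient)."""
    return w - lr * rng.normal(size=w.shape)

scores = np.zeros(N_COLLECTORS)
for day in range(1, TOTAL_DAYS + 1):
    if day % INNER_PERIOD == 0:          # inner loop: per-collector weight updates
        for i in range(N_COLLECTORS):
            scores[i] += stamps_collected(weights[i], INNER_PERIOD)
            weights[i] = inner_gradient_step(weights[i])
    if day % OUTER_PERIOD == 0:          # outer loop: copy winners over losers
        order = np.argsort(scores)       # worst ... best
        worst, best = order[:N_COLLECTORS // 2], order[N_COLLECTORS // 2:]
        weights[worst] = weights[best]
        scores[:] = 0.0
```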
Nit (or possibly major disagreement): I don’t think this training regime gets you a stamp maximizer. I think this regime gets you a behaviors-similar-to-those-that-resulted-in-stamps-in-the-past-exhibiter. These behaviors might be non-myopic behaviors that nevertheless are not “evaluate the expected results of each possible action, and choose the action which yields the highest expected number of stamps”.
Why do you think that? A purely backwards-looking model-free approach will be outperformed and selected against compared to an agent which has been evolved to implement a more model-based approach, which can look forward and plan based on observations to immediately maximize future reward—rather than being forced to wait for rewards/selection to happen and incurring predictable losses before it can finally stop executing behaviors-that-used-to-stamp-maximize-but-now-no-longer-do-so-for-easily-predicted-reasons.
A purely backwards-looking model-free approach will be outperformed and selected against compared to an agent which has been evolved to implement a more model-based approach, which can look forward and plan based on observations
Why do you think that a model-based approach will outperform a model-free approach? We may just be using words differently here: I’m using “model-based” to mean “maintains an explicit model of its environment, its available actions, and the anticipated effects of those actions, and then performs whichever action its world model anticipates would have the best results”, and I’m using “model-free” to describe a policy which can be thought of as a big bag of “if the inputs look like this, express that behavior”, where the specific pattern of things the policy attends to and behaviors it expresses is determined by past reinforcement.[1] So something like AlphaGo as a whole would be considered model-based, but AlphaGo’s policy network, in isolation, would be considered model-free.
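To make the way I’m using the two terms concrete, here is a toy sketch (the `ACTIONS` set, `world_model`, `value`, and `policy_table` are all made-up placeholders):

```python
from typing import Callable, Dict, List

ACTIONS: List[str] = ["buy_stamp", "trade", "wait"]     # hypothetical action set

def model_based_act(state: Dict, world_model: Callable, value: Callable) -> str:
    """'Model-based' as used above: simulate each available action with an explicit
    model of the environment, then take whichever predicted outcome scores best."""
    return max(ACTIONS, key=lambda a: value(world_model(state, a)))

def model_free_act(state: Dict, policy_table: Dict[str, str]) -> str:
    """'Model-free' as used above: a big bag of input-pattern -> behavior mappings
    shaped by past reinforcement, with no explicit rollout of consequences."""
    pattern = "stamp_visible" if state.get("stamp_visible") else "default"
    return policy_table.get(pattern, "wait")
```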
to immediately maximize future reward—rather than being forced to wait for rewards/selection to happen and incurring predictable losses before it can finally stop executing behaviors-that-used-to-stamp-maximize-but-now-no-longer-do-so-for-easily-predicted-reasons.
Reward is not the optimization target. In the training regime described, the outer loop selects for whichever stamp collectors performed best in the training conditions, and thus reinforces policies that led to high reward in the past.
“Evaluate the expected number of collected stamps according to my internal model for each possible action, and then choose the action which my internal model rates highest” might still end up as the highest-performing policy, if the following hold:
The internal model of the consequences of possible actions is highly accurate, to a sufficient degree that the tails do not come apart. If this doesn’t hold, the system will end up taking actions that are rated by its internal evaluator as being much higher value than they in fact end up being.
There exists nothing in the environment which will exploit inaccuracies in the internal value model.
There do not exist easier-to-discover non-consequentialist policies which yield better outcomes given the same amount of compute in the training environment.
I don’t expect all of those to hold in any non-toy domain. Even in toy domains like Go, the “estimations of value are sufficiently close to the actual value” assumption [empirically seems not to hold](https://arxiv.org/abs/2211.00241), and a two-player perfect-information game seems like a best-case scenario for consequentialist agents.
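As a small illustration of why condition #1 bites (purely synthetic numbers, nothing domain-specific): when the internal value estimates carry even modest noise, the action the argmax picks is systematically overestimated, i.e. the tails come apart on exactly the actions the policy ends up taking.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_actions, noise = 2000, 1000, 1.0     # made-up sizes

true_values = rng.normal(size=(n_trials, n_actions))
estimates = true_values + rng.normal(scale=noise, size=true_values.shape)

chosen = estimates.argmax(axis=1)                # take the action the model rates highest
rows = np.arange(n_trials)
print("estimated value of chosen actions:", estimates[rows, chosen].mean())
print("realized value of chosen actions: ", true_values[rows, chosen].mean())
# The first number comes out well above the second: the evaluator systematically
# overrates exactly the actions the argmax policy ends up selecting.
```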
[1] A “model-free” system may contain one or many learned internal structures which resemble its environment. For example, [OthelloGPT contains a learned model of the board state](https://www.neelnanda.io/mechanistic-interpretability/othello), and yet it is considered “model-free”. My working hypothesis is that the people who come up with this terminology are trying to make everything in ML as confusing as possible.
Why do you think that a model-based approach will outperform a model-free approach?
For the reason I just explained. Planning can optimize without actually requiring experiencing all states beforehand. A learned model lets you plan and predict things before they have happened {{citation needed}}. This is also a standard test of model-free vs model-based behavior in both algorithms and animals: model-based learning can update much faster, and being able to update after one reward, or to learn associations with other information (e.g. following a sign), is considered experimental evidence for model-based learning rather than a model-free approach, which must experience many episodes before it can reverse or undo all of the prior learning.
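Here is a toy version of that revaluation test, under obvious simplifying assumptions (two arms, tabular values, a made-up learning rate): the planner flips its choice after a single observation of the changed outcome, while the model-free learner has to grind through repeated updates first.

```python
# Toy reward-revaluation test (illustrative numbers only).
rewards = {"left": 1.0, "right": 0.0}          # the world as it used to be
outcome_model = dict(rewards)                  # the model-based agent's outcome model
q_values = {"left": 1.0, "right": 0.0}         # the model-free agent's cached values
alpha = 0.1                                    # made-up learning rate

rewards = {"left": 0.0, "right": 1.0}          # the world changes

# Model-based: a single observation of the new outcome updates the model,
# and re-planning flips the choice immediately.
outcome_model["left"], outcome_model["right"] = 0.0, 1.0
print("model-based choice:", max(outcome_model, key=outcome_model.get))   # -> right

# Model-free: the stale values persist until enough experienced trials erode them.
trials = 0
while max(q_values, key=q_values.get) == "left":
    for arm in ("left", "right"):
        q_values[arm] += alpha * (rewards[arm] - q_values[arm])
    trials += 1
print("model-free trials needed to switch:", trials)
```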
As TurnTrout has already noted, that does not apply to model-based algorithms, and they ‘do optimize the reward’:

I want to defend “Reward is not the optimization target” a bit, while also mourning its apparent lack of clarity... These algorithms do optimize the reward. My post addresses the model-free policy gradient setting...
Should a model-based algorithm be trainable by the outer loop, it will learn to optimize for the reward. There is no reason you cannot use, say, ES or PBT for hyperparameter optimization of a model-based algorithm like AlphaZero, or even for training the parameters too. They are operating at different levels. If you used those to train an AlphaZero, it doesn’t somehow stop being model-based RL: it continues to do planning over a simulator of Go, expanding the game tree out to terminal nodes with a hardwired reward function determined by whether it won or lost the game, and its optimization target is the reward of winning/losing. That remains true no matter what the outer loop is, whether it was evolution strategies or the actual Bayesian hyperparameter optimization that DM used.
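To make the two levels explicit, a deliberately crude sketch (the toy ‘game’, the depth hyperparameter, and the selection scheme are stand-ins, not how AlphaZero or PBT actually work): the outer loop only selects over a hyperparameter, while whatever survives is still an agent doing explicit search over a simulator whose optimization target is the hardwired terminal reward.

```python
import random

def terminal_reward(state):
    """Hardwired reward: did the toy game end in a 'win'? (stand-in for won/lost in Go)"""
    return 1.0 if sum(state) >= 5 else -1.0

def plan(state, depth):
    """Inner agent: expand the game tree over the simulator down to terminal
    nodes and back up the hardwired reward (a crude stand-in for MCTS)."""
    if depth == 0 or len(state) >= 4:
        return terminal_reward(state), None
    best = (-float("inf"), None)
    for move in (1, 2):                                  # toy action space
        value, _ = plan(state + [move], depth - 1)
        best = max(best, (value, move))
    return best

def episode_return(depth):
    """Play one game by re-planning at every step with the given search depth."""
    state = []
    while len(state) < 4:
        _, move = plan(state, depth)
        state.append(move)
    return terminal_reward(state)

# Outer loop (ES/PBT-flavoured selection over a hyperparameter). Whatever it
# selects, the surviving agent is still a planner over the simulator whose
# optimization target is the hardwired terminal reward.
population = [random.randint(1, 4) for _ in range(8)]    # candidate search depths
for _ in range(5):
    ranked = sorted(population, key=episode_return, reverse=True)
    survivors = ranked[:4]
    population = survivors + [max(1, d + random.choice([-1, 1])) for d in survivors]
print("selected search depth:", max(population, key=episode_return))
```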
As for your 3 conditions: they are either much easier than you make them sound, or equally objections to model-free algorithms. #1 is usually irrelevant because you cannot apply unbounded optimization pressure (like in the T-maze experiments—there’s just nothing you can arbitrarily maximize: you go left, or you go right, end of story), and such overestimates can be useful for exploration. (Nor does model-freeness mean you can’t have inaccurate estimates of either actions or values.) #2 is a non sequitur, because model-freeness doesn’t grant you immunity to game theory either. #3 is simply irrelevant by stipulation if there is any useful planning to be done, and you’re moving the goalposts, as nothing was said about compute limits. (Not that it is self-evidently true that model-free is cheaper, either, particularly in complex environments: policy gradients in particular tend to be extremely expensive to train due to sample-inefficiency and being on-policy, and the model-based Go agents like AlphaZero and MuZero outperform any model-free approaches I am aware of. I don’t know why you think the adversarial KataGo examples are relevant, when model-free approaches tend to be even more vulnerable to adversarial examples; that was the whole point of going from AlphaGo to AlphaZero: they couldn’t beat the delusions with policy gradient approaches. If model-based could be so easily outperformed, why does it ever exist, as you and I do?)
I think that you still haven’t quite grasped what I was saying. “Reward is not the optimization target” totally applies here. (It was the post itself which only analyzed the model-free case, not that the lesson only applies to the model-free case.)
In the partial quote you provided, I was discussing two specific algorithms which are highly dissimilar to those being discussed here. If (as we were discussing) you’re doing MCTS (or “full-blown backwards induction”) on reward for the leaf nodes, the system optimizes the reward. That is—if most of the optimization power comes from explicit search on an explicit reward criterion (as in AIXI), then you’re optimizing for reward. If you’re doing e.g. AlphaZero, that aggregate system isn’t optimizing for reward.
Despite the derision which accompanies your discussion of Reward is not the optimization target, it seems to me that you still do not understand the points I’m trying to communicate. You should be aware that I don’t think you understand my views or that post’s intended lesson. As I offered before, I’d be open to discussing this more at length if you want clarification.
CC @faul_sname