Why do you think that a model-based approach will outperform a model-free approach?
For the reason I just explained. Planning can optimize without actually having to experience all states beforehand. A learned model lets you plan and predict things before they have happened {{citation needed}}. This is also a standard test of model-free vs. model-based behavior in both algorithms and animals: model-based learning can update much faster, and being able to update after a single reward, or to learn associations from other information (e.g. following a sign), is considered experimental evidence for model-based learning, as opposed to a model-free approach, which must experience many episodes before it can reverse or undo all of the prior learning.
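To make that reversal-learning contrast concrete, here is a minimal sketch (my own toy illustration, not code from any of the experiments or papers under discussion; the two-armed task, the "exactly one arm is rewarded" world model, and the epsilon-greedy learner are all simplifying assumptions): a T-maze whose reward is moved from the left arm to the right arm halfway through, with a model-based agent that replans after a single disconfirming observation and a model-free agent that has to grind its cached values back down over many episodes.

```python
# Illustrative sketch only: a two-armed "T-maze" whose reward is moved from
# the left arm to the right arm at the reversal point. The model-based agent
# keeps an explicit (here, trivially simple) model of the task: exactly one
# arm is rewarded, and it believes it knows which. The model-free agent only
# keeps cached action values updated with a small step size.
import random

def reward(arm, rewarded_arm):
    return 1.0 if arm == rewarded_arm else 0.0

def run(episodes=200, reversal=100, alpha=0.1, epsilon=0.1, seed=0):
    random.seed(seed)
    mb_belief = "left"                    # model-based agent's world model
    q = {"left": 0.0, "right": 0.0}       # model-free agent's cached values
    first_mb_switch = first_mf_switch = None

    for t in range(episodes):
        rewarded_arm = "left" if t < reversal else "right"

        # Model-based: act on the model; one disconfirming observation is
        # enough to revise the model, and therefore the next plan.
        mb_choice = mb_belief
        if reward(mb_choice, rewarded_arm) == 0.0:
            mb_belief = "right" if mb_belief == "left" else "left"

        # Model-free: epsilon-greedy over cached values, incremental update.
        if random.random() < epsilon:
            mf_choice = random.choice(["left", "right"])
        else:
            mf_choice = max(q, key=q.get)
        q[mf_choice] += alpha * (reward(mf_choice, rewarded_arm) - q[mf_choice])

        if t >= reversal:
            if first_mb_switch is None and mb_choice == "right":
                first_mb_switch = t - reversal
            if first_mf_switch is None and q["right"] > q["left"]:
                first_mf_switch = t - reversal

    print("episodes after reversal until model-based agent switches:", first_mb_switch)
    print("episodes after reversal until model-free agent prefers right:", first_mf_switch)

run()
```

The point is only the asymmetry in update speed: one surprising outcome is enough to change the model-based agent's plan, while the cached-value agent reverses only after its old preference has been worn away.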
As Turntrout has already noted, that does not apply to model-based algorithms, and they ‘do optimize the reward’:

I want to defend “Reward is not the optimization target” a bit, while also mourning its apparent lack of clarity...These algorithms do optimize the reward. My post addresses the model-free policy gradient setting...
Should a model-based algorithm be trainable by the outer loop, it will learn to optimize for the reward. There is no reason you cannot use, say, ES or PBT for hyperparameter optimization of a model-based algorithm like AlphaZero, or even for training the parameters too. They are operating at different levels. If you used those to train an AlphaZero, it doesn’t somehow stop being model-based RL: it continues to do planning over a simulator of Go, expanding the game tree out to terminal nodes with a hardwired reward function determined by whether it won or lost the game, and its optimization target is the reward of winning/losing. That remains true no matter what the outer loop is, whether it was evolution strategies or the actual Bayesian hyperparameter optimization that DeepMind used.
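Schematically, the two levels look something like the sketch below. This is only an illustration of the structure, under stated stand-ins: a toy Nim simulator in place of Go, plain depth-limited negamax in place of MCTS, and a crude grid search in place of ES/PBT or DeepMind's Bayesian hyperparameter optimization.

```python
# Illustrative sketch only: a toy stand-in for the two levels described above.
# Inner level: a model-based planner that expands the game tree over a
# simulator down to terminal nodes scored by a hardwired win/loss reward.
# Outer level: a hyperparameter search that never touches that reward.
import random

def moves(pile):
    """Toy Nim simulator: take 1 or 2 stones; whoever takes the last stone wins."""
    return [m for m in (1, 2) if m <= pile]

def plan(pile, depth):
    """Depth-limited negamax over the simulator, from the mover's perspective."""
    if pile == 0:
        return None, -1.0  # terminal node: the opponent took the last stone, so the mover lost
    if depth == 0:
        return None, 0.0   # search horizon reached: neutral heuristic value
    best_move, best_value = None, float("-inf")
    for m in moves(pile):
        _, child_value = plan(pile - m, depth - 1)
        value = -child_value  # negamax: the opponent's gain is our loss
        if value > best_value:
            best_move, best_value = m, value
    return best_move, best_value

def win_rate(depth, games=200, start_pile=12, seed=0):
    """Performance of the inner planner at a given search depth vs. a random opponent."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(games):
        pile, planner_to_move = start_pile, True
        while pile > 0:
            m = plan(pile, depth)[0] if planner_to_move else rng.choice(moves(pile))
            pile -= m
            if pile == 0 and planner_to_move:
                wins += 1
            planner_to_move = not planner_to_move
    return wins / games

# Outer loop: a crude stand-in for ES/PBT/Bayesian hyperparameter optimization.
# It only chooses the search depth; how terminal nodes are scored (win/loss)
# is hardwired inside the planner and never changes.
best_depth = max(range(1, 8), key=win_rate)
print("outer loop selected depth:", best_depth, "win rate:", win_rate(best_depth))
```

Swapping the outer loop for evolution strategies, PBT, or Bayesian optimization changes only how the hyperparameter is chosen; the inner plan() call still expands the game tree to terminal nodes scored by the hardwired win/loss reward.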
As for your 3 conditions: they are either much easier than you make them sound or apply equally to model-free algorithms. #1 is usually irrelevant because you cannot apply unbounded optimization pressure (like in the T-maze experiments—there’s just nothing you can arbitrarily maximize, you go left, or you go right, end of story), and such overestimates can be useful for exploration. (Nor does being model-free mean you can’t have inaccurate estimates of either actions or values.) #2 is a non sequitur because model-freeness doesn’t grant you immunity to game theory either. #3 is just irrelevant by stipulation if there is any useful planning to be done, and you’re moving the goalposts, as nothing was said about compute limits. (Not that it is self-evidently true that model-free is cheaper either, particularly in complex environments: policy gradients in particular tend to be extremely expensive to train due to sample inefficiency and being on-policy, and the model-based Go agents like AlphaZero and MuZero outperform any model-free approaches I am aware of. I don’t know why you think the adversarial KataGo examples are relevant, when model-free approaches tend to be even more vulnerable to adversarial examples; that was the whole point of going from AlphaGo to AlphaZero: they couldn’t beat the delusions with policy-gradient approaches. If model-based could be so easily outperformed, why does it ever exist, like you or I do?)
I think that you still haven’t quite grasped what I was saying. “Reward is not the optimization target” totally applies here. (It was the post itself which only analyzed the model-free case, not that the lesson only applies to the model-free case.)
In the partial quote you provided, I was discussing two specific algorithms which are highly dissimilar to those being discussed here. If (as we were discussing) you’re doing MCTS (or “full-blown backwards induction”) on reward for the leaf nodes, the system optimizes the reward. That is—if most of the optimization power comes from explicit search on an explicit reward criterion (as in AIXI), then you’re optimizing for reward. If you’re doing e.g. AlphaZero, that aggregate system isn’t optimizing for reward.
Despite the derision which accompanies your discussion of “Reward is not the optimization target”, it seems to me that you still do not understand the points I’m trying to communicate. You should be aware that I don’t think you understand my views or that post’s intended lesson. As I offered before, I’d be open to discussing this more at length if you want clarification.
CC @faul_sname