I’m really glad you wrote this. I’ve thought for some time that it’s an important distinction, though I think you’ve articulated (at least parts of) it better than my attempts perhaps! I previously described a distinction between deliberation and reaction.
mesaoptimizers that could form across multiple forward passes
Yes, leaving aside really really deep networks and residuals, I think some sort of recurrence/iteration is plausibly needed for meaningful deliberation to occur. Chain of thought is an obvious instantiation (but so is sequential reasoning absent explicit CoT prompting), with MCTS examples (which you also mentioned) being perhaps more central.
I’ll gesture at some pieces of this puzzle which I haven’t got round to writing about properly publicly[1] but where I’d be interested in your thoughts:
Where do highly capable proposals/amortised actions come from?
(handwave) lots of ‘experience’ and ‘good generalisation’?
How do you get good ‘experience’?
Helpful human curates massive dataset for you
Or...? This seems to me to be where active learning and deliberate/creative exploration come in
It’s a Bayes-adaptivity problem, i.e. planning for value-of-information
This is basically what ‘science’ and ‘experimentalism’ are in my ontology
‘Play’ and ‘practice’ are the amortised equivalent (where explorative heuristics are baked in)
animals are evidence that some amortised play heuristics are effective! Even humans only rarely ‘actually do deliberate experimentalism’
but when we do, it’s maybe the source of our massive technological dominance?
Either that or you have something terribly slow like natural selection
How do you get good ‘generalisation’?
(handwave) hierarchical/recomposable abstractions?
This is the most magical part of the picture to me at present
When is deliberation/direct planning tractable?
In any interestingly-large problem, you will never exhaustively evaluate
e.g. maybe no physically realisable computer in our world can ever evaluate all Go strategies, much less evaluate strategies for ‘operate in the world itself’!
What properties of options/proposals lend themselves?
(handwave) ‘Interestingly consequential’ - the differences should actually matter enough to bother computing!
Temporally flexible
The ‘temporal resolution’ of the strategy-value landscape may vary by orders of magnitude
so the temporal resolution of the proposals (or proposal-atoms) should too, on pain of intractability/value-loss/both
Where does strong control/optimisation come from?
Your strategy is inevitably suboptimal
You can have a good shot
but keeping ‘yourself’ around to deliberate again or more generally making there be more similarly-oriented deliberators is a great way to handle this
[1] Maybe the closest is this scrappy comment?
Hi there! Thanks for this comment. Here are my thoughts:
Where do highly capable proposals/amortised actions come from?
(handwave) lots of ‘experience’ and ‘good generalisation’?
Pretty much this. We know empirically that deep learning generalizes pretty well from a lot of data as long as it is reasonably representative. I think this is fundamentally due to the nature of our reality: there are generalizable patterns, which is ultimately due to the sparse underlying causal graph. It is very possible that there are realities where this isn’t true, and in those cases this kind of ‘intelligence’ would not be possible.
Or...? This seems to me to be where active learning and deliberate/creative exploration come in
It’s a Bayes-adaptivity problem, i.e. planning for value-of-information
This is basically what ‘science’ and ‘experimentalism’ are in my ontology
‘Play’ and ‘practice’ are the amortised equivalent (where explorative heuristics are baked in)
Again, I completely agree here. In practice, in large environments it is necessary to explore if you can’t reach all useful states from a random policy. In these cases it is very useful to a) have an explicit world model, so you can learn from sensory information, which is usually much higher bandwidth than reward and generalizes further and in an uncorrelated way, and b) do some kind of active exploration. Exploring so as to maximize info-gain is probably close to optimal, although whether this is actually theoretically optimal is, I think, still an open question. The main issue is that info-gain is hard to compute/approximate tractably, since it requires keeping close track of your uncertainty, and DL models are computationally tractable precisely because they throw away all the uncertainty and only really maintain point predictions.
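To make the tractable case concrete, here is a minimal sketch of choosing actions by expected info-gain in a Beta-Bernoulli bandit, the kind of simple setting where the posterior (and hence the info-gain) is available in closed form. The function names and the additive reward-plus-info-gain scoring are illustrative choices of mine, not anything canonical.

```python
import numpy as np
from scipy.stats import beta

def expected_info_gain(a, b):
    """Expected reduction in posterior entropy from one pull of an arm whose
    Bernoulli success probability has a Beta(a, b) posterior."""
    p_success = a / (a + b)                      # posterior predictive P(reward = 1)
    h_now = beta.entropy(a, b)                   # current (differential) entropy
    h_after = (p_success * beta.entropy(a + 1, b)
               + (1 - p_success) * beta.entropy(a, b + 1))
    return h_now - h_after

def choose_arm(posteriors, reward_weight=1.0, info_weight=1.0):
    """Score each arm by expected reward plus an info-gain bonus (illustrative trade-off)."""
    scores = [reward_weight * a / (a + b) + info_weight * expected_info_gain(a, b)
              for a, b in posteriors]
    return int(np.argmax(scores))

# Two arms with equal expected reward: the well-characterised one loses to
# the uncertain one purely on information gain.
posteriors = [(50.0, 50.0),   # ~0.5 success rate, little left to learn
              (1.0, 1.0)]     # uniform prior, lots left to learn
print(choose_arm(posteriors))  # -> 1
```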
animals are evidence that some amortised play heuristics are effective! Even humans only rarely ‘actually do deliberate experimentalism’
but when we do, it’s maybe the source of our massive technological dominance?
I don’t know to what extent there are ‘play heuristics’ at a behavioural level vs. some kind of intrinsic drive for novelty / information gain, but yes, having these drives ‘added to your reward function’ is generally useful in RL settings, and we know this happens in the brain as well: there are dopamine neurons responsive to proxies of information gain (and their response is exactly equal to information gain in simple bandit-like settings where it is tractable).
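For illustration, here is one very minimal way such a drive gets ‘added to the reward function’ in practice: a count-based novelty bonus as a cheap proxy for information gain. The env.step/env.reset interface and the bonus scale are placeholder assumptions of mine, not anything from the post.

```python
from collections import defaultdict
import math

class NoveltyBonusWrapper:
    """Adds an intrinsic bonus bonus_scale / sqrt(N(s)) to the extrinsic reward,
    a cheap count-based proxy for information gain (illustrative sketch only)."""

    def __init__(self, env, bonus_scale=0.1):
        self.env = env
        self.bonus_scale = bonus_scale
        self.visit_counts = defaultdict(int)

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, extrinsic_reward, done, info = self.env.step(action)
        key = tuple(obs) if hasattr(obs, "__iter__") else obs
        self.visit_counts[key] += 1
        intrinsic = self.bonus_scale / math.sqrt(self.visit_counts[key])
        return obs, extrinsic_reward + intrinsic, done, info
```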
When is deliberation/direct planning tractable?
In any interestingly-large problem, you will never exhaustively evaluate
e.g. maybe no physically realisable computer in our world can ever evaluate all Go strategies, much less evaluate strategies for ‘operate in the world itself’!
What properties of options/proposals lend themselves?
(handwave) ‘Interestingly consequential’ - the differences should actually matter enough to bother computing!
Temporally flexible
The ‘temporal resolution’ of the strategy-value landscape may vary by orders of magnitude
so the temporal resolution of the proposals (or proposal-atoms) should too, on pain of intractability/value-loss/both
So there are a number of circumstances where direct planning is valuable and useful. I agree with your conditions, especially the correct action step-size, along with discrete actions and known, not-too-stochastic dynamics. Another useful condition is when it’s easy to evaluate branches of the tree without having gone all the way down to the leaves: in games like Chess/Go it’s often very easy to know that some move tree is intrinsically doomed without having explored all of it. This is a kind of convexity of the state space (not literally mathematically, but intuitively) which makes optimization much easier. Similarly, when good proposals can be made thanks to linearity / generalizability in the action space, it is easy to prune actions and trees.
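As a toy example of discarding a doomed subtree without searching it to the leaves, here is plain alpha-beta pruning; the evaluate and children callbacks are left abstract, and nothing here is specific to the post.

```python
def alphabeta(state, depth, alpha, beta, maximizing, evaluate, children):
    """Minimax with alpha-beta pruning: a subtree is discarded as soon as a
    bound shows it cannot affect the final choice, i.e. the branch is
    'intrinsically doomed' without being searched to the leaves."""
    kids = children(state)
    if depth == 0 or not kids:
        return evaluate(state)
    if maximizing:
        value = float("-inf")
        for child in kids:
            value = max(value, alphabeta(child, depth - 1, alpha, beta,
                                         False, evaluate, children))
            alpha = max(alpha, value)
            if alpha >= beta:      # remaining siblings cannot matter: prune
                break
        return value
    else:
        value = float("inf")
        for child in kids:
            value = min(value, alphabeta(child, depth - 1, alpha, beta,
                                         True, evaluate, children))
            beta = min(beta, value)
            if beta <= alpha:      # symmetric prune for the minimizing player
                break
        return value
```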
Where does strong control/optimisation come from?
Strong control comes from where strong learning in general comes from: lots of compute and data, and for planning, especially compute. The optimal trade-off between amortized and direct optimization given a fixed compute budget is super interesting, and I don’t think we have any good models of this yet.
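As a very rough sketch of what that trade-off looks like operationally (function names made up, purely illustrative): with zero search budget you just trust the amortized policy; any extra budget is spent directly evaluating consequences.

```python
def act(state, policy_net, value_net, children, search_budget):
    """Illustrative interpolation between amortized and direct optimization."""
    if search_budget == 0:
        return policy_net(state)                 # pure amortized: one forward pass
    best_action, best_value = None, float("-inf")
    for action, next_state in children(state)[:search_budget]:
        value = value_net(next_state)            # direct: spend compute on consequences
        if value > best_value:
            best_action, best_value = action, value
    return best_action
```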
Another thing that I think is fairly underestimated among people on LW compared to people doing deep RL is that open-loop planning is actually very hard and bad at dealing with long time horizons. This is basically due to stochasticity and chaos: future prediction is hard. Small mistakes in either the model or the actions propagate very rapidly into massive uncertainty about the future, so that your optimal posterior rapidly dwindles to a maximum-entropy distribution. The key thing in long-term planning is really adaptability and closed-loop control, i.e. seeing feedback and adjusting your actions in response to it. This is how almost all practical control systems actually work, and in practice, in deep RL with planning, everybody actually uses MPC and so replans at every step.
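To illustrate the closed-loop point, here is a minimal sketch of the standard MPC pattern: plan a short open-loop sequence with a (possibly wrong) learned model, execute only the first action, observe the real outcome, and replan. dynamics_model, reward_fn and the env interface are placeholders, not anyone’s actual implementation.

```python
import numpy as np

def plan_random_shooting(state, dynamics_model, reward_fn, action_dim,
                         horizon=10, n_candidates=500, rng=None):
    """Open-loop planner: sample candidate action sequences, roll them out
    through the learned model, and return the best first action."""
    rng = rng or np.random.default_rng()
    candidates = rng.uniform(-1, 1, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    for i, seq in enumerate(candidates):
        s = state
        for a in seq:
            s = dynamics_model(s, a)             # model error compounds with horizon
            returns[i] += reward_fn(s, a)
    return candidates[np.argmax(returns)][0]

def run_mpc(env, dynamics_model, reward_fn, action_dim, n_steps=100):
    """Closed-loop control: replan from the observed state at every step,
    so model errors and stochasticity get corrected by feedback."""
    state = env.reset()
    for _ in range(n_steps):
        action = plan_random_shooting(state, dynamics_model, reward_fn, action_dim)
        state, reward, done, _ = env.step(action)  # execute only the first action
        if done:
            break
```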