Yes, my instant thought too was “this sounds like a variant on a successor function”.
Of course, the real answer is that if you are worried about the slowness of bootstrapping back value estimates or short eligibility traces, this mostly just shows the fundamental problem with model-free RL and why you want to use models: models don’t need any environmental transitions to solve the use case presented:
But what if it learns of a path E → B? Or a shortcut A → C? Or a path F → G that gives a huge amount of reward? Because these techniques work by chaining the reward backwards step-by-step, it seems like this would be hard to learn well. Like the Bellman equation will still be approximately satisfied, for instance.
If the MBRL agent has learned a good reward-sensitive model of the environmental dynamics, then it will have already figured out E->B and so on, or could do so offline by planning. If it has not, because it is still learning the environment model, it would have a prior probability over the possibility that E->B gives a huge amount of reward; it can calculate a VoI (value of information) and target E->B in the next episode for exploration, and on observing the huge reward, update the model, replan, and so immediately begin taking E->B actions within that episode and all future episodes. It also benefits from generalization, because it can update the model everywhere, for all E->B-like paths and all similar paths (which might now suddenly have much higher VoI and be worth targeting for further exploration), rather than simply those specific states’ value-estimates, and so on.
(And this is one of the justifications for successor representations: it pulls model-free agents a bit towards model-based-like behavior.)
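To make the replanning point concrete, here is a minimal tabular sketch (a hypothetical toy graph and rewards, not anything from the exchange above): the agent plans over its learned model with plain value iteration, and once the model is updated with the newly-observed E->B reward, a single offline replan shifts the greedy policy all the way back at A, with no further environment transitions and no step-by-step bootstrapping of value estimates.

```python
import numpy as np

# Hypothetical six-state toy graph; model[s] = learned list of (action, next_state, reward).
states = ["A", "B", "C", "D", "E", "F"]
idx = {s: i for i, s in enumerate(states)}
gamma = 0.9
model = {
    "A": [("toB", "B", 0.0), ("toD", "D", 0.0)],
    "B": [("toC", "C", 1.0)],
    "C": [("stay", "C", 0.0)],
    "D": [("toE", "E", 0.0)],
    "E": [("toF", "F", 0.0)],
    "F": [("stay", "F", 0.0)],
}

def plan(model, sweeps=200):
    """Plain value iteration over the learned model; returns state-values and a greedy policy."""
    V = np.zeros(len(states))
    for _ in range(sweeps):
        for s, options in model.items():
            V[idx[s]] = max(r + gamma * V[idx[s2]] for _, s2, r in options)
    policy = {s: max(options, key=lambda o: o[2] + gamma * V[idx[o[1]]])[0]
              for s, options in model.items()}
    return V, policy

print("policy before:", plan(model)[1])

# The agent tries E->B once, observes a huge reward, and updates the *model*...
model["E"].append(("toB", "B", 10.0))

# ...then replans offline: the greedy policy at A now routes through D and E to exploit E->B.
print("policy after: ", plan(model)[1])
```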
With MBRL, don’t you end up with the same problem, but when planning in the model instead? E.g. DreamerV3 still learns a value function in its actor-critic reinforcement learning that occurs “in the model”. This value function still needs to chain the estimates backwards.
It’s the ‘same problem’, maybe, but it’s a lot easier to solve when you have an explicit model! You have something you can plan over, don’t need to interact with an environment out in the real world, and can do things like tree search or differentiating through the environmental dynamics model to do gradient ascent on the action-inputs to maximize the reward (while holding the model fixed). Same as training the neural network, once it’s differentiable—backprop can ‘chain the estimates backwards’ so efficiently you barely even think about it anymore. (It just holds the input and output fixed while updating the model.) Or distilling a tree search into a NN—the tree search needed to do backwards induction of updated estimates from all the terminal nodes all the way up to the root where the next action is chosen, but that’s very fast and explicit and can be distilled down into a NN forward pass.
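As a concrete illustration of the ‘differentiate through the dynamics model’ option, here is a minimal sketch under toy assumptions (a hypothetical linear dynamics model and quadratic reward, nothing from DreamerV3 itself): the model parameters are held fixed while reverse-mode gradients of the total reward are pushed back through a 16-step rollout onto the action inputs, which are then updated by plain gradient ascent.

```python
# Toy planning by differentiating through a fixed dynamics model: x' = A x + B u,
# reward = -||x - goal||^2 - c*||u||^2.  Gradients flow back through the rollout
# onto the actions; the model itself is never updated here.
import numpy as np

rng = np.random.default_rng(0)
T, state_dim, act_dim = 16, 4, 2                 # a 16-step imagined rollout
A = np.eye(state_dim) + 0.05 * rng.standard_normal((state_dim, state_dim))
B = 0.1 * rng.standard_normal((state_dim, act_dim))
goal = np.ones(state_dim)
act_cost = 0.01

def rollout(x0, actions):
    """Unroll the fixed model over the action sequence; return total reward and states."""
    xs, x, total = [x0], x0, 0.0
    for u in actions:
        x = A @ x + B @ u
        total += -np.sum((x - goal) ** 2) - act_cost * np.sum(u ** 2)
        xs.append(x)
    return total, xs

def reward_grad(x0, actions):
    """Hand-rolled reverse-mode gradient of the total reward w.r.t. each action."""
    _, xs = rollout(x0, actions)
    grads = np.zeros_like(actions)
    dL_dx = np.zeros(state_dim)                  # gradient flowing backwards through states
    for t in reversed(range(len(actions))):
        dL_dx = dL_dx - 2.0 * (xs[t + 1] - goal) # reward at step t depends on x_{t+1}
        grads[t] = B.T @ dL_dx - 2.0 * act_cost * actions[t]
        dL_dx = A.T @ dL_dx                      # chain back through the dynamics
    return grads

x0 = np.zeros(state_dim)
actions = np.zeros((T, act_dim))
for _ in range(200):                             # gradient ascent on the action inputs only
    actions += 1e-2 * reward_grad(x0, actions)
print("planned return:", rollout(x0, actions)[0])
```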
And aside from being able to update within-episode or take actions entirely unobserved before, when you do MBRL, you get to do it at arbitrary scale (thus potentially extremely little wallclock time, like an AlphaZero), offline (no environment interactions), potentially highly sample-efficient (if the dataset is adequate or one can do optimal experimentation to acquire the most useful data, like PILCO), with transfer learning to all other problems in related environments (because value functions are mostly worthless outside the exact setting, which is why model-free DRL agents are notorious for overfitting and having zero transfer), easily eliciting meta-learning and zero-shot capabilities, etc.*
* Why yes, all of this does sound a lot like how you train an LLM today and what it is able to do; how curious.
Same as training the neural network, once it’s differentiable—backprop can ‘chain the estimates backwards’ so efficiently you barely even think about it anymore.
I don’t think this is true in general. Unrolling an episode for more steps takes more resources, and the later steps in the episode become more chaotic. DreamerV3 only unrolls for 16 steps.
Or distilling a tree search into a NN—the tree search needed to do backwards induction of updated estimates from all the terminal nodes all the way up to the root where the next action is chosen, but that’s very fast and explicit and can be distilled down into a NN forward pass.
But when you distill a tree search, you basically learn value estimates, i.e. something similar to a Q function (realistically, V function). Thus, here you also have an opportunity to bubble up some additional information.
And aside from being able to update within-episode or take actions entirely unobserved before, when you do MBRL, you get to do it at arbitrary scale (thus potentially extremely little wallclock time, like an AlphaZero), offline (no environment interactions), potentially highly sample-efficient (if the dataset is adequate or one can do optimal experimentation to acquire the most useful data, like PILCO), with transfer learning to all other problems in related environments (because value functions are mostly worthless outside the exact setting, which is why model-free DRL agents are notorious for overfitting and having zero transfer), easily eliciting meta-learning and zero-shot capabilities, etc.*
I’m not doubting the relevance of MBRL, I expect that to take off too. What I’m doubting is that future agents will be controlled using scalar utilities/rewards/etc. rather than something more nuanced.
I don’t think this is true in general. Unrolling an episode for more steps takes more resources, and the later steps in the episode become more chaotic.
Those are two different things. The unrolling of the episode is still very cheap. It’s a lot cheaper to unroll a DreamerV3 for 16 steps than it is to go out into the world and run a robot in a real-world task for 16 steps and try to get the NN to propagate updated value estimates the entire way… (Given how small a Dreamer is, it may even be computationally cheaper to do some gradient ascent on it than it is to run whatever simulated environment you might be using! Especially given that simulated environments will increasingly be large generative models, which incorporate lots of reward-irrelevant stuff.) The usefulness of the planning is a different question, and any limits there probably apply to other planning methods in that environment too: if the environment is difficult, a tree search with a very small planning budget like just a few rollouts is probably going to have quite noisy choices/estimates too. No free lunches.
But when you distill a tree search, you basically learn value estimates
This is again the same answer as for ‘the same problem’: yes, you are learning value estimates, but you are doing so better than the alternatives, and better is better. The AlphaGo network loses to the AlphaZero network, and the latter, in addition to just being quantitatively much better, also seems to have qualitatively different behavior, like fixing the ‘delusions’ (cf. AlphaStar).
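For concreteness, here is a minimal sketch of the distillation targets (a toy minimax backup over a hand-written tree, not AlphaZero’s actual MCTS averaging): backward induction propagates values from the terminal nodes up to the root where the move is chosen, and every visited node ends up with an updated value that can serve as a regression target for a network’s value head.

```python
# Toy backward induction over a tiny hypothetical game tree; the (node, value) pairs
# it produces are the kind of targets you would distill into a value network.
def backup(node, maximizing=True):
    """Return the backed-up value of `node` plus (node_id, value) distillation pairs."""
    node_id, children, terminal_value = node
    if not children:                          # leaf: value comes from the terminal outcome
        return terminal_value, [(node_id, terminal_value)]
    pairs, child_values = [], []
    for child in children:
        v, p = backup(child, not maximizing)  # players alternate down the tree
        child_values.append(v)
        pairs.extend(p)
    value = max(child_values) if maximizing else min(child_values)
    return value, pairs + [(node_id, value)]

# Nodes are (id, children, terminal_value); a tiny hand-written example tree.
tree = ("root", [
    ("a", [("a1", [], +1.0), ("a2", [], +0.5)], None),
    ("b", [("b1", [], -1.0), ("b2", [], +2.0)], None),
], None)

root_value, targets = backup(tree)
print("value at the root (where the move is chosen):", root_value)
print("distillation targets:", targets)
```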
What I’m doubting is that future agents will be controlled using scalar utilities/rewards/etc. rather than something more nuanced.
They won’t be controlled by something as simple as a single fixed reward function, I think we can agree on that. But I don’t find successor-function-like representations to be too promising as a direction for how to generalize agents, or, in fact, any attempt to fancily hand-engineer these sorts of approaches into DRL agents.
These things should be learned. For example, leaning into Decision Transformers and using a lot more conditioning through metadata and relying on meta-learning seems much more promising. (When it comes to generative models, if conditioning isn’t solving your problems, you’re just not using enough conditioning or generative modeling.) A prompt can describe agents and reward functions and the base agent executes that, and whatever is useful about successor-like representations just emerges automatically internally as the solution to the overall family of tasks in turning histories into actions.
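As a sketch of what that conditioning means in the simplest Decision-Transformer-like case (toy code; return-to-go stands in for richer prompt metadata, and the transformer itself is elided): logged trajectories are rewritten as (return-to-go, state, action) triples, so the same sequence model can later be prompted with a desired return, or other conditioning information, and asked to emit the matching actions.

```python
# Toy return-conditioning in the Decision Transformer style.
import numpy as np

def to_conditioned_sequence(states, actions, rewards, gamma=1.0):
    """Rewrite a logged episode as (return-to-go, state, action) triples."""
    T = len(rewards)
    rtg, running = np.zeros(T), 0.0
    for t in reversed(range(T)):              # return-to-go = reward still ahead at step t
        running = rewards[t] + gamma * running
        rtg[t] = running
    return [(rtg[t], states[t], actions[t]) for t in range(T)]

# Hypothetical logged episode: 4 steps, reward only at the end.
seq = to_conditioned_sequence(states=[0, 1, 2, 3], actions=[1, 1, 0, 1],
                              rewards=[0.0, 0.0, 0.0, 1.0])
for rtg, s, a in seq:
    print(f"condition on return-to-go {rtg:.1f} at state {s} -> target action {a}")
```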
The unrolling of the episode is still very cheap. It’s a lot cheaper to unroll a DreamerV3 for 16 steps than it is to go out into the world and run a robot in a real-world task for 16 steps and try to get the NN to propagate updated value estimates the entire way...
But I’m not advocating against MBRL, so this isn’t the relevant counterfactual. A pure MBRL approach would update the value function to match the rollouts, but DreamerV3 also uses the value function in a Bellman-like manner to impute the future reward at the end of a rollout. This allows it to plan further than the 16 steps it rolls out, but it would be computationally intractable to roll out as far as this ends up planning.
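For reference, here is a minimal sketch of that bootstrapping step (the general λ-return recipe; DreamerV3’s exact hyperparameters and implementation details differ): rewards along the short imagined rollout are mixed with the critic’s value estimates, and the value at the final imagined state imputes everything past the horizon.

```python
# Toy lambda-returns over a short imagined rollout, bootstrapped by a critic at the horizon.
import numpy as np

def lambda_returns(rewards, values, gamma=0.997, lam=0.95):
    """rewards[t]: reward after the imagined action at state s_t; values[t]: critic v(s_t),
    with len(values) == len(rewards) + 1 so values[-1] bootstraps past the rollout."""
    H = len(rewards)
    G = values[H]                             # everything beyond the horizon is imputed
    returns = np.zeros(H)
    for t in reversed(range(H)):
        G = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * G)
        returns[t] = G
    return returns

H = 16                                        # the 16-step imagined rollout discussed above
rewards = np.zeros(H); rewards[-1] = 1.0      # toy: imagined reward only at the last step
values = np.full(H + 1, 0.3)                  # stand-in critic outputs along the rollout
print(lambda_returns(rewards, values))
```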
if the environment is difficult, a tree search with a very small planning budget like just a few rollouts is probably going to have quite noisy choices/estimates too. No free lunches.
It’s possible for there to be a kind of chaos where the analytic gradients blow up yet discrete differences have predictable effects. Bifurcations, etc.
They won’t be controlled by something as simple as a single fixed reward function, I think we can agree on that. But I don’t find successor-function-like representations to be too promising as a direction for how to generalize agents, or, in fact, any attempt to fancily hand-engineer these sorts of approaches into DRL agents.
These things should be learned. For example, leaning into Decision Transformers and using a lot more conditioning through metadata and relying on meta-learning seems much more promising. (When it comes to generative models, if conditioning isn’t solving your problems, you’re just not using enough conditioning or generative modeling.) A prompt can describe agents and reward functions and the base agent executes that, and whatever is useful about successor-like representations just emerges automatically internally as the solution to the overall family of tasks in turning histories into actions.
I agree with things needing to be learned; using the actual states themselves was more of a toy model (because we have mathematical models for MDPs but we don’t have mathematical models for “capabilities researchers will find something that can be Learned”), and I’d expect something else to happen. If I were to run off to implement this now, I’d be using learned embeddings of states, rather than the states themselves. Though of course even learned embeddings have their problems.
The trouble with just saying “let’s use decision transformers” is twofold. First, we still need to actually define the feedback system. One option is to just define reward as the feedback, but as you mention, that’s not nuanced enough. You could use some system that’s trained to mimic human labels as the ground truth, but this kind of system has flaws for standard alignment reasons.
It seems to me that capabilities researchers are eventually going to find some clever feedback system to use. It will to a great extent be learned, but they’re going to need to figure out the learning method too.