Thanks for the thoughtful response (and willingness to bet)!
Do you think that your results would replicate if applied to DreamerV3 trained on the same task? That’s the kind of thing I had in mind as a “stronger” model, not just more parameters.
Dreamer comprises three networks (a world model, a critic, and an actor), but it is still a general reinforcement learning algorithm, so I think this is a relatively straightforward / fair comparison.
My prediction is no: specifically, given the same training environment, Dreamer would converge more quickly to getting the cheese every time in training, and then find the cheese more often than 69% of the time in test (maybe much more, or always).
I’d be willing to bet on this, and / or put up a small bounty for you or someone else to actually run this experiment, if you have a different prediction.
(For anyone who wants to try this: the code for dreamer and the cheese task are available, and the cheese task is a gym, so it should be relatively straightforward to run this experiment.)
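A minimal sketch of the evaluation half of that experiment is below, assuming the maze task is exposed through procgen's standard gym id (the cheese-in-the-corner variant from the original post may register under a different id), and with `trained_policy` standing in for whatever the DreamerV3 reference code produces after training:

```python
# Hypothetical evaluation loop for the proposed experiment.  The env id and the
# `trained_policy` callable are assumptions, not the original post's exact setup.
import gym

def cheese_rate(trained_policy, num_episodes=1000):
    """Estimate p(cheese acquired) on freshly sampled test mazes."""
    env = gym.make(
        "procgen:procgen-maze-v0",
        num_levels=0,              # 0 = unlimited levels, i.e. held-out mazes
        distribution_mode="easy",
    )
    successes = 0
    for _ in range(num_episodes):
        obs = env.reset()
        done = False
        episode_return = 0.0
        while not done:
            action = trained_policy(obs)            # hypothetical callable
            obs, reward, done, info = env.step(action)
            episode_return += reward
        # In the maze task the only reward comes from reaching the cheese,
        # so a positive return marks a successful episode.
        successes += episode_return > 0
    return successes / num_episodes

# Sanity check with a trivial stand-in policy:
# print(cheese_rate(lambda obs: 0))
```

The training side would just point the DreamerV3 reference training script at the same environment, which I haven't spelled out here.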
OTOH, if you agree with my prediction, I’d be interested in hearing why you think this isn’t a problem for shard theory. In general, I expect that DreamerV(N+1) and future SotA RL algorithms will look even more like pure reward-function maximizers, given even less (and less general) training data.
Do you think that your results would replicate if applied to DreamerV3 trained on the same task?
Which results? Without having read more than a few figures from the paper:
I expect that DreamerV3 would be more sample-efficient and so converge more quickly to getting the cheese every time during training.
I expect that DreamerV3 would still train policy nets which don’t get the cheese all the time in test, although I wouldn’t be surprised if it got it more than 69% of the time. I’d be somewhat surprised by >90%, and very surprised by >99%.
I think that, given the same architecture (deep conv net) and hyperparameter settings, probably the cheese vector replicates.
I think there’s also a good chance that the retargetability replicates (although with different channel numbers, of course; there was no grand reason why channel 55 was a cheese-tracking channel). See the sketch below for what I mean by these two techniques.
Again, IDK this particular alg; I’m just imagining good model-based RL + exploration → a policy net. (If DreamerV3 does planning using its WM at inference time, that seems like a more substantially different story.)
If you think we disagree enough to have a bet here, lmk and I’ll read the alg more next week to give you some odds.
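To pin down what “the cheese vector and retargetability replicate” would mean operationally, here’s a minimal PyTorch-style sketch. The `policy`, `layer`, and observation tensors are placeholders, and this is not the original post’s implementation; it’s just the activation-difference-and-hook pattern:

```python
# Sketch of the cheese-vector / retargeting pattern, with placeholder names.
import torch

def get_activations(policy, obs, layer):
    """Run a forward pass and record the output of one module."""
    acts = {}

    def save(module, inputs, output):
        acts["value"] = output.detach()

    handle = layer.register_forward_hook(save)
    with torch.no_grad():
        policy(obs)
    handle.remove()
    return acts["value"]

def cheese_vector(policy, obs_with_cheese, obs_without_cheese, layer):
    """Difference of activations on the same maze with and without the cheese."""
    return (get_activations(policy, obs_with_cheese, layer)
            - get_activations(policy, obs_without_cheese, layer))

def subtract_cheese_vector(layer, vec):
    """Patch the layer so later forward passes have the vector subtracted;
    in the original post this makes the agent act as if the cheese were gone.
    Returns the hook handle so the edit can be undone."""
    return layer.register_forward_hook(lambda module, inputs, output: output - vec)

def retarget(layer, channel, new_activation):
    """Overwrite one channel of the layer's output (channel 55 in the original
    post), which moves where the agent 'thinks' the cheese is."""
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, channel] = new_activation
        return patched
    return layer.register_forward_hook(hook)
```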
That’s the kind of thing I had in mind as a “stronger” model, not just more parameters.
...
In general, I expect that DreamerV(N+1) and future SotA RL algorithms will look even more like pure reward-function maximizers, given even less (and less general) training data.
The most capable systems today (LLMs) don’t rely on super fancy RL algs—they often use PPO or some variant—and they get stronger in large part by getting more data and parameters.
However, it’s still worthwhile to understand how different training processes produce different kinds of behavior. So I think the experiment you propose is interesting, but I think shard theory-for-LLMs ultimately doesn’t gain or lose a ton either way.
Again, IDK this particular alg; I’m just imagining good model-based RL + exploration → a policy net. (If DreamerV3 does planning using its WM at inference time, that seems like a more substantially different story.)
The world model is a recurrent state-space model, and the actor takes the world model’s latent state as input. But there’s no tree search or other hand-coded exploration going on during inference.
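To make that data flow concrete, here’s a toy sketch of the shape of the thing. It is not DreamerV3’s actual architecture (which uses categorical latents, a CNN encoder, and many other details); it’s just an illustration that the actor is a feed-forward net on the model’s latent state:

```python
# Toy recurrent state-space model + actor, illustrating the inference-time data
# flow only: no search or planning, the actor just reads the latent state.
import torch
import torch.nn as nn

class ToyRSSM(nn.Module):
    def __init__(self, embed_dim, act_dim, deter=256, stoch=32):
        super().__init__()
        self.cell = nn.GRUCell(stoch + act_dim, deter)       # deterministic path
        self.post = nn.Linear(deter + embed_dim, 2 * stoch)  # posterior over stochastic state

    def step(self, prev_stoch, prev_action, deter_state, obs_embed):
        # obs_embed would come from a CNN encoder applied to the raw frame.
        deter_state = self.cell(torch.cat([prev_stoch, prev_action], -1), deter_state)
        mean, log_std = self.post(torch.cat([deter_state, obs_embed], -1)).chunk(2, -1)
        stoch_state = mean + log_std.exp() * torch.randn_like(mean)  # Gaussian here for simplicity
        return stoch_state, deter_state

class ToyActor(nn.Module):
    def __init__(self, act_dim, deter=256, stoch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(deter + stoch, 256), nn.ELU(), nn.Linear(256, act_dim)
        )

    def forward(self, stoch_state, deter_state):
        # The actor only ever sees the world model's latent state, never the raw frame.
        return torch.distributions.Categorical(
            logits=self.net(torch.cat([stoch_state, deter_state], -1))
        )
```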
I expect that DreamerV3 would still train policy nets which don’t get the cheese all the time in test, although I wouldn’t be surprised if it got it more than 69% of the time. I’d be somewhat surprised by >90%, and very surprised by >99%.
This is the question I am most interested in. I’d bet at even odds that Dreamer-XL or even Dreamer-medium would get the cheese >90% of the time, and maybe offer 1:4 odds (20% implied probability) on >99%.
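(Spelling out the odds arithmetic behind that 20%: at 1:4 I risk 1 unit to win 4, so the break-even probability satisfies $4p - (1-p) = 0$, i.e. $p = \frac{1}{1+4} = 0.20$.)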
On other results:
Not sure if the recurrence makes some of the methods and results in the original post inapplicable or incomparable. But I do expect you can find cheese vectors and do analogous things with retargetability, perhaps by modifying the latent state of the world model, or by applying the techniques in the original post to the actor model.
I expect that many of the behavioral statistics detailed in this post mostly don’t replicate, primarily because p(cheese acquired) goes up dramatically. For episodes where the agent doesn’t get the cheese (if there are any), I’d be curious what they look like. I don’t have strong predictions here, but I wouldn’t be surprised if they look qualitatively different and are not well predicted by the three features here. I think some of the most interesting comparisons would be between mazes where both agents fail to get the cheese—do they end up in the same place, by the same path, for example?
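For the failure-episode comparison, something like the following sketch is what I have in mind: roll both trained policies out on the same seeded mazes, keep the seeds where both miss the cheese, and compare where each agent ended up. `run_episode` returning the final position is an assumption; actually reading the mouse’s coordinates out of the environment state needs the kind of tooling the original post used, which I’m not reproducing here.

```python
# Hypothetical comparison of failure episodes between two trained policies on
# identical seeded mazes.  Env id, policies, and position extraction are
# placeholders, not the original post's setup.
import gym

def run_episode(policy, seed, max_steps=500):
    """Roll one policy out on the single maze determined by `seed`."""
    env = gym.make("procgen:procgen-maze-v0", num_levels=1, start_level=seed)
    obs = env.reset()
    got_cheese, final_position = False, None
    for _ in range(max_steps):
        obs, reward, done, info = env.step(policy(obs))
        if done:
            got_cheese = reward > 0
            break
    # Placeholder: the real comparison would read the mouse's (x, y) out of the
    # environment state here instead of leaving final_position as None.
    return got_cheese, final_position

def shared_failures(policy_a, policy_b, seeds):
    """Seeds where neither agent got the cheese, with each agent's end point."""
    results = []
    for seed in seeds:
        success_a, pos_a = run_episode(policy_a, seed)
        success_b, pos_b = run_episode(policy_b, seed)
        if not (success_a or success_b):
            results.append((seed, pos_a, pos_b))
    return results
```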