Do you think that a larger network, a different architecture, and sufficient RL would be capable of learning an explicit representation of a more “consequentialist” algorithm for maze-solving, like Dijkstra’s algorithm or A*?
Or do you think that this is ruled out, absent a radically different training process or system architecture?
I’m not sure about the answers to these questions, but I think that if you could get a big enough RL-trained network to learn and use an explicit path-finding algorithm, that network would be better at getting the cheese, possibly exhibiting a discontinuous performance jump and change in behavior.
Or, put another way: this is interesting interpretability work, but the model itself seems too weak to draw many conclusions about whether the decision-making processes of smarter agents will be more consequentialist (“utility-theoretic?” not sure if we’re talking about precisely the same thing here) or more shard-like.
My own prediction is that a consequentialist model of the system as a cheese-finder becomes a simpler and more predictively-accurate model of behavior than shard theory as performance improves. I’ll make the stronger but less concrete prediction that the consequentialist view begins to outperform the shard view in predictive accuracy at or even below human-level capabilities (in general, not just for maze solving).
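For concreteness, an explicit path-finding algorithm of the kind mentioned above might look like the following minimal sketch of Dijkstra's algorithm on a grid. The grid encoding (`'#'` for walls), the function name `shortest_path_length`, and the example maze are assumptions made for illustration; none of it is taken from the trained network or from the actual maze environment.

```python
import heapq


def shortest_path_length(maze, start, goal):
    """Dijkstra's algorithm on a grid maze with unit step costs.

    `maze` is a list of equal-length strings; '#' marks a wall and any
    other character is an open square (hypothetical encoding).
    Returns the number of steps from start to goal, or None if the goal
    cannot be reached.
    """
    rows, cols = len(maze), len(maze[0])
    best = {start: 0}          # best known distance to each square
    frontier = [(0, start)]    # min-heap of (distance, square)
    while frontier:
        dist, (r, c) = heapq.heappop(frontier)
        if (r, c) == goal:
            return dist
        if dist > best[(r, c)]:
            continue  # stale heap entry; a shorter route was already found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and maze[nr][nc] != "#":
                nd = dist + 1
                if nd < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = nd
                    heapq.heappush(frontier, (nd, (nr, nc)))
    return None


# Illustrative maze: mouse at the top-left, "cheese" at the top-right.
MAZE = [
    "..#.",
    ".##.",
    "....",
]
steps = shortest_path_length(MAZE, (0, 0), (0, 3))
```

With unit step costs this reduces to breadth-first search; the point is only that such a procedure optimizes the same objective (distance to the cheese) in every maze, wherever the cheese is.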
Do you think that a larger network, a different architecture, and sufficient RL would be capable of learning an explicit representation of a more “consequentialist” algorithm for maze-solving, like Dijkstra’s algorithm or A*?
Or do you think that this is ruled out, absent a radically different training process or system architecture?
This is a good question to think about. I think this possibility is basically ruled out, unless you change the architecture quite a bit. Search is very awkward to represent in deep conv nets, AFAICT.
Concretely, I think these models are plenty “strong” at this task:
The networks are trained until they get the cheese in nearly every episode.
The model we studied is at least 4x overparameterized for the training task. Langosco et al. trained a model which has a quarter of the channels at each layer in the network. This network also converges to getting the cheese every time.
Uli retrained his own networks (same architecture) and found them to exhibit similar behavior.
I don’t think there are commensurably good reasons to think this model is too weak. More speculatively, I don’t think people would have predicted in advance that it’s too weak, either. For example, before even considering any experiments, I watched the trajectory videos and looked at the reward curves in the paper. I personally did not think the network was too weak. It seemed very capable to me, and I still think it is.
consequentialist
I think shard agents can be very consequentialist. There’s a difference between “making decisions on the basis of modelled consequences” and “implementing a relatively crisp search procedure to optimize a fixed-across-situations objective function.”[1] “Utility-theoretic” is more the latter, or the agent being well-modelled as doing the latter.
I’ll make the stronger but less concrete prediction that the consequentialist view begins to outperform the shard view in predictive accuracy at or even below human-level capabilities (in general, not just for maze solving).
We run into obstacles here because this is probably not what some people mean when they say “utility-theoretic”. I sometimes don’t know what internal motivational structures people are positing by “utility-theoretic”, and am happy to consider the predictions made by any specific alternative grounding.
Thanks for the thoughtful response (and willingness to bet)!
Do you think that your results would replicate if applied to DreamerV3 trained on the same task? That’s the kind of thing I had in mind as a “stronger” model, not just more parameters.
Dreamer comprises three networks (world model, critic, and actor), but it is still a general reinforcement learning algorithm, so I think this is a relatively straightforward / fair comparison.
My prediction is no: specifically, given the same training environment, Dreamer would converge more quickly to getting the cheese every time in training, and then find the cheese more often than 69% of the time in test (maybe much more, or always).
I’d be willing to bet on this, and / or put up a small bounty for you or someone else to actually run this experiment, if you have a different prediction.
(For anyone who wants to try this: the code for Dreamer and for the cheese task is available, and the cheese task is a Gym environment, so it should be relatively straightforward to run this experiment.)
OTOH, if you agree with my prediction, I’d be interested in hearing why you think this isn’t a problem for shard theory. In general, I expect that DreamerV-N+1 and future SotA RL algorithms look even more like pure reward-function maximizers, given even less (and less general) training data.
Do you think that your results would replicate if applied to DreamerV3 trained on the same task?
Which results? Without having read more than a few figures from the paper:
I expect that DreamerV3 would be more sample efficient and so converge more quickly to getting the cheese every time during training.
I expect that DreamerV3 would still train policy nets which don’t get the cheese all the time in test, although I wouldn’t be surprised if it got it more than 69% of the time. I’d be somewhat surprised by >90%, and very surprised by >99%.
I think that, given the same architecture (deep conv net) and hyperparameter settings, probably the cheese vector replicates.
I think there’s also a good chance that the retargetability replicates (although using different channel numbers, of course; there was no grand reason why channel 55 was a cheese-tracking channel).
Again, IDK this particular alg; I’m just imagining good model-based RL + exploration → a policy net. (If DreamerV3 does planning using its WM at inference time, that seems like a more substantially different story.)
If you think we disagree enough to have a bet here, lmk and I’ll read the alg more next week to give you some odds.
That’s the kind of thing I had in mind as a “stronger” model, not just more parameters.
...
In general, I expect that DreamerV-N+1 and future SotA RL algorithms look even more like pure reward-function maximizers, given even less (and less general) training data.
The most capable systems today (LLMs) don’t rely on super fancy RL algs—they often use PPO or some variant—and they get stronger in large part by getting more data and parameters.
However, it’s still interesting to understand how diff training processes produce diff kinds of behavior. So I think the experiment you propose is interesting, but I think shard theory-for-LLMs ultimately doesn’t gain or lose a ton either way.
Again, IDK this particular alg; I’m just imagining good model-based RL + exploration → a policy net. (If DreamerV3 does planning using its WM at inference time, that seems like a more substantially different story.)
The world model is a recurrent state space model, and the actor model takes the latent state of the world model as input. But there’s no tree search or other hand-coded exploration going on during inference.
I expect that DreamerV3 would still train policy nets which don’t get the cheese all the time in test, although I wouldn’t be surprised if it got it more than 69% of the time. I’d be somewhat surprised by >90%, and very surprised by >99%.
This is the question I am most interested in. I’d bet at even odds that Dreamer-XL or even Dreamer-medium would get the cheese >90% of the time, and maybe 1:4 (20% implied probability) on >99%.
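As a quick check on the arithmetic behind those odds, here is a minimal sketch converting "a:b" betting odds into an implied probability. The helper name `implied_probability` is hypothetical, and it assumes the convention that "1:4" means one part for and four parts against.

```python
from fractions import Fraction


def implied_probability(parts_for, parts_against):
    """Implied probability of an event offered at parts_for:parts_against odds."""
    return Fraction(parts_for, parts_for + parts_against)


even_odds = implied_probability(1, 1)   # the >90% bet
long_odds = implied_probability(1, 4)   # the >99% bet
```

Under that convention, even odds correspond to 1/2 and 1:4 corresponds to 1/5, matching the 20% stated above.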
On other results:
Not sure if the recurrence makes some of the methods and results in the original post inapplicable or incomparable. But I do expect you can find cheese vectors and do analogous things with retargetability, perhaps by modifying the latent state of the world model, or by applying the techniques in the original post to the actor model.
I expect that many of the behavioral statistics detailed in this post mostly don’t replicate, primarily because p(cheese acquired) goes up dramatically. For episodes where the agent doesn’t get the cheese (if there are any), I’d be curious what they look like. I don’t have strong predictions here, but I wouldn’t be surprised if they look qualitatively different and are not well predicted by the three features here. I think some of the most interesting comparisons would be between mazes where both agents fail to get the cheese—do they end up in the same place, by the same path, for example?
I’d maybe point the finger more at the simplicity of the training task than at the size of the network? I’m not sure there’s strong reason to believe the network is underparameterized for the training task. But I agree that drawing lessons from small-ish networks trained on simple tasks requires caution.
I’d maybe point the finger more at the simplicity of the training task than at the size of the network?
I also predict that if you modify (improve?) the training process, perhaps only slightly, the behaviors you observe go away and you get a pure cheese-finder.
(Caveat: I’m not super familiar with the literature on goal mis-generalization and Langosco et al.; what follows is just based on my reading of this post and the previous ones in the sequence.)
From the previous post:
During RL training, cheese was randomly located in the top-right 5×5 corner of the randomly generated mazes. In deployment, cheese can be anywhere. What will the agent do?
The net is trained until it reaches the cheese basically every episode.
Concretely, I’m predicting that, if there were training examples where the cheese was located in, say, the bottom-right corner, you probably wouldn’t end up with an agent that sometimes goes to the top-right, sometimes to the bottom-right, and sometimes to the cheese, or even an agent that learns a “going right” shard (as a combination of the top-right and bottom-right shards), and a cheese-finding shard. The agent would just always, or nearly always, find the cheese in the test environment.
Or, if you want to make sure the training → test environment requires the same amount of generalization (by the metric of number of squares in which the cheese can appear in the training process vs. the test environment), fix 25 (or perhaps fewer) random squares where the cheese can appear throughout the maze during training, not restricted to the top-right.
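To make the two training distributions concrete, here is a small sketch of the candidate cheese squares under each scheme. It is an illustration only: the function names, the (row, column) convention with row 0 at the top, and the choice to ignore walls are all assumptions, not details of the actual environment.

```python
import random


def top_right_cells(size, n):
    """Squares (row, col) in the top-right n x n corner of a size x size grid."""
    return [(r, c) for r in range(n) for c in range(size - n, size)]


def fixed_random_cells(size, k, seed=0):
    """k distinct squares drawn once from the whole grid, then held fixed.

    Seeding the generator means the same k squares are reused for every
    training episode, which is what "fix 25 random squares" asks for.
    """
    rng = random.Random(seed)
    all_cells = [(r, c) for r in range(size) for c in range(size)]
    return rng.sample(all_cells, k)


original_zone = top_right_cells(25, 5)        # 25 squares, all in one corner
scattered_zone = fixed_random_cells(25, 25)   # 25 squares, spread over the grid
```

Both schemes expose the agent to the same number of cheese locations during training; only their spatial spread differs, which is the variable the proposed experiment isolates.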
Put differently, the behaviors in Statistically informed impressions seem relevant only in the regime where P(cheese acquired) is not close to 1. That seems like a pretty fragile / narrow / artificial condition, at least for maze-solving.
I’m looking forward to seeing more follow up work on this though. I do think there are a lot of interesting interpretability questions this kind of experimentation can answer. What happens if you subtract the cheese vector from a perfect cheese finder, for example?
I also predict that if you modify (improve?) the training process, perhaps only slightly, the behaviors you observe go away and you get a pure cheese-finder.
I anti-predict this for many slight changes. For example, in an above comment I wrote:
I think there’s a good chance that the following gets you something closer to an agent with a global cheese-shard, though:
Or, if you want to make sure the training → test environment requires the same amount of generalization (by the metric of number of squares in which the cheese can appear in the training process vs. the test environment), fix 25 (or perhaps fewer) random squares where the cheese can appear throughout the maze during training, not restricted to the top-right.
I’d be interested in how performance (defined as how often the agent goes to the cheese) in the test environment varies as you vary n in this experiment. For n = 5, this is 69.1%, right?
For reference, the agent gets the cheese in 69.1% of these mazes, and so a simple “always predict ‘gets the cheese’” predictor would get 69.1% accuracy.
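The baseline in that quote is the accuracy of a constant predictor. A minimal sketch, using made-up outcome counts chosen only to reproduce the 69.1% figure (they are not the post's data, and the function name is hypothetical):

```python
def constant_predictor_accuracy(outcomes):
    """Accuracy of always predicting the more common outcome.

    `outcomes` is a list of booleans, True meaning "got the cheese".
    """
    hits = sum(outcomes)
    return max(hits, len(outcomes) - hits) / len(outcomes)


# Hypothetical sample: 691 successes in 1000 test mazes.
outcomes = [True] * 691 + [False] * 309
baseline = constant_predictor_accuracy(outcomes)
```

Any behavioral model of the agent has to beat this number to be informative, and the bar rises toward 100% as P(cheese acquired) approaches 1.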
For the 15x15 agent, my prediction is that P(cheese acquired) is above 95%, though as you point out that’s kind of an unfair or at least not very meaningful test of generalization.
For an agent trained in an environment where the cheese can appear in the union of a 5x5 square in the top right and a 5x5 square in the bottom right (or even 4x4, to keep the “amount of generalization” roughly constant / consistent by some metric), I predict that the performance in the test environment is well over 69%. [edit: and further, that it is over the performance of whatever the 6x6 agent is, which I also predict is higher than 69%]
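A rough count of how many squares such a union covers, as a sketch under assumed conventions (row 0 is the top, walls are ignored, and the function names are hypothetical):

```python
def corner_cells(size, n, corner):
    """Squares in the n x n corner on the right edge; corner is 'top' or 'bottom'."""
    rows = range(n) if corner == "top" else range(size - n, size)
    return {(r, c) for r in rows for c in range(size - n, size)}


def union_zone(size, n):
    """Union of the top-right and bottom-right n x n corners."""
    return corner_cells(size, n, "top") | corner_cells(size, n, "bottom")


squares_5x5 = len(union_zone(25, 5))   # two disjoint 5x5 corners
squares_4x4 = len(union_zone(25, 4))   # two disjoint 4x4 corners
squares_small = len(union_zone(7, 5))  # corners overlap in a small maze
```

On a 25x25 grid the 5x5 union covers 50 squares and the 4x4 union covers 32, against 25 for the original top-right zone, so 4x4 is the closer match by this metric. In small mazes the two corners overlap and the union shrinks accordingly.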
For the cases where this agent doesn’t go to the cheese (if there are any such cases) in the test environment, I’d be very curious what shard theory predicts these look like. When the agent doesn’t find the cheese, is this because it sometimes ends up in the top right and sometimes in the bottom right? Or does it end up in the middle right? Something else? I have no strong prediction here, but I’m curious about what your predictions are and whether shard theory has anything to say here.
I’d be interested in how performance (defined as how often the agent goes to the cheese) in the test environment varies as you vary n in this experiment. For n = 5, this is 69.1%, right?
The original paper investigated this, actually. In the following, the y-axis shows P(gets cheese) * (10 reward for getting cheese).
(Note that even for n=15 I was able to find a few videos where the agent doesn’t go to the cheese. I don’t remember exactly where the agent went; I think it was up and right.)
For the 15x15 agent, my prediction is that P(cheese acquired) is above 95%, though as you point out that’s kind of an unfair or at least not very meaningful test of generalization.
Nice prediction!
I’d be very curious what shard theory predicts these look like. When the agent doesn’t find the cheese, is this because it sometimes ends up in the top right and sometimes in the bottom right? Or does it end up in the middle right? Something else? I have no strong prediction here, but I’m curious about what your predictions are and whether shard theory has anything to say here
Interesting question, thanks. My first impulse is: The agent ends up along some path which goes right (either up or down) but which doesn’t end up going to cheese. I don’t know whether I’d expect it to learn to go right in general, or to have both a top-right shard and a bottom-right shard, or something else entirely. I’m leaning towards the first one, where conditional on no cheese, the agent ends up going on some path that takes it really far right and also makes its y-position be either high or low.
This brings up something interesting: it seems worthwhile to compare the internals of a “misgeneralizing” small-n agent with those of large-n agents and check whether there seems to be a phase transition in how the network operates internally.
Another reason to think these models are strong enough: our analysis (e.g. of the cheese vector) qualitatively holds for a range of agents with different training distributions (trained with cheese in the top-right nxn corner, for n=2,...,15). Inspecting the vector fields shows that they are all, for example, locally attracted by cheese. Even the 15x15 agent goes to the top-right corner sometimes! (Levels are at most 25x25; a 15x15 zone is, in many maze sizes, equivalent to “the cheese can be anywhere.”)
I think the prediction that the consequentialist view will outperform the shard view at or below human-level capabilities is wrong, and I am willing to make a bet here if you want. I think that even e.g. GPT-4 is not well-predicted by a consequentialist view, and is well-predicted by modelling it as having shards. I also think the shard view outperforms the consequentialist view for humans themselves, who I currently think are relatively learned-from-scratch and trained via somewhat (but not perfectly) similar training processes.