I’d maybe point the finger more at the simplicity of the training task than at the size of the network?
I also predict that if you modify (improve?) the training process, perhaps only slightly, the behaviors you observe go away and you get a pure cheese-finder.
(Caveat: I’m not super familiar with the literature on goal mis-generalization and Langosco et al.; what follows is just based on my reading of this post and the previous ones in the sequence.)
From the previous post:
During RL training, cheese was randomly located in the top-right 5×5 corner of the randomly generated mazes. In deployment, cheese can be anywhere. What will the agent do?
The net is trained until it reaches the cheese basically every episode.
Concretely, I’m predicting that, if there were training examples where the cheese was located in, say, the bottom-right corner, you probably wouldn’t end up with an agent that sometimes goes to the top-right, sometimes to the bottom-right, and sometimes to the cheese, or even an agent that learns a “going right” shard (as a combination of the top-right and bottom-right shards), and a cheese-finding shard. The agent would just always, or nearly always, find the cheese in the test environment.
Or, if you want to make sure the training → test environment requires the same amount of generalization (by the metric of number of squares in which the cheese can appear in the training process vs. the test environment), fix 25 (or perhaps fewer) random squares where the cheese can appear throughout the maze during training, not restricted to the top-right.
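(To make that proposed setup concrete, here's a minimal sketch of the spawn rule I have in mind. All the names here are hypothetical and illustrative on my part — `GRID_SIZE`, `maze.is_open`, `place_cheese` — not anything from the actual training code.)

```python
import random

GRID_SIZE = 25           # hypothetical maze side length, just for illustration
NUM_CHEESE_SQUARES = 25  # same count as the original top-right 5x5 region

# Fix the allowed cheese squares once, before training, drawn uniformly
# from the whole grid rather than restricted to the top-right corner.
ALLOWED_CHEESE_SQUARES = random.sample(
    [(x, y) for x in range(GRID_SIZE) for y in range(GRID_SIZE)],
    NUM_CHEESE_SQUARES,
)

def place_cheese(maze):
    """Choose this episode's cheese square from the fixed allowed set.

    `maze.is_open` is a hypothetical reachability check; we only keep
    squares that are actually open in this particular maze layout.
    """
    candidates = [sq for sq in ALLOWED_CHEESE_SQUARES if maze.is_open(sq)]
    return random.choice(candidates)
```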
Put differently, the behaviors in Statistically informed impressions seem relevant only in the regime where P(cheese acquired) is not close to 1. That seems like a pretty fragile / narrow / artificial condition, at least for maze-solving.
I’m looking forward to seeing more follow up work on this though. I do think there are a lot of interesting interpretability questions this kind of experimentation can answer. What happens if you subtract the cheese vector from a perfect cheese finder, for example?
I also predict that if you modify (improve?) the training process, perhaps only slightly, the behaviors you observe go away and you get a pure cheese-finder.
I anti-predict this for many slight changes. For example, in an above comment I wrote:
I think there’s a good chance that the following gets you something closer to an agent with a global cheese-shard, though:
Or, if you want to make sure the training → test environment requires the same amount of generalization (by the metric of number of squares in which the cheese can appear in the training process vs. the test environment), fix 25 (or perhaps fewer) random squares where the cheese can appear throughout the maze during training, not restricted to the top-right.
I’d be interested in how performance (defined as how often the agent goes to the cheese) in the test environment varies as you vary n in this experiment. For n = 5, this is 69.1%, right?
For reference, the agent gets the cheese in 69.1% of these mazes, and so a simple “always predict ‘gets the cheese’” predictor would get 69.1% accuracy.
For the 15x15 agent, my prediction is that P(cheese acquired) is above 95%, though as you point out that’s kind of an unfair or at least not very meaningful test of generalization.
For an agent trained in an environment where the cheese can appear in the union of a 5x5 square in the top right and a 5x5 square in the bottom right (or even 4x4 squares, to keep the “amount of generalization” roughly constant by some metric), I predict that performance in the test environment is well over 69%. [edit: and further, that it is above the performance of the 6x6 agent, which I also predict is higher than 69%]
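(Purely as a sketch of what I mean by that training distribution, with hypothetical names — `in_region` and `allowed_cheese_square` aren't from the actual codebase, and I'm assuming x grows to the right and y grows upward:)

```python
GRID_SIZE = 25  # hypothetical maze side length, as above

def in_region(square, x_range, y_range):
    """True if `square` lies inside the given inclusive coordinate ranges."""
    x, y = square
    return x_range[0] <= x <= x_range[1] and y_range[0] <= y <= y_range[1]

def allowed_cheese_square(square):
    """Cheese may spawn in the top-right 5x5 or the bottom-right 5x5."""
    top_right = in_region(square, (GRID_SIZE - 5, GRID_SIZE - 1),
                          (GRID_SIZE - 5, GRID_SIZE - 1))
    bottom_right = in_region(square, (GRID_SIZE - 5, GRID_SIZE - 1), (0, 4))
    return top_right or bottom_right
```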
For the cases where this agent doesn’t go to the cheese (if there are any such cases) in the test environment, I’d be very curious what shard theory predicts these look like. When the agent doesn’t find the cheese, is this because it sometimes ends up in the top right and sometimes in the bottom right? Or does it end up in the middle right? Something else? I have no strong prediction here, but I’m curious about what your predictions are and whether shard theory has anything to say here.
I’d be interested in how performance (defined as how often the agent goes to the cheese) in the test environment varies as you vary n in this experiment. For n = 5, this is 69.1%, right?
The original paper investigated this, actually; in its plot, the y-axis shows P(gets cheese) × 10 (the reward for getting the cheese).
(Note that even for n=15 I was able to find a few videos where the agent doesn't go to the cheese. I don't remember exactly where the agent went; I think it was up and to the right.)
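(For reference, the quantity being compared across n is just an average success rate over sampled test mazes; a minimal sketch, where `make_test_maze` and `run_episode` are hypothetical stand-ins for the actual environment and rollout code:)

```python
import random

def get_cheese_rate(agent, n_mazes=1000, seed=0):
    """Estimate P(gets cheese) on test mazes where cheese can be anywhere."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_mazes):
        maze = make_test_maze(rng)             # hypothetical maze generator
        successes += run_episode(agent, maze)  # hypothetical: 1 if cheese reached, else 0
    return successes / n_mazes

# The paper's y-axis is then roughly get_cheese_rate(agent) * 10,
# since reaching the cheese gives a reward of 10.
```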
For the 15x15 agent, my prediction is that P(cheese acquired) is above 95%, though as you point out that’s kind of an unfair or at least not very meaningful test of generalization.
Nice prediction!
I’d be very curious what shard theory predicts these look like. When the agent doesn’t find the cheese, is this because it sometimes ends up in the top right and sometimes in the bottom right? Or does it end up in the middle right? Something else? I have no strong prediction here, but I’m curious about what your predictions are and whether shard theory has anything to say here
Interesting question, thanks. My first impulse is: the agent ends up on some path which goes right (either up or down) but which doesn't lead to the cheese. I don't know whether I'd expect it to learn to go right in general, or to have both a top-right shard and a bottom-right shard, or something else entirely. I'm leaning towards the first: conditional on not getting the cheese, the agent ends up on some path that takes it very far right and also leaves its y-position either high or low.
This brings up something interesting: it seems worthwhile to compare the internals of a ‘misgeneralizing’ small-n agent with those of large-n agents, and to check whether or not there is a phase transition in how the network operates internally.