Scott Emmons comments on Predictions for shard theory mechanistic interpretability results

Scott Emmons 3 Mar 2023 21:31 UTC
LW: 11 AF: 9
Neat experimental setup. Goal misgeneralization is one of the things I’m most worried about in advanced AI, so I’m excited to see you studying it in more detail!
I want to jot down my freeform analysis of what I expect to happen. (I wrote these predictions independently, without looking at anyone else’s analysis.)
In very small mazes, I think the mouse will behave as if it’s following this algorithm: find the shortest path to the cheese location. In very large mazes, I think the mouse will behave as if it’s following this algorithm: first, go to the top-right region of the maze. Then, go to the exact location of the cheese. As we increase the maze size, I expect the mouse to have a phase transition from the first behavior to the second behavior. I don’t know at exactly what size the phase transition will occur.
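To make the two hypothesized behaviors concrete, here is a minimal sketch in Python. It is not the policy network's actual computation, only an illustration of the "as if" algorithms described above. The maze representation (a set of open `(x, y)` cells with `y` increasing upward), the function names, and the choice of waypoint for "the top-right region" are all assumptions made for illustration. It also assumes the maze is connected.

```python
from collections import deque


def shortest_path(open_cells, start, goal):
    """Breadth-first search over open (x, y) cells; returns a list of cells or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in open_cells and nxt not in parents:
                parents[nxt] = cell
                queue.append(nxt)
    return None


def small_maze_route(open_cells, start, cheese):
    """Hypothesized small-maze behavior: go straight to the cheese."""
    return shortest_path(open_cells, start, cheese)


def large_maze_route(open_cells, start, cheese):
    """Hypothesized large-maze behavior: top-right first, then the cheese."""
    # Assumption: the "top-right region" is stood in for by the single open
    # cell with the largest x + y (ties broken by larger x).
    waypoint = max(open_cells, key=lambda c: (c[0] + c[1], c[0]))
    to_corner = shortest_path(open_cells, start, waypoint)
    to_cheese = shortest_path(open_cells, waypoint, cheese)
    # Drop the duplicated waypoint when joining the two legs.
    return to_corner + to_cheese[1:]
```

When the cheese is far from the top-right, the second route is much longer than the first, which is what the phase-transition prediction would look like in measured path lengths.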
I expect that for very small mazes, the mouse will learn how to optimally get to the cheese, no matter where the cheese is.
- Prediction: (80% confidence) I think we’ll be able to edit some part of the mouse’s neural network (say, <10% of its activations) so that it goes to arbitrary locations in very small mazes.
I expect that for very large mazes, the mouse will act as follows: it will first just try to go to the top-right region of the maze. Once it gets to the top-right region of the maze, it will start trying to find the cheese exactly. My guess is that there’s a trigger in the model’s head for when it switches from going to the top-right corner to finding the cheese exactly. I’d guess this trigger activates either when the mouse is in the top-right corner of the maze, or when the mouse is near the cheese. (Or perhaps a mixture of both these triggers exists in the model’s head.)
- Prediction: (75% confidence) The mouse will struggle to find cheese in the top-left and bottom-right of very large mazes (i.e., if we put the cheese in the top-left or bottom-right of the maze, the model will have <33% success rate of reaching it within the average number of steps it takes the mouse to reach the cheese in the top-right corner). I think the mouse will go to the top-right corner of the maze in these cases.
- Prediction: (75% confidence) We won’t be able to easily edit the model’s activations to make the mouse go to the top-left or bottom-right of very large mazes. Concretely, the team doing this project won’t find <10% of network activations that they can edit to make the mouse reach cheese in the top-left or bottom-right of the maze with >= 33% success rate within the average number of steps it takes the mouse to reach the cheese in the top-right corner.
- Prediction: (55% confidence) I weakly believe that if we put the cheese in the bottom-left corner of a very large maze, the mouse will go to the cheese. (I.e., the mouse will quickly find the cheese in 50% of very large mazes with cheese in the bottom-left corner.) I weakly think that there will be a trigger in the model’s head that recognizes that it is close to the cheese at the start of the episode, and that this trigger will activate the cheese-finding mode. But I only weakly believe this. I think it’s possible that this trigger doesn’t fire, and instead the mouse just goes to the top-right corner of the maze when the cheese starts out in the bottom-left corner.
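The success criterion used in the predictions above can be scored roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, each episode is recorded as the number of steps taken to reach the cheese (or `None` if it was never reached), and the "average number of steps" baseline is computed over successful top-right episodes only, which the predictions do not specify.

```python
def mean_steps(steps_to_cheese):
    """Average step count over the episodes that reached the cheese."""
    reached = [s for s in steps_to_cheese if s is not None]
    if not reached:
        raise ValueError("no successful episodes to average over")
    return sum(reached) / len(reached)


def success_rate_within_budget(steps_to_cheese, budget):
    """Fraction of episodes that reached the cheese in at most `budget` steps."""
    if not steps_to_cheese:
        raise ValueError("need at least one episode")
    hits = sum(1 for s in steps_to_cheese if s is not None and s <= budget)
    return hits / len(steps_to_cheese)


def scores_against_baseline(baseline_episodes, test_episodes):
    """Success rate on test episodes, using the baseline's mean steps as the budget."""
    budget = mean_steps(baseline_episodes)
    return success_rate_within_budget(test_episodes, budget)
```

Here `baseline_episodes` would hold the runs with cheese in the top-right corner and `test_episodes` the runs with cheese in another corner; the returned rate is then compared against the 33% (or 50%) threshold.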
Another question is: Will we be able to infer the exact cheese location by just looking at the model’s internal activations?
- Prediction: (80% confidence) Yes. The cheese is easy for a convnet to see (it’s distinct in color from everything else), and it’s key information for solving the task. So I think the policy will learn to encode this information. Idk about exactly which layer(s) in the network will contain this information. My prediction is that the team doing this project will be able to probe at least one of the layers of the network to obtain the exact location of the cheese with >90% accuracy, for all maze sizes.
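One simple way to check a prediction like this is to fit a probe on one layer's activations and measure how often it recovers the cheese cell on held-out mazes. The sketch below uses a nearest-centroid probe in plain Python as a stand-in for the logistic-regression probes more commonly used; the function names and the data layout (one flat activation vector per maze, one `(x, y)` cheese cell per maze) are assumptions, not the project's actual setup.

```python
def fit_centroid_probe(activations, cheese_cells):
    """Average the activation vectors observed for each cheese cell."""
    sums, counts = {}, {}
    for vec, cell in zip(activations, cheese_cells):
        if cell not in sums:
            sums[cell] = [0.0] * len(vec)
            counts[cell] = 0
        sums[cell] = [a + b for a, b in zip(sums[cell], vec)]
        counts[cell] += 1
    return {cell: [v / counts[cell] for v in total] for cell, total in sums.items()}


def predict_cell(centroids, vec):
    """Return the cheese cell whose centroid is closest to `vec`."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, vec))
    return min(centroids, key=lambda cell: squared_distance(centroids[cell]))


def probe_accuracy(centroids, activations, cheese_cells):
    """Fraction of examples where the probe recovers the exact cheese cell."""
    correct = sum(
        predict_cell(centroids, vec) == cell
        for vec, cell in zip(activations, cheese_cells)
    )
    return correct / len(cheese_cells)
```

The accuracy should be computed on mazes the probe was not fitted on, and separately for each maze size, since the prediction is about >90% accuracy for all maze sizes.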
What links here?
- Scott Emmons's comment on Understanding and controlling a maze-solving policy network by TurnTrout (11 Mar 2023 23:52 UTC; 3 points)