Behavioral
1. Describe how the trained policy might generalize from the 5x5 top-right cheese region to cheese spawned throughout the maze. That is, what will the policy do when the cheese is spawned elsewhere?
I expect the network to simultaneously be learning several different algorithms.
One method works via diffusion from the cheese and the mouse, extracting local connectivity information from fine-grained pixels into coarse-grained channels; a toy sketch of this diffusion picture follows the Q1 answer below. This will work better the closer the cheese is to the mouse, but because of the relative lack of training data on having to move down/left, performance will drop off faster with distance when the cheese is down/left of the mouse.
Meanwhile, it will also be learning heuristics like “get to the top right corner first,” in addition to diffusion.
I expect that if the cheese spawns outside the top right, there will be some mouse-to-cheese distance threshold (longer below/right of the cheese): within that distance, a diffusion-like algorithm wins and the mouse goes to the cheese; beyond it, the other heuristics win and the mouse goes to the top-right corner.
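To make the "diffusion" picture concrete, here is a minimal sketch of what I mean, on a toy grid maze. This is my own illustration of the hypothesized algorithm, not a claim about the network's actual computation; the grid encoding and the `walkable` array are made up for the example.

```python
from collections import deque

import numpy as np


def diffuse_from(source, walkable):
    """Breadth-first "diffusion" of step-distances outward from a source square.

    walkable: 2D boolean array, True where the mouse can stand.
    source: (row, col) of the cheese (or of the mouse).
    Returns an array of shortest-path distances (inf where unreachable).
    """
    dist = np.full(walkable.shape, np.inf)
    dist[source] = 0.0
    queue = deque([source])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < walkable.shape[0] and 0 <= nc < walkable.shape[1]
                    and walkable[nr, nc] and np.isinf(dist[nr, nc])):
                dist[nr, nc] = dist[r, c] + 1
                queue.append((nr, nc))
    return dist


# Toy loop-free maze: True = open corridor, False = wall.
walkable = np.array([
    [1, 1, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 0, 1],
], dtype=bool)
cheese_dist = diffuse_from((0, 4), walkable)  # cheese in the top right
mouse_dist = diffuse_from((4, 0), walkable)   # mouse in the bottom left
```

A greedy walk that always steps to the neighbor with the smallest `cheese_dist` recovers the shortest path; the conjecture is that the convolutional stack computes something functionally similar by repeated local propagation through its channels.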
2. Given a fixed trained policy, what attributes of the level layout (e.g. size of the maze, proximity of mouse to left wall) will strongly influence P(agent goes to the cheese)?
Size definitely matters: bigger is harder. Topology doesn’t. The local number of branches and dead ends might. Positioning should matter in the way described in Q1.
3. Write down a few guesses for how the trained algorithm works (e.g. “follows the right-hand rule”).
Whoops, I did this at the start. When diffusion is working well, it should just take short paths, with no right-hand-wall shenanigans. It might get confused when nearby paths carry similar connectivity information that it has to tell apart.
4. Is there anything else you want to note about how you think this model will generalize?
You might also be able to get the agent to do weird power-seeking by artificially constructing misleading corridors with high connectivity (this works better far from the cheese).
Interpretability
Give a credence for the following questions / subquestions.
Definition. A decision square is a tile on the path from bottom-left to top-right where the agent must choose between going towards the cheese and going to the top-right. Not all mazes have decision squares.
The first maze’s decision square is the four-way intersection near the center.
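Since the mazes are loop-free, the decision square can be computed directly: it is the last square shared by the unique path to the cheese and the unique path to the top-right corner, provided the two paths actually diverge. A minimal sketch, reusing the hypothetical `diffuse_from` helper from the diffusion sketch above (again just an illustration of the definition, not of the network):

```python
def unique_path(start, goal, walkable):
    """Unique path between two squares in a loop-free (tree-shaped) maze."""
    dist = diffuse_from(goal, walkable)
    path, (r, c) = [start], start
    while (r, c) != goal:
        # Step to the one neighbor that is closer to the goal.
        r, c = min(
            ((r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
             if 0 <= r + dr < walkable.shape[0]
             and 0 <= c + dc < walkable.shape[1]
             and walkable[r + dr, c + dc]),
            key=lambda square: dist[square],
        )
        path.append((r, c))
    return path


def decision_square(start, cheese, top_right, walkable):
    """Last square shared by the cheese path and the top-right path, if any."""
    to_cheese = unique_path(start, cheese, walkable)
    to_corner = unique_path(start, top_right, walkable)
    shared = None
    for a, b in zip(to_cheese, to_corner):
        if a != b:
            break
        shared = a
    # No decision square when one path is a prefix of the other
    # (e.g. the cheese sits on the path to the corner).
    if shared in (to_cheese[-1], to_corner[-1]):
        return None
    return shared
```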
Model editing
Without proportionally reducing top-right corner attainment by more than 25% in decision-square-containing mazes (e.g. 50% → 50% × 0.75 = 37.5%), we can patch activations so that the agent has an X% proportional reduction in cheese acquisition, for X=
50: (92%)
70: (85%)
90: (70%)
99: (55%)
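As one concrete way such patching experiments could be run: record activations at a chosen layer on a counterfactual observation (say, the same maze with the cheese removed) and splice them into the forward pass on the real observation, using standard PyTorch forward hooks. This is only a sketch of the general recipe; the choice of layer and counterfactual is an assumption, not a prediction about which patch will actually work.

```python
import torch


def patch_activations(policy, layer, obs_real, obs_counterfactual):
    """Run the policy on obs_real, but with `layer`'s activations replaced by
    the ones it produces on obs_counterfactual (e.g. the cheese-less maze)."""
    stored = {}

    def record(_module, _inputs, output):
        stored["act"] = output.detach()

    def overwrite(_module, _inputs, _output):
        return stored["act"]  # returning a value from a forward hook swaps the output

    handle = layer.register_forward_hook(record)
    with torch.no_grad():
        policy(obs_counterfactual)          # pass 1: record counterfactual activations
    handle.remove()

    handle = layer.register_forward_hook(overwrite)
    with torch.no_grad():
        patched_logits = policy(obs_real)   # pass 2: forward with the patch in place
    handle.remove()
    return patched_logits
```

Here `layer` is any submodule of the policy (e.g. looked up via `dict(policy.named_modules())[...]`); the proportional reduction in cheese acquisition would then be measured by rolling out the patched policy over many mazes.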
~Halfway through the network (the first residual add of Impala block 2; see diagram here), linear probes achieve >70% accuracy for recovering cheese-position in Cartesian coordinates: (70%)
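For reference, the probe experiment I have in mind is the standard one: fit a linear map from mid-network activations to the cheese's (x, y) coordinates and score it on held-out observations. The sketch below uses random stand-in data and my own scoring convention (a prediction counts as correct if it rounds to the exact cheese square); both are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Stand-in data. In the real experiment, `acts` would be activations collected
# at the first residual add of Impala block 2 (flattened per observation) and
# `cheese_xy` the cheese's grid coordinates for the same observations.
rng = np.random.default_rng(0)
acts = rng.normal(size=(5000, 512))          # (n_samples, n_features)
cheese_xy = rng.integers(0, 25, (5000, 2))   # (n_samples, 2)

X_train, X_test, y_train, y_test = train_test_split(
    acts, cheese_xy, test_size=0.2, random_state=0)

probe = Ridge(alpha=1.0).fit(X_train, y_train)   # the linear probe
pred = probe.predict(X_test)

# Score: the prediction rounds to exactly the right square in both coordinates.
accuracy = np.mean(np.all(np.rint(pred) == y_test, axis=1))
print(f"probe accuracy: {accuracy:.1%}")
```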
We will conclude that the policy contains at least two sub-policies in “combination”, one of which roughly pursues cheese; the other, the top-right corner: (conclude what you want%)
In order to make the network more/less likely to go to the cheese, we will conclude that it’s more promising to RL-finetune the network than to edit it: (conclude what you want%)
We can easily finetune the network to be a pure cheese-agent, using less than 10% of the compute used to train the original model: (0.001%. The heuristics will just work better over a broader distribution of environments; you’ll still be able to confuse the agent by broadening the environment class even further.)
We can easily edit the network to navigate to a range of maze destinations (e.g. coordinate x=4, y=7), by hand-editing at most X% of activations, for X=
.01 (35%)
.1 (60%)
1 (80%)
10 (90%)
(Not possible) (7%)
Internal goal representation
The network has a “single mesa objective” which it “plans” over, in some reasonable sense (0.5%)
The agent has several contextually activated goals (depends on your definition%)
The agent has something else weirder than both (1) and (2) (99%)
(The above credences should sum to 1.)
Other questions
At least some decision-steering influences are stored in an obviously interpretable manner (e.g. a positive activation representing where the agent is “trying” to go in this maze, such that changing the activation changes where the agent goes): (Are you counting the low-layer detection of the cheese? In that case, around 99%. Or do you mean the inputs to the linear layer? In that case, 15%.)
The model has a substantial number of trivially-interpretable convolutional channels after the first Impala block (see diagram here): (80%)
This network’s shards/policy influences are roughly disjoint from the rest of the agent’s capabilities, e.g. you can edit/train what the agent is trying to do (such as go to maze location A) without affecting its general maze-solving abilities: (~12%, if you’re trying to do something more nontrivial than editing where it perceives the cheese.)
Conformity with update rule
Related: Reward is not the optimization target.
This network has a value head, which PPO uses to provide policy gradients. How often does the trained policy put maximal probability on the action which maximizes the value head? For example, if the agent can go left to a value 5 state, and go right to a value 10 state, the value and policy heads “agree” if right is the policy’s most probable action.
(Remember that since mazes are simply connected, there is always a unique shortest path to the cheese.)
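To pin down the statistic being predicted below: for each (maze, mouse position) pair, check whether the policy's most probable action is also the action whose successor state the value head scores highest. A minimal sketch, where the environment interface (`set_mouse`, `observe`, `step_preview`) is hypothetical and the policy and value head are treated as separate observation-to-output callables:

```python
import torch


def agreement_rate(policy, value_head, env, positions):
    """Fraction of positions where the policy's argmax action matches the
    action leading to the successor state with the highest predicted value."""
    agree = 0
    for pos in positions:
        env.set_mouse(pos)                                    # hypothetical setter
        obs = torch.as_tensor(env.observe(), dtype=torch.float32)[None]
        with torch.no_grad():
            logits = policy(obs)
            successor_values = [
                value_head(torch.as_tensor(env.step_preview(a),
                                           dtype=torch.float32)[None]).item()
                for a in range(logits.shape[-1])              # one preview per action
            ]
        policy_action = int(logits.argmax(dim=-1))
        value_action = max(range(len(successor_values)),
                           key=successor_values.__getitem__)
        agree += int(policy_action == value_action)
    return agree / len(positions)
```

Averaging this over decision squares only, over all valid positions, and over the training distribution gives the three quantities asked about below.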
At decision squares in test mazes where the cheese can be anywhere, the policy will put max probability on the maximal-value action at least X% of the time, for X=
25 (98%)
50 (95%)
75 (85%)
95 (65%)
99.5 (45%)
In test mazes where the cheese can be anywhere, averaging over mazes and valid positions throughout those mazes, the policy will put max probability on the maximal-value action at least X% of the time, for X=
25 (98%)
50 (95%)
75 (80%)
95 (45%)
99.5 (25%)
In training mazes where the cheese is in the top-right 5x5, averaging over both mazes and valid positions in the top-right 5x5 corner, the policy will put max probability on the maximal-value action at least X% of the time, for X=
25 (99.3%)
50 (97%)
75 (90%)
95 (80%)
99.5 (70%)