Here are my predictions, from an earlier template. I haven’t looked at anyone else’s predictions before posting :)
Describe how the trained policy might generalize from the 5x5 top-right cheese region to cheese spawned throughout the maze. I.e., what will the policy do when cheese is spawned elsewhere?
It probably has hardcoded “go up and to the right” as an initial heuristic, so I’d be surprised if it gets the cheese in the other two quadrants more than 30% of the time (for cheese locations selected uniformly at random from those quadrants).
Given a fixed trained policy, what attributes of the level layout (e.g. size of the maze, proximity of mouse to left wall), if any, will strongly influence P(agent goes to the cheese)?
Smaller mazes: more likely agent goes to cheese
Proximity of mouse to left wall: slightly more likely agent goes to cheese, because it has just hardcoded “up and to the right”
Cheese closer to the top-right quadrant’s edges in L2 distance: more likely agent goes to cheese
The cheese can be gotten by moving only up and/or to the right (even though it’s not in the top-right quadrant): more likely to get cheese
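The last attribute above — whether the cheese is reachable by moving only up and/or right — is easy to check mechanically. A minimal sketch, assuming a hypothetical encoding of the maze as a set of open (x, y) cells (not the project’s actual data format):

```python
from collections import deque

def reachable_up_right(open_cells, start, cheese):
    """BFS that only allows 'right' (+x) and 'up' (+y) moves.

    open_cells: set of (x, y) walkable coordinates (hypothetical encoding).
    Returns True iff the cheese is reachable moving only up and/or right.
    """
    frontier = deque([start])
    seen = {start}
    while frontier:
        x, y = frontier.popleft()
        if (x, y) == cheese:
            return True
        for nxt in ((x + 1, y), (x, y + 1)):  # right, up only
            if nxt in open_cells and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False
```

If this returns True, the “up and to the right” heuristic alone suffices to collect the cheese, which is why I expect this attribute to raise P(agent goes to the cheese).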
When we statistically analyze a large batch of randomly generated mazes, we will find that, controlling for the other factors on the list, the mouse is more likely to take the cheese…
…the closer the cheese is to the decision-square spatially. (70%)
…the closer the cheese is to the decision-square step-wise. (73%)
…the closer the cheese is to the top-right free square spatially. (90%)
…the closer the cheese is to the top-right free square step-wise. (92%)
…the closer the decision-square is to the top-right free square spatially. (35%)
…the closer the decision-square is to the top-right free square step-wise. (32%)
…the shorter the minimal step-distance from cheese to the 5×5 top-right corner area. (82%)
…the shorter the minimal spatial distance from cheese to the 5×5 top-right corner area. (80%)
…the shorter the minimal step-distance from decision-square to the 5×5 top-right corner area. (40%)
…the shorter the minimal spatial distance from decision-square to the 5×5 top-right corner area. (40%)
Any predictive power of step-distance between the decision square and cheese is an artifact of the shorter chain of ‘correct’ stochastic outcomes required to take the cheese when the step-distance is short. (40%)
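The “controlling for the other factors” analysis above amounts to a multivariate logistic regression over the batch of mazes. A minimal numpy sketch, where the feature matrix and outcome labels are hypothetical stand-ins for the listed distances and for whether the mouse took the cheese:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Plain gradient-descent logistic regression (minimal sketch).

    X: (n, d) feature matrix, e.g. one column per distance factor above.
    y: (n,) binary outcomes (1 = mouse took the cheese).
    Returns the weight vector (intercept last); each weight is that
    factor's effect controlling for the others.
    """
    Xb = np.hstack([X, np.ones((len(X), 1))])  # add intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))       # predicted P(cheese)
        w -= lr * Xb.T @ (p - y) / len(y)        # gradient step
    return w
```

The sign and magnitude of each fitted weight would then bear on the corresponding prediction above.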
Write down a few modal guesses for how the trained algorithm works (e.g. “follows the right-hand rule”).
The model can see the whole maze, so it will not follow the right-hand rule; rather, it’ll just take the direct path to places.
The model takes the direct path to the top-right square and then mills around there. It’ll only take the cheese if the cheese is reasonably close to that square.
How close the decision square is to the top-right square doesn’t really matter. If anything, the closer it is, the more it may harm the agent’s performance, since the agent might have to backtrack substantially to get the cheese.
Without proportionally reducing top-right-corner attainment by more than 25% in decision-square-containing mazes (e.g. 50% → 0.5 × 0.75 = 37.5%), we can patch activations so that the agent has an X% proportional reduction in cheese acquisition, for X=
50: 85%
70: 80%
90: 66%
99: 60%
~Halfway through the network (the first residual add of Impala block 2; see diagram here), linear probes achieve >70% accuracy for recovering cheese-position in Cartesian coordinates:
80%
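For concreteness, the linear-probe prediction above could be tested with ridge regression from the flattened activations to the cheese’s Cartesian coordinates. A sketch under assumed array shapes — the step that actually extracts activations at the first residual add of Impala block 2 is omitted:

```python
import numpy as np

def train_position_probe(acts, cheese_xy, reg=1e-3):
    """Linear probe: ridge regression from flattened activations to
    cheese (x, y) coordinates.  Sketch only; assumes the activations
    have already been extracted from the network.

    acts: (n, d) activation matrix.  cheese_xy: (n, 2) coordinates.
    Returns a (d, 2) weight matrix W with acts @ W ≈ cheese_xy.
    """
    d = acts.shape[1]
    # Closed-form ridge solution: (AᵀA + λI)⁻¹ Aᵀ y
    return np.linalg.solve(acts.T @ acts + reg * np.eye(d),
                           acts.T @ cheese_xy)
```

Probe accuracy on held-out mazes (e.g. fraction of predictions within one square of the true cheese position) would then be the quantity the >70% prediction is about.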
We will conclude that the policy contains at least two sub-policies in “combination”, one of which roughly pursues cheese; the other, the top-right corner:
60%.
If by roughly you mean “very roughly only if cheese is close to top-right corner” then 85%.
We will conclude that it’s more promising to finetune the network than to edit it:
70%
We can easily finetune the network to be a pure cheese-agent, using less than 10% of the compute used to train the original model:
85%
We can easily edit the network to navigate to a range of maze destinations (e.g. coordinate x=4, y=7), by hand-editing at most X% of activations, for X=
.01%: 40%
.1%: 62%
1%: 65%
10%: 80%
(Not possible): 20%
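As a toy illustration of what “hand-editing at most X% of activations” could mean operationally, here is a numpy sketch that overwrites a chosen fraction of one layer’s activations with target values (say, activations recorded from a maze whose cheese sits at the desired destination); the selection rule and setup are hypothetical, not the project’s actual method:

```python
import numpy as np

def patch_fraction(acts, target_acts, frac):
    """Overwrite the `frac` fraction of activation entries that differ
    most from the target activations.  Toy stand-in for hand-editing
    X% of a layer's activations; does not modify `acts` in place.
    """
    flat, tgt = acts.ravel().copy(), target_acts.ravel()
    k = max(1, int(frac * flat.size))
    idx = np.argsort(-np.abs(flat - tgt))[:k]  # largest discrepancies
    flat[idx] = tgt[idx]
    return flat.reshape(acts.shape)
```

The predictions above are then about how small `frac` can be while still redirecting the agent to a chosen maze destination.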
(1) The network has a “single mesa objective” which it “plans” over, in some reasonable sense:
10%
(2) The agent has several contextually activated goals:
20%
(3) The agent has something else, weirder than both (1) and (2):
70%
Other questions
At least some decision-steering influences are stored in an obviously interpretable manner (e.g. a positive activation representing where the agent is “trying” to go in this maze, such that changing the activation changes where the agent goes):
90% (I think this will be true but not steer the action in all situations, only some; kind of like a shard)
The model has a substantial number of trivially-interpretable convolutional channels after the first Impala block (see diagram here):
55% (a “substantial number” is probably too many; I put 80% probability on it having 5 such channels)
This network’s shards/policy influences are roughly disjoint from the rest of the agent’s capabilities. E.g., you can edit/train what the agent’s trying to do (e.g. go to maze location A) without affecting its general maze-solving abilities:
60%
Conformity with update rule: see the PredictionBook questions.