I think this is enough to form a hypothesis about how the network works and how the goal misgeneralization happens (a toy sketch of the mechanism follows below):
1. Somewhere inside the model, there is a set of individual components that respond to different inputs, and when they activate, they push for a particular action. Channel 121 is an example of such a component.
2. The last layers somehow aggregate information from all of the individual components.
3. Components sometimes activate for the action that leads to the cheese and sometimes for the action that leads to the top right corner.[9]
4. If the aggregated “push” for the action leading to the cheese is higher than for the action leading to the top right corner, the mouse goes to the cheese. Otherwise, it goes to the top right corner.
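To make the hypothesis concrete, here is a minimal toy sketch in Python. It is not the actual network: the component names (`cheese_detector`, `channel_121_like`, `corner_heuristic`) and the push strengths are made up for illustration. The sketch only shows the shape of the claim: contextually activated components each contribute a “push” toward an action, the final layers sum those pushes, and the action with the larger total wins.

```python
# Toy sketch of the hypothesis, not the real network: hypothetical
# "components" each contribute a push toward one action, the final
# layers sum the pushes, and the strongest total decides the move.

ACTIONS = ["toward_cheese", "toward_top_right"]

def component_pushes(observation):
    """Return each component's contribution to the action preferences.

    In the real network these would be channels like channel 121, each
    activating on some feature of the maze; here the activations are
    hard-coded toy numbers.
    """
    return {
        "cheese_detector":  {"toward_cheese": 1.25, "toward_top_right": 0.0},
        "channel_121_like": {"toward_cheese": 0.0,  "toward_top_right": 1.0},
        "corner_heuristic": {"toward_cheese": 0.0,  "toward_top_right": 0.5},
    }

def decide(observation):
    """Aggregate all pushes (the 'last layers') and pick the strongest action."""
    totals = {a: 0.0 for a in ACTIONS}
    for pushes in component_pushes(observation).values():
        for action, strength in pushes.items():
            totals[action] += strength
    return max(totals, key=totals.get), totals

if __name__ == "__main__":
    action, totals = decide(observation=None)  # toy observation, unused here
    print(totals)  # {'toward_cheese': 1.25, 'toward_top_right': 1.5}
    print(action)  # 'toward_top_right'
```

In this toy setting the aggregated push toward the top right corner (1.5) beats the push toward the cheese (1.25), so the mouse goes to the top right corner, which is exactly the misgeneralization described above.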
I think this is basically a shard theory picture/framing of how the network works: Inside the model there are multiple motivational circuits (“shards”) which are contextually activated (i.e. step 3) and whose outputs are aggregated into a final decision (i.e. step 4).
This is really cool. Great followup work!
> I think this is basically a shard theory picture/framing of how the network works: Inside the model there are multiple motivational circuits (“shards”) which are contextually activated (i.e. step 3) and whose outputs are aggregated into a final decision (i.e. step 4).
Thanks! Indeed, shard theory fits here pretty well. I didn’t think about that while writing the post.