The main thing I want to do now is replicate the results from a particular paper, whose name I can’t remember right now, in which an RL agent was trained to navigate to a cheese in the top right corner of a maze. I would then apply this method to the training gradients and see whether we can locate which parameters are responsible for the if bottom_left(), then navigate_to_top_right() cognition and which are responsible for the if top_right(), then navigate_to_cheese() cognition. The two should be distinguishable by their time-step distributions.
That is, if bottom_left(), then navigate_to_top_right() should be associated with reinforcement events earlier in training rather than later, so the left singular vectors locating the parameters responsible for that computation should have corresponding right singular vectors whose entries are large in magnitude at early time steps and small in magnitude at late ones. Similarly, if top_right(), then navigate_to_cheese() should be associated with reinforcement events later in training, so the opposite should hold.
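To make the early-versus-late criterion concrete, here is a minimal sketch on synthetic data. It assumes the gradients are stacked into a matrix with one row per parameter and one column per training step, so that left singular vectors index parameters and right singular vectors index time steps. The function name early_late_scores, the half-way split point, and the toy gradient matrix are all illustrative assumptions, not something taken from the paper or from a real training run.

```python
import numpy as np

def early_late_scores(grads, k=2):
    """Hypothetical helper: SVD a (n_params, n_steps) gradient matrix and
    report how much of each right singular vector's squared mass falls in
    the first half of training."""
    U, S, Vt = np.linalg.svd(grads, full_matrices=False)
    half = grads.shape[1] // 2
    energy = Vt[:k] ** 2                      # rows of Vt are right singular vectors
    early_frac = energy[:, :half].sum(axis=1) / energy.sum(axis=1)
    return U[:, :k], S[:k], early_frac

# Toy data (an assumption for illustration): 20 parameters, 100 steps.
# Parameters 0-9 receive gradient only early, parameters 10-19 only late.
rng = np.random.default_rng(0)
grads = 0.01 * rng.standard_normal((20, 100))
grads[:10, :50] += 2.0    # "early" circuit, larger amplitude
grads[10:, 50:] += 1.0    # "late" circuit, smaller amplitude

U_k, S_k, early_frac = early_late_scores(grads, k=2)
# On this toy data, early_frac should be near 1 for the first component
# and near 0 for the second.
print(early_frac)
```

The two blocks are given different amplitudes on purpose: if their singular values were equal, the SVD could return any mixture of the two directions. Real training gradients will be far messier than this, so the clean split here says nothing about how well the separation works in practice.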
Then I want to verify that we have indeed found the right parameters by ablating the model’s tendency to go to the cheese after it has reached the top right corner.
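One way such an ablation could look, as a rough sketch only: take a left singular vector over the flattened parameters, pick the parameters with the largest absolute entries, and zero them. The helper ablate_top_params, the choice of zero-ablation (rather than, say, resetting to initialization values), and the frac threshold are all assumptions for illustration. Evaluating the ablated agent in the maze is left out.

```python
import numpy as np

def ablate_top_params(theta, u, frac=0.1):
    """Hypothetical helper: zero the fraction `frac` of parameters in the
    flat vector `theta` that have the largest |u| entries, where `u` is a
    left singular vector over the same flattened parameters."""
    n_ablate = int(round(frac * theta.size))
    idx = np.argsort(-np.abs(u))[:n_ablate]   # indices of largest-magnitude entries
    ablated = theta.copy()                    # leave the original parameters intact
    ablated[idx] = 0.0
    return ablated, idx

# Toy example (assumed): a singular vector supported on parameters 10-19.
theta = np.arange(1.0, 21.0)
u = np.zeros(20)
u[10:] = 1.0 / np.sqrt(10.0)

theta_ablated, ablated_idx = ablate_top_params(theta, u, frac=0.5)
print(sorted(ablated_idx.tolist()))
```

If the located parameters are the right ones, the ablated agent should still reach the top right corner but stop going to the cheese from there. If both behaviours degrade together, the singular vector is probably not isolating a single computation.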
It would also be interesting to see whether we can ablate its ability to go to the top right corner while keeping its ability to go to the cheese when the cheese is sufficiently close or the agent is already in the top right corner. However, this seems harder, and not as clearly possible even if we’ve found the correct parameters.
I might be missing something, but is there a reason you’re doing this via SVD on gradients, instead of SVD on weights?
Is there a reason to do this with SVD at all, instead of mechanistic interp methods like causal scrubbing/causal tracing/path patching or manual inspection of circuits?