Thanks for the question! This is not something we have included in our distribution, so I think our patching experiments aren't answering that question. If I were to speculate, though, I'd suggest:
The Prev Tok head 1.4 might “check” for a signature of “I am inside a function definition” (maybe via an L0 head that attends to the `def` keyword). This would make it work only on B_def, not B_dec.
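For illustration, here's roughly how one could look for that kind of signature with TransformerLens: check how much each L0 head attends back to the `def` token from later positions. This is only a sketch, with the model name and prompt as placeholder assumptions rather than the exact setup from the post:

```python
from transformer_lens import HookedTransformer

# Assumed model name for the 4-layer attention-only model; the prompt is illustrative.
model = HookedTransformer.from_pretrained("attn-only-4l")
prompt = 'def load(file, size, name):\n    """Open the'
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

str_tokens = model.to_str_tokens(prompt)
def_pos = str_tokens.index("def")  # position of the `def` keyword

# Layer-0 attention pattern: shape [n_heads, query_pos, key_pos]
pattern = cache["pattern", 0][0]
# Mean attention paid to `def` by all query positions after it, per head
attn_to_def = pattern[:, def_pos + 1 :, def_pos].mean(dim=-1)
for head, score in enumerate(attn_to_def):
    print(f"L0H{head}: mean attention to `def` = {score.item():.3f}")
```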
Duplicate Tok head 1.2 might help the mover heads by suppressing their attention to repeated tokens. We observed this (“Duplicate Token Head 1.2 is helping Argument Movers”), but were not confident whether it is important. When doing ACDC, we felt 1.2 wasn't actually that important (IIRC), but again this would depend on the distribution.
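If one wanted to probe this directly, a rough sketch (not one of our experiments; the model name, prompt, and mover-head index below are placeholder assumptions) would be to zero-ablate head 1.2 and compare a mover head's attention pattern with and without the ablation:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("attn-only-4l")  # assumed model name
# Illustrative prompt with a repeated argument name ("file")
prompt = 'def load(file, size, file):\n    """Open the\n\n    :param'
tokens = model.to_tokens(prompt)

def ablate_head_1_2(z, hook):
    # z has shape [batch, pos, head_index, d_head]; zero out head 2 in layer 1
    z[:, :, 2, :] = 0.0
    return z

def grab_pattern(pattern, hook, store):
    # pattern has shape [batch, head_index, query_pos, key_pos]
    store["pattern"] = pattern.detach().clone()

mover_layer, mover_head = 3, 0  # placeholder indices for one of the Argument Movers
clean, ablated = {}, {}

model.run_with_hooks(
    tokens,
    fwd_hooks=[(f"blocks.{mover_layer}.attn.hook_pattern",
                lambda p, hook: grab_pattern(p, hook, clean))],
)
model.run_with_hooks(
    tokens,
    fwd_hooks=[("blocks.1.attn.hook_z", ablate_head_1_2),
               (f"blocks.{mover_layer}.attn.hook_pattern",
                lambda p, hook: grab_pattern(p, hook, ablated))],
)

# Attention from the final (prediction) position, with and without head 1.2
print("clean:  ", clean["pattern"][0, mover_head, -1])
print("ablated:", ablated["pattern"][0, mover_head, -1])
```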
In summary, I can think of a range of possible mechanisms by which the model could achieve that, but our experiments don't test for them (because copying the 2nd token after B_dec would be equally bad for the clean and corrupted prompts).
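To spell out that last point with a toy calculation (the numbers are made up and the metric below is a generic normalised logit-diff, not necessarily our exact metric): an error mode that hurts the clean and corrupted runs by the same amount shifts both baselines together, so the normalised patching metric can't see it.

```python
def patching_metric(patched_logit_diff, clean_logit_diff, corrupted_logit_diff):
    # 1.0 = fully restored clean behaviour, 0.0 = corrupted behaviour
    return (patched_logit_diff - corrupted_logit_diff) / (
        clean_logit_diff - corrupted_logit_diff
    )

# Suppose copying the 2nd token after B_dec costs 0.5 logit-diff on *both* runs:
print(patching_metric(2.0, 3.0, 1.0))                     # without the error: 0.5
print(patching_metric(2.0 - 0.5, 3.0 - 0.5, 1.0 - 0.5))   # with the error: still 0.5
```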