Thanks for the comment!
I have spent some time doing mechanistic interpretability on GPT-Neo (which was trained without dropout) to try to answer whether compensation only occurs because of dropout. TL;DR: my current impression is that compensation still occurs in models trained without dropout, just to a lesser extent.
In more depth: when GPT-Neo is fed a sequence of tokens $t_1 t_2 \ldots t_{10} t_{11} t_{12} \ldots t_{20}$, where $t_1, \ldots, t_{10}$ are uniformly random and $t_i = t_{i-10}$ for $i \geq 11$, there are four heads in Layer 6 that show the induction attention pattern (i.e. they attend from $t_i$ to $t_{i-9}$). Ablating any one of three of these heads (6.0, 6.6, 6.11) decreases loss, while ablating the fourth (6.1) increases loss. Interestingly, once 6.1 is ablated, additionally ablating 6.0, 6.6, and 6.11 causes loss to increase (perhaps this is confusing; see this table!).
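For concreteness, here is a minimal sketch of how one could set this up with TransformerLens. The head/layer indices come from the text above; the model size (125M), the single-sequence batch, and the particular induction-score computation are my assumptions, not details from the experiment:

```python
import torch
from transformer_lens import HookedTransformer

torch.manual_seed(0)

# Assumption: the 125M checkpoint (the comment just says "GPT-Neo").
model = HookedTransformer.from_pretrained("EleutherAI/gpt-neo-125M")

# Build t_1 ... t_10 t_11 ... t_20 with t_1, ..., t_10 uniformly random
# and t_i = t_{i-10} for i >= 11.
prefix = torch.randint(0, model.cfg.d_vocab, (1, 10))
tokens = torch.cat([prefix, prefix], dim=1)  # shape [1, 20]

_, cache = model.run_with_cache(tokens)
pattern = cache["pattern", 6][0]  # [head, query_pos, key_pos]

# Induction score per head: attention from t_i to t_{i-9}, averaged
# over the repeated half (i = 11..20, i.e. 0-indexed positions 10..19).
query_pos = torch.arange(10, 20)
key_pos = query_pos - 9
scores = pattern[:, query_pos, key_pos].mean(dim=-1)
for head in range(model.cfg.n_heads):
    print(f"head 6.{head}: induction score {scores[head]:.3f}")
```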
My guess is that the model can use the outputs of 6.0, 6.6, and 6.11 differently in the two regimes (6.1 present vs. 6.1 ablated), so they “compensate” when 6.1 is ablated.
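And a sketch of the ablation comparison itself. Two assumptions on my part: zero-ablation (the ablation style isn't specified above), and restricting the loss to the repeated half of the sequence, to match the induction claim:

```python
import torch
from transformer_lens import HookedTransformer, utils

torch.manual_seed(0)
model = HookedTransformer.from_pretrained("EleutherAI/gpt-neo-125M")

prefix = torch.randint(0, model.cfg.d_vocab, (1, 10))
tokens = torch.cat([prefix, prefix], dim=1)

def repeated_half_loss(heads=()):
    """Loss on the repeated half, zero-ablating the given Layer 6 heads."""
    def zero_heads(z, hook):
        # z: [batch, pos, head_index, d_head]; zero these heads' outputs.
        z[:, :, list(heads), :] = 0.0
        return z
    loss = model.run_with_hooks(
        tokens,
        return_type="loss",
        loss_per_token=True,
        fwd_hooks=[(utils.get_act_name("z", 6), zero_heads)],
    )
    # loss[0, j] is the loss predicting token j+1; the repeated tokens
    # sit at positions 10..19, so their predictions are indices 9..18.
    return loss[0, 9:].mean().item()

print("baseline:              ", repeated_half_loss())
print("ablate 6.1:            ", repeated_half_loss([1]))
print("ablate 6.0, 6.6, 6.11: ", repeated_half_loss([0, 6, 11]))
print("ablate all four:       ", repeated_half_loss([0, 1, 6, 11]))
```

If the pattern described above holds, the last line should show a larger loss than the 6.1-only line, even though ablating 6.0, 6.6, and 6.11 on their own decreases loss.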