Thanks for the comment!
I have spent some time doing mechanistic interpretability on GPT-Neo (which was trained without dropout) to try to answer whether compensation only occurs because of dropout. TL;DR: my current impression is that compensation still occurs in models trained without dropout, just to a lesser extent.
In more depth: when GPT-Neo is fed a sequence of tokens $t_1 t_2 \ldots t_{10} t_{11} t_{12} \ldots t_{20}$, where $t_1, \ldots, t_{10}$ are uniformly random and $t_i = t_{i-10}$ for $i \geq 11$, there are four heads in Layer 6 that show the induction attention pattern (i.e. they attend from $t_i$ to $t_{i-9}$). Ablating any one of three of these heads (6.0, 6.6, 6.11) decreases loss, while ablating the fourth (6.1) increases loss. Interestingly, once 6.1 is ablated, additionally ablating 6.0, 6.6, and 6.11 causes loss to increase (perhaps this is confusing; see this table!).
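For concreteness, here is a minimal sketch of how one could set this up with TransformerLens. The head/layer indices come from the text above; the model size (125M), the single-sequence batch, and the particular induction-score computation are my assumptions, not details from the experiment:

```python
import torch
from transformer_lens import HookedTransformer

torch.manual_seed(0)

# Assumption: the 125M checkpoint (the comment just says "GPT-Neo").
model = HookedTransformer.from_pretrained("EleutherAI/gpt-neo-125M")

# Build t_1 ... t_10 t_11 ... t_20 with t_1, ..., t_10 uniformly random
# and t_i = t_{i-10} for i >= 11.
prefix = torch.randint(0, model.cfg.d_vocab, (1, 10))
tokens = torch.cat([prefix, prefix], dim=1)  # shape [1, 20]

_, cache = model.run_with_cache(tokens)
pattern = cache["pattern", 6][0]  # [head, query_pos, key_pos]

# Induction score per head: attention from t_i to t_{i-9}, averaged
# over the repeated half (i = 11..20, i.e. 0-indexed positions 10..19).
query_pos = torch.arange(10, 20)
key_pos = query_pos - 9
scores = pattern[:, query_pos, key_pos].mean(dim=-1)
for head in range(model.cfg.n_heads):
    print(f"head 6.{head}: induction score {scores[head]:.3f}")
```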
My guess is that the model can use the outputs of 6.0, 6.6, and 6.11 differently in the two regimes (6.1 present vs. 6.1 ablated), so they “compensate” when 6.1 is ablated.
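And a sketch of the ablation comparison itself. Two assumptions on my part: zero-ablation (the ablation style isn't specified above), and restricting the loss to the repeated half of the sequence, to match the induction claim:

```python
import torch
from transformer_lens import HookedTransformer, utils

torch.manual_seed(0)
model = HookedTransformer.from_pretrained("EleutherAI/gpt-neo-125M")

prefix = torch.randint(0, model.cfg.d_vocab, (1, 10))
tokens = torch.cat([prefix, prefix], dim=1)

def repeated_half_loss(heads=()):
    """Loss on the repeated half, zero-ablating the given Layer 6 heads."""
    def zero_heads(z, hook):
        # z: [batch, pos, head_index, d_head]; zero these heads' outputs.
        z[:, :, list(heads), :] = 0.0
        return z
    loss = model.run_with_hooks(
        tokens,
        return_type="loss",
        loss_per_token=True,
        fwd_hooks=[(utils.get_act_name("z", 6), zero_heads)],
    )
    # loss[0, j] is the loss predicting token j+1; the repeated tokens
    # sit at positions 10..19, so their predictions are indices 9..18.
    return loss[0, 9:].mean().item()

print("baseline:              ", repeated_half_loss())
print("ablate 6.1:            ", repeated_half_loss([1]))
print("ablate 6.0, 6.6, 6.11: ", repeated_half_loss([0, 6, 11]))
print("ablate all four:       ", repeated_half_loss([0, 1, 6, 11]))
```

If the pattern described above holds, the last line should show a larger loss than the 6.1-only line, even though ablating 6.0, 6.6, and 6.11 on their own decreases loss.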