Ablating during randomly sampled openwebtext forward passes yields essentially random effects. This fits with circuit activation being quite contextual. But it’s disappointing, again, that we don’t see zero effect on off-distribution contexts.
This seems pretty important, and I’m not quite clear what you’re saying was done, or the results were like — could you expand on this?
I sampled hundreds of short context snippets from openwebtext and measured ablation effects averaged over the resulting forward passes. Across those hundreds of passes, I didn’t see any real signal in the logit effects, just a layer of noise introduced by the ablations.
More could definitely be done on this front. I just tried something relatively quickly that fit inside of GPU memory and wanted to report it here.
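To make the measurement concrete, here is a minimal sketch of the kind of averaging described above, assuming zero-ablation of a single feature. The real experiment ran on a transformer over openwebtext snippets; here a toy random linear readout stands in for the network, and names like `avg_ablation_effect` are illustrative, not from the original work.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_ablation_effect(model, snippets, feature_idx):
    """Mean absolute logit shift from zero-ablating one feature,
    averaged over many sampled forward passes."""
    diffs = []
    for x in snippets:
        base = model(x)              # logits with the feature intact
        x_abl = x.copy()
        x_abl[feature_idx] = 0.0     # zero-ablate the chosen feature
        diffs.append(np.abs(model(x_abl) - base).mean())
    return float(np.mean(diffs))

# Toy stand-in "model": a random linear readout from 50 feature
# activations to 10 logits (the real setting is a full transformer).
W = rng.normal(size=(50, 10))
model = lambda x: x @ W

# Hundreds of sampled "forward passes" (random activations here,
# openwebtext snippets in the actual experiment).
snippets = [rng.normal(size=50) for _ in range(300)]

effect = avg_ablation_effect(model, snippets, feature_idx=3)
```

The claim in the comment is that for a real model, `effect` on off-distribution contexts is nonzero but carries no interpretable signal: it looks like noise whose magnitude reflects the crosstalk between features rather than the ablated feature's function.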
So this suggests that, if you ablate a random feature, then even in contexts where that feature doesn’t apply, the ablation will still have some (apparently random) effect on the model’s emitted logits. That would imply there is generally some crosstalk/interdependency between features, and that to some extent “(almost) everything depends on (almost) everything else” — would that be your interpretation?
If so, that’s not entirely surprising for a system that relies on only approximate orthogonality, but it could be inconvenient. For example, it suggests that any security/alignment procedure that depended on effectively ablating a large number of specific circuits (once we had identified such circuits in need of ablation) might introduce a level of noise that presumably scales with the number of circuits ablated, and might require, for example, some subsequent finetuning on a broad corpus to restore previous levels of broad model performance?