So this suggests that if you ablate a random feature, then even in contexts where that feature doesn’t apply, doing so will have some (apparently random) effect on the model’s emitted logits. In other words, there is generally some crosstalk/interdependence between features, such that to some extent “(almost) everything depends on (almost) everything else”. Would that be your interpretation?
If so, that’s not entirely surprising for a system that relies on only approximate orthogonality, but it could be inconvenient. For example, it suggests that any security/alignment procedure that depended on ablating a large number of specific circuits (once we had identified circuits in need of ablation) might introduce a level of noise that presumably scales with the number of circuits ablated, and might therefore require some subsequent finetuning on a broad corpus to restore previous levels of general model performance.
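To make the intuition concrete, here is a minimal toy sketch (not the actual experimental setup) of why ablating a nominally inactive feature can still shift the logits when feature directions are only approximately orthogonal. Everything here is hypothetical and randomly generated: `W_unembed`, `feature_dirs`, and the sparse-activation construction are stand-ins, not anything from a real model.

```python
# Toy sketch: ablate one "feature" direction from a hidden state and
# measure how the logits move in a context where that feature is
# nominally inactive. All tensors are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_vocab, n_features = 64, 1000, 512

W_unembed = rng.normal(size=(d_model, n_vocab)) / np.sqrt(d_model)

# Overcomplete, only approximately orthogonal feature directions.
feature_dirs = rng.normal(size=(n_features, d_model))
feature_dirs /= np.linalg.norm(feature_dirs, axis=1, keepdims=True)

# A hidden state built from a sparse subset of "active" features.
active = rng.choice(n_features, size=10, replace=False)
h = feature_dirs[active].sum(axis=0)

# Ablate a feature that is NOT in the active set by projecting it out.
inactive = next(i for i in range(n_features) if i not in set(active))
d = feature_dirs[inactive]
h_ablated = h - (h @ d) * d

delta_logits = (h_ablated - h) @ W_unembed
print("max |delta logit| from ablating an inactive feature:",
      np.abs(delta_logits).max())
# Nonzero because the "inactive" direction overlaps slightly with the
# active ones, so projecting it out also perturbs their contributions.
```

On these assumptions, each such ablation injects a small perturbation into directions it was never meant to touch, which is the sense in which noise might plausibly accumulate as more circuits are ablated.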