I recently applied causal scrubbing to test the hypothesis outlined in the paper (as part of my work at Redwood Research). The hypothesis was derived from the circuit presented in Figure 2. I used a simple setting, similar to the experiments on Induction Heads, with two types of inputs:
x_ref, the correct input for the circuit.
x_scrub, an input with the same template but a randomized subject and indirect object, used as input for the paths not included in the circuit.
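For concreteness, here is a minimal Python sketch of how such an input pair could be generated. The template and name pool are illustrative placeholders, not the exact ones used in the experiments:

```python
import random

# Illustrative name pool and IOI-style template (placeholders, not the paper's exact set).
NAMES = ["John", "Mary", "Tom", "Alice", "Bob", "Sarah"]
TEMPLATE = "Then, {io} and {s} went to the store. {s} gave a drink to"

def make_ref():
    """x_ref: a correct IOI sentence, on which the circuit is run."""
    io, s = random.sample(NAMES, 2)
    return TEMPLATE.format(io=io, s=s)

def make_scrub():
    """x_scrub: same template, but with the subject and indirect object
    re-randomized. Fed to every path NOT included in the circuit."""
    io, s = random.sample(NAMES, 2)  # fresh names, independent of x_ref
    return TEMPLATE.format(io=io, s=s)

x_ref, x_scrub = make_ref(), make_scrub()
```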
Results
Experiment 1
I allowed all MLPs on every path of the circuit. The only attention heads left non-scrubbed in this hypothesis are the ones from the circuit, split by keys, queries, values, and position as in the circuit diagram.
This experiment directly addresses our claim in the paper, as we did not study MLPs (i.e., they always act as black boxes in our experiments).
The logit difference of the scrubbed model is 1.854±2.127 (mean±std), 50%±57% of the original logit difference.
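To make the scrubbing operation concrete, here is a toy sketch of the resampling ablation underlying this experiment. It is deliberately simplified to node level and uses fake activations (the component names and the stand-in `activations` function are mine); the real experiment operates on paths, with heads split by queries/keys/values and position:

```python
import numpy as np
from zlib import crc32

# Hypothetical component names, purely for illustration.
COMPONENTS = ["head_9.6_q", "head_9.6_k", "head_9.6_v", "mlp_0"]
CIRCUIT = {"head_9.6_q", "mlp_0"}  # parts the hypothesis keeps on x_ref

def activations(x):
    # Stand-in for running the model on x and caching per-component activations.
    rng = np.random.default_rng(crc32(x.encode()))
    return {c: rng.normal(size=4) for c in COMPONENTS}

def scrubbed_activations(x_ref, x_scrub):
    ref, scrub = activations(x_ref), activations(x_scrub)
    # Circuit components keep their x_ref activations;
    # everything else is resampled from x_scrub.
    return {c: ref[c] if c in CIRCUIT else scrub[c] for c in COMPONENTS}
```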
Experiment 2
I connected Name Movers’ keys and values to MLP0 at the IO and S1 positions. All the paths from the embeddings to these MLP0s are allowed.
I allowed all the paths involving all attention heads and MLPs from the embeddings to:
The queries of S-Inhibition Heads at the END position.
The queries of the Duplicate Token Heads and Induction Heads at the S2 position.
The keys and values of the Duplicate Token Heads at the S1 and IO positions.
The values of the Previous Token Heads at the S1 and IO positions.
Inside the circuit, only the direct interactions between heads are preserved.
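Spelled out, this hypothesis could be written as a map from each receiving input (split by q/k/v and position) to the senders whose direct output it is allowed to see on x_ref. The encoding below is my own shorthand; the last two entries are example in-circuit edges (my reading of Figure 2), not an exhaustive edge list:

```python
# Hypothetical encoding of the experiment-2 hypothesis:
# (receiver head class, input type, position) -> allowed direct senders.
ALLOWED_SENDERS = {
    ("s_inhibition", "q", "END"): {"embed", "all_heads", "all_mlps"},
    ("dup_token", "q", "S2"):     {"embed", "all_heads", "all_mlps"},
    ("induction", "q", "S2"):     {"embed", "all_heads", "all_mlps"},
    ("dup_token", "kv", "S1"):    {"embed", "all_heads", "all_mlps"},
    ("dup_token", "kv", "IO"):    {"embed", "all_heads", "all_mlps"},
    ("prev_token", "v", "S1"):    {"embed", "all_heads", "all_mlps"},
    ("prev_token", "v", "IO"):    {"embed", "all_heads", "all_mlps"},
    ("name_mover", "kv", "IO"):   {"mlp_0"},  # MLP0 feeds Name Mover keys/values
    ("name_mover", "kv", "S1"):   {"mlp_0"},
    # Inside the circuit, only direct head-to-head edges are kept, e.g.:
    ("s_inhibition", "v", "S2"):  {"induction", "dup_token"},
    ("name_mover", "q", "END"):   {"s_inhibition"},
}
```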
The logit difference of the scrubbed model is 0.831±2.127 (22%±64% of the original logit difference).
Comments
How to interpret these numbers? I don’t really know. My best guess is that the circuit we presented in the paper is one of the best small sets of paths to look at to describe how GPT-2 small achieves high logit difference on IOI. However, in absolute terms, many more paths matter that we don’t have a good way to describe succinctly.
The measures are extremely noisy. This is consistent with the logit difference measured on the original model, where the standard deviation was around 30% of the mean. However, I ran causal scrubbing on a dataset with enough samples (N=100) that the mean values are interpretable. I don’t understand the source of this noise, but I suspect it is caused by different names triggering different internal structures inside GPT-2 small.
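For reference, the reported numbers are simple sample statistics over the dataset. A sketch with placeholder arrays (the original-model mean of roughly 3.7 is inferred from the quoted percentages, not taken from the paper):

```python
import numpy as np

N = 100
rng = np.random.default_rng(0)
# Placeholder per-prompt logit differences; real values come from the model.
scrubbed_ld = rng.normal(1.854, 2.127, size=N)
original_mean = 3.7  # inferred from 1.854 being ~50% of the original

mean, std = scrubbed_ld.mean(), scrubbed_ld.std()
print(f"{mean:.3f}±{std:.3f} "
      f"({100 * mean / original_mean:.0f}%±{100 * std / original_mean:.0f}%)")
```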
How does it compare to the validations from the paper? The closest validation score from the paper is the faithfulness score: the logit difference of the model where all the nodes not in the circuit are mean-ablated. In the paper, we report a score of 87% of the logit difference. I think the discrepancies with the causal scrubbing results from experiment 1 come from the fact that i) resampling ablation is more destructive than mean ablation in this case, and ii) causal scrubbing is stricter, as it selects paths and not nodes (e.g., in the faithfulness tests, Name Mover Heads at the END position see the correct output of Induction Heads at the S2 position; this is not the case in causal scrubbing). The results of the causal scrubbing experiments update me toward thinking that our explanation is not as good as I thought from our faithfulness score, but not low enough to strongly question the claims from the paper.
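The two ablations compared above differ in a small but important way. A toy contrast on cached activations of shape (n_samples, d), with function names of my own choosing:

```python
import numpy as np

def mean_ablate(acts, i):
    """Replace sample i's activation with the dataset mean (faithfulness test)."""
    return acts.mean(axis=0)

def resample_ablate(acts, i, rng):
    """Replace sample i's activation with that of another random sample
    (causal scrubbing). This preserves per-sample structure that the mean
    averages away, which can make it more destructive."""
    j = rng.choice([k for k in range(len(acts)) if k != i])
    return acts[j]

rng = np.random.default_rng(0)
acts = rng.normal(size=(100, 8))
print(mean_ablate(acts, 0)[:3], resample_ablate(acts, 0, rng)[:3])
```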