My hypothesis about what’s going on here, apologies if it’s already ruled out, is that we should not think of it separately computing the XOR of A and B, but rather that features A and B are computed slightly differently when the other feature is off or on. In a high dimensional space, if the vector (A|B) and the vector (A|¬B) are slightly different, then as long as this difference is systematic, this should be sufficient to successfully probe for A⊕B.
For example, if A and B each rely on a sizeable number of different attention heads to pull the information over, they will have some attention heads which participate in both of them, and they would ‘compete’ in the softmax, where if head C is used in both writing features A and B, it will contribute less to writing feature A if it is also being used to pull across feature B, and so the representation of A will be systematically different depending on the presence of B.
It’s harder to draw the exact picture for MLPs but I think similar interdependencies can occur there though I don’t have an exact picture of how, interested to discuss and can try and sketch it out if people are curious. Probably would be like, neurons will participate in both, neurons which participate in A and B will be more saturated if B is active than if B is not active, so the output representation of A will be somewhat dependent on B.
More generally, I expect the computation of features to be ‘good enough’ but still messy and somewhat dependent on which other features are present because this kludginess allows them to pack more computation into the same number of layers than if the features were computed totally independently.
Well the substance of the claim is that when a model is calculating lots of things in superposition, these kinds of XORs arise naturally as a result of interference, so one thing to do might be to look at a small algorithmic dataset of some kind where there’s a distinct set of features to learn and no reason to learn the XORs and see if you can still probe for them. It’d be interesting to see if there are some conditions under which this is/isn’t true, e.g. if needing to learn more features makes the dependence between their calculation higher and the XORs more visible.
Maybe you could also go a bit more mathematical and hand-construct a set of weights which calculates a set of features in superposition so you can totally rule out any model effort being expended on calculating XORs and then see if they’re still probe-able.
Another thing you could do is to zero-out or max-ent the neurons/attention heads that are important for calculating the A feature, and see if you can still detect an A⊕B feature. I’m less confident in this because it might be too strong and delete even a ‘legitimate’ A⊕B feature or too weak and leave some signal in.
This kind of interference also predicts that the A|B and A|¬B features should be similar and so the degree of separation/distance from the category boundary should be small. I think you’ve already shown this to some extent with the PCA stuff though some quantification of the distance to boundary would be interesting. Even if the model was allocating resource to computing these XORs you’d still probably expect them to be much less salient though so not sure if this gives much evidence either way.
You can get ~75% just by computing the or. But we found that only at the last layer and step16000 of Pythia-70m training it achieves better than 75%, see this video
My hypothesis about what’s going on here, apologies if it’s already ruled out, is that we should not think of it separately computing the XOR of A and B, but rather that features A and B are computed slightly differently when the other feature is off or on. In a high dimensional space, if the vector (A|B) and the vector (A|¬B) are slightly different, then as long as this difference is systematic, this should be sufficient to successfully probe for A⊕B.
For example, if A and B each rely on a sizeable number of different attention heads to pull the information over, they will have some attention heads which participate in both of them, and they would ‘compete’ in the softmax, where if head C is used in both writing features A and B, it will contribute less to writing feature A if it is also being used to pull across feature B, and so the representation of A will be systematically different depending on the presence of B.
It’s harder to draw the exact picture for MLPs but I think similar interdependencies can occur there though I don’t have an exact picture of how, interested to discuss and can try and sketch it out if people are curious. Probably would be like, neurons will participate in both, neurons which participate in A and B will be more saturated if B is active than if B is not active, so the output representation of A will be somewhat dependent on B.
More generally, I expect the computation of features to be ‘good enough’ but still messy and somewhat dependent on which other features are present because this kludginess allows them to pack more computation into the same number of layers than if the features were computed totally independently.
Neat hypothesis! Do you have any ideas for how one would experimentally test this?
Well the substance of the claim is that when a model is calculating lots of things in superposition, these kinds of XORs arise naturally as a result of interference, so one thing to do might be to look at a small algorithmic dataset of some kind where there’s a distinct set of features to learn and no reason to learn the XORs and see if you can still probe for them. It’d be interesting to see if there are some conditions under which this is/isn’t true, e.g. if needing to learn more features makes the dependence between their calculation higher and the XORs more visible.
Maybe you could also go a bit more mathematical and hand-construct a set of weights which calculates a set of features in superposition so you can totally rule out any model effort being expended on calculating XORs and then see if they’re still probe-able.
Another thing you could do is to zero-out or max-ent the neurons/attention heads that are important for calculating the A feature, and see if you can still detect an A⊕B feature. I’m less confident in this because it might be too strong and delete even a ‘legitimate’ A⊕B feature or too weak and leave some signal in.
This kind of interference also predicts that the A|B and A|¬B features should be similar and so the degree of separation/distance from the category boundary should be small. I think you’ve already shown this to some extent with the PCA stuff though some quantification of the distance to boundary would be interesting. Even if the model was allocating resource to computing these XORs you’d still probably expect them to be much less salient though so not sure if this gives much evidence either way.
Would you expect that we can extract xors from small models like pythia-70m under your hypothesis?
Yeah I’d expect some degree of interference leading to >50% success on XORs even in small models.
You can get ~75% just by computing the or. But we found that only at the last layer and step16000 of Pythia-70m training it achieves better than 75%, see this video