Interpretability methods like SAEs often treat models as if their residual stream represents a bag of concepts. But how does this account for binding ('red cube, blue sphere' vs. the unordered bag 'red, blue, cube, sphere')? Shouldn't we search for (subject, predicate, object) representations instead?
You could imagine a world where the model handles binding mostly via the token index and grammar rules. I.e. ‘red cube, blue sphere’ would have a ‘red’ feature at token t, ‘cube’ feature at token t+1, ‘blue’ feature at t+2, and ‘sphere’ feature at t+3, with contributions like ‘cube’ at t+2 being comparatively subdominant or even nonexistent.
I don’t think I really believe this. But if you want to stick to a picture where features are directions, with no further structure of consequence in the activation space, you can do that, at least on paper.
Is this compatible with the actual evidence about activation structure we have? I don’t know. I haven’t come across any systematic investigations into this yet. But I’d guess probably not.
Relevant. Section 3 is the one I found interesting.
If you wanted to check for matrix binding like this in real models, you could maybe do it by training an SAE with a restricted output matrix. Instead of each dictionary element being independent, you demand that $W_{\text{out}}$ for your SAE can be written as $W_{\text{out}} = \left[\, W'_{\text{out}} \;\; (1+A)\,W'_{\text{out}} \,\right]$, where $W_{\text{out}} \in \mathbb{R}^{d \times d_{\text{SAE}}}$, $A \in \mathbb{R}^{d \times d}$, and $W'_{\text{out}} \in \mathbb{R}^{d \times d_{\text{SAE}}/2}$. So, we demand that the second half of the SAE dictionary is just some linear transform of the first half.
That'd be the setup for pairs. For three slots, add a second binding matrix $B$ in the same way, $W_{\text{out}} = \left[\, W'_{\text{out}} \;\; (1+A)\,W'_{\text{out}} \;\; (1+B)\,W'_{\text{out}} \,\right]$ with $W'_{\text{out}} \in \mathbb{R}^{d \times d_{\text{SAE}}/3}$, and so on.
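To make the constraint concrete, here is a minimal PyTorch sketch of a decoder tied this way, for the pair case only. The class name, initialisation, and hyperparameters are made up for illustration; this is a toy, not a tuned SAE implementation.

```python
# Toy SAE whose decoder's second half of the dictionary is constrained to be
# (1 + A) times the first half, i.e. W_out = [ W'_out  (1 + A) W'_out ].
import torch
import torch.nn as nn


class TiedBindingSAE(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        assert d_sae % 2 == 0, "need an even dictionary size for two slots"
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        # Free half of the dictionary: shape (d_model, d_sae // 2).
        self.W_dec_half = nn.Parameter(torch.randn(d_model, d_sae // 2) * 0.01)
        # Binding matrix A: shape (d_model, d_model).
        self.A = nn.Parameter(torch.zeros(d_model, d_model))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def decoder(self) -> torch.Tensor:
        # W_out = [ W'_out  (1 + A) W'_out ], shape (d_model, d_sae).
        bound_half = self.W_dec_half + self.A @ self.W_dec_half
        return torch.cat([self.W_dec_half, bound_half], dim=1)

    def forward(self, x: torch.Tensor):
        f = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)  # sparse codes
        x_hat = f @ self.decoder().T + self.b_dec                   # reconstruction
        return x_hat, f


# Usage: train with the usual reconstruction + L1 sparsity objective.
sae = TiedBindingSAE(d_model=512, d_sae=4096)
x = torch.randn(32, 512)
x_hat, f = sae(x)
loss = (x_hat - x).pow(2).mean() + 1e-3 * f.abs().mean()
```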
(To be clear, I’m also not that optimistic about this sort of sparse coding + matrix binding model for activation space. I’ve come to think that activations-first mech interp is probably the wrong way to approach things in general. But it’d still be a neat thing for someone to check.)
Thanks for the link and suggestions!
I quickly tested whether SigLIP or CLIP embeddings show evidence of attribute binding, and they don't (though n=1 image): an image of a red cube next to a blue sphere, compared with the texts "red cube next to blue sphere" and "blue cube next to red sphere", doesn't get a higher similarity score for the correct label than for the wrong one (CLIP, SigLIP).
Nice quick check!
Just to be clear: is this for the actual full models, or for the 'model embeddings', as in a comparison right after the embedding layer?
This is for the full models. I simply used both models on Replicate and gave one image and two text labels as input: CLIP, SigLIP.
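For reference, here is roughly what that check looks like run locally, as a sketch using Hugging Face `transformers` rather than Replicate. The checkpoint name and image path are placeholders, not the exact models used above; the SigLIP version would be analogous with its corresponding classes.

```python
# Binding check: one image, two captions that differ only in which colour is
# bound to which shape. Checkpoint and image path are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("red_cube_blue_sphere.png")  # placeholder image path
texts = ["red cube next to blue sphere", "blue cube next to red sphere"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (1, 2). If the model binds attributes to objects,
# the correct caption should get a noticeably higher score than the wrong one.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```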