You could imagine a world where the model handles binding mostly via the token index and grammar rules. I.e. ‘red cube, blue sphere’ would have a ‘red’ feature at token t, ‘cube’ feature at token t+1, ‘blue’ feature at t+2, and ‘sphere’ feature at t+3, with contributions like ‘cube’ at t+2 being comparatively subdominant or even nonexistent.
I don’t think I really believe this. But if you want to stick to a picture where features are directions, with no further structure of consequence in the activation space, you can do that, at least on paper.
Is this compatible with the actual evidence about activation structure we have? I don’t know. I haven’t come across any systematic investigations into this yet. But I’d guess probably not.
Relevant. Section 3 is the one I found interesting.
If you wanted to check for matrix binding like this in real models, you could maybe do it by training an SAE with a restricted output matrix. Instead of each dictionary element being independent, you demand that $W_{out}$ for your SAE can be written as $W_{out} = (1+A)\,W'_{out}$, where $W_{out} \in \mathbb{R}^{d \times d_{SAE}}$, $A \in \mathbb{R}^{d \times d}$, and $W'_{out} \in \mathbb{R}^{d \times d_{SAE}/2}$. So, we demand that the second half of the SAE dictionary is just some linear transform of the first half.
That’d be the setup for pairs. Go $W_{out} = (1+A+B)\,W'_{out}$ for three slots, and so on.
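Concretely, here is a minimal PyTorch sketch of what the pair version could look like. This is my own reading of the constraint, taking “the second half of the dictionary is a linear transform of the first half” to mean the decoder columns are the concatenation of $W'_{out}$ and $(1+A)W'_{out}$; the class and parameter names are just illustrative, not from any existing SAE library.

```python
import torch
import torch.nn as nn

class PairBoundSAE(nn.Module):
    """SAE whose decoder's second half is constrained to be (I + A) times its first half."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        assert d_sae % 2 == 0, "need an even dictionary size to split into two slots"
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        # Free half of the dictionary: W'_out in R^{d x d_sae/2}
        self.W_out_prime = nn.Parameter(torch.randn(d_model, d_sae // 2) * 0.02)
        # Binding matrix A in R^{d x d}
        self.A = nn.Parameter(torch.zeros(d_model, d_model))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def decoder_matrix(self) -> torch.Tensor:
        # Second half of the dictionary = (I + A) W'_out, i.e. a linear transform of the first half.
        second_half = (torch.eye(self.A.shape[0], device=self.A.device) + self.A) @ self.W_out_prime
        return torch.cat([self.W_out_prime, second_half], dim=1)  # shape (d, d_sae)

    def forward(self, x: torch.Tensor):
        f = torch.relu(x @ self.W_enc + self.b_enc)           # sparse codes, shape (batch, d_sae)
        x_hat = f @ self.decoder_matrix().T + self.b_dec      # reconstruction, shape (batch, d_model)
        return x_hat, f
```

For three slots, one way to read the $(1+A+B)$ shorthand is to split the dictionary into thirds and decode the extra slots with separate binding matrices $A$ and $B$.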
(To be clear, I’m also not that optimistic about this sort of sparse coding + matrix binding model for activation space. I’ve come to think that activations-first mech interp is probably the wrong way to approach things in general. But it’d still be a neat thing for someone to check.)
Thanks for the link and suggestions!
I quickly tested whether SigLIP or CLIP embeddings show evidence of attribute binding, and they don’t (though n=1 image): an image of a red cube with a blue sphere, compared against the texts “red cube next to blue sphere” and “blue cube next to red sphere”, doesn’t get a higher similarity score for the correct caption than for the wrong one (CLIP, SigLIP).
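If someone wants to rerun this kind of check locally rather than on Replicate, here is a rough sketch using Hugging Face transformers’ CLIP. The image path is a placeholder; a SigLIP version would swap in SiglipModel and SiglipProcessor, and the exact prompts may need tweaking per model.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("red_cube_blue_sphere.png")  # placeholder path
captions = ["a red cube next to a blue sphere",  # correct binding
            "a blue cube next to a red sphere"]  # swapped attributes

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape (1, 2): image-text similarity scores

print(dict(zip(captions, logits[0].tolist())))
# If binding information survives into the joint embedding space,
# the correct caption should score noticeably higher than the swapped one.
```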
Nice quick check!
Just to be clear: This is for the actual full models? Or for the ‘model embeddings’ as in you’re doing a comparison right after the embedding layer?
This is for the full models. I simply used both models on Replicate and gave one image and two text labels as input: CLIP, SigLIP.
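To spell out the distinction (my own sketch, not exactly what the Replicate runs do): the full-model check compares image and text in the joint embedding space, whereas an embedding-layer-only comparison would look at the text tower’s token embeddings before any attention has run, e.g. via output_hidden_states in transformers.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

captions = ["a red cube next to a blue sphere",
            "a blue cube next to a red sphere"]
tokens = tokenizer(captions, return_tensors="pt", padding=True)

with torch.no_grad():
    # Full text tower: final pooled text features, the thing compared against image features above.
    full_features = model.get_text_features(**tokens)
    # Embedding-layer-only view: hidden_states[0] is the output of the embedding layer,
    # before any transformer blocks have mixed information across token positions.
    hidden = model.text_model(**tokens, output_hidden_states=True).hidden_states[0]
    embed_features = hidden.mean(dim=1)  # crude bag-of-token-embeddings summary

for name, feats in [("full model", full_features), ("embedding layer", embed_features)]:
    sim = torch.nn.functional.cosine_similarity(feats[0], feats[1], dim=0)
    print(name, "cosine similarity between the two captions:", sim.item())
```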