Thanks for the link and suggestions!
I quickly tested whether SigLIP or CLIP embeddings show evidence of attribute binding, and they don't (though n=1 image): an image of a red cube next to a blue sphere, compared with the texts "red cube next to blue sphere" and "blue cube next to red sphere", does not get a higher similarity score for the correct caption than for the incorrect one (CLIP, SigLIP).
This is for the full models: I simply ran both models on Replicate, giving one image and the two text captions as input: CLIP, SigLIP
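For anyone who wants to reproduce this locally rather than via Replicate, here is a minimal sketch of the same binding test using Hugging Face's CLIP. The model name `openai/clip-vit-base-patch32` is my choice (not necessarily the checkpoint Replicate serves), and the flat red image is a placeholder for the actual cube/sphere render.

```python
# Sketch of the attribute-binding test: score one image against two captions
# that differ only in how the color attributes are bound to the objects.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


def binding_scores(image, captions, model_name="openai/clip-vit-base-patch32"):
    """Return softmaxed CLIP similarity of one image against each caption."""
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    outputs = model(**inputs)
    # logits_per_image has shape (1, num_captions)
    return outputs.logits_per_image.softmax(dim=-1).detach().numpy()[0]


if __name__ == "__main__":
    # Placeholder image; swap in the real red-cube/blue-sphere render.
    img = Image.new("RGB", (224, 224), (200, 30, 30))
    captions = ["red cube next to blue sphere",
                "blue cube next to red sphere"]
    for caption, prob in zip(captions, binding_scores(img, captions)):
        print(f"{prob:.3f}  {caption}")
```

If binding were represented, the correct caption should consistently win this comparison across renders; per the result above, it doesn't.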