Another related work: Concept Algebra for Text-Controlled Vision Models (Disclosure: while I did not author this paper, I am in the PhD lab that did, under Victor Veitch at UChicago. Any mistakes made in this comment are my own). We haven’t prioritized a blog post about the paper, so it makes sense that this community isn’t familiar with it.
The Concept Algebra paper demonstrates that for text-to-image models like Stable Diffusion, there exist linear subspaces in the score embedding space on which you can do the same kind of concept editing/control as with word2vec.
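To make the subspace idea concrete, here is a minimal NumPy sketch of one way to read "editing on a linear subspace": keep a representation unchanged off the concept subspace and swap in another representation's component on it. This is my own illustration, not the paper's code. The vectors are random stand-ins for score representations, `basis` is a hypothetical concept subspace, and the names `projector` and `edit_concept` are made up for this example; how such a subspace would actually be estimated is left out.

```python
import numpy as np

def projector(basis):
    """Orthogonal projector onto the column span of `basis` (shape d x k)."""
    q, _ = np.linalg.qr(basis)  # orthonormal basis for the same span
    return q @ q.T

def edit_concept(rep_source, rep_target, basis):
    """Keep rep_source off the concept subspace; use rep_target's component on it."""
    p = projector(basis)
    return rep_source - p @ rep_source + p @ rep_target

rng = np.random.default_rng(0)
d = 8
basis = rng.normal(size=(d, 2))   # hypothetical 2-dimensional concept subspace
source = rng.normal(size=d)       # stand-in for the representation being edited
target = rng.normal(size=d)       # stand-in for a representation with the desired concept value

edited = edit_concept(source, target, basis)
p = projector(basis)
```

By construction, `edited` agrees with `target` inside the subspace and with `source` everywhere orthogonal to it, which is the sense in which only the one concept is changed.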
Importantly, the paper comes with some theoretical investigation into why this might be the case, including articulating the necessary assumptions/conditions (which this purely empirical post does not).
I conjecture that <some activation additions in this post fail to have the desired effect> because they violate conditions analogous to those in Concept Algebra. Looking at section E.1 in the appendix gives me a bit of déjà vu: it shows some empirical results that fail to behave as expected when the conditions of completeness and causal separability don’t hold.
Seems very related: Linear Spaces of Meanings: Compositional Structures in Vision-Language Models. Notably, the (approximate) compositionality of language/reality should bode well for the scalability of linear activation engineering methods.
And this structure can be used as regularization for soft prompts.