You might be interested in Concept Algebra for (Score-Based) Text-Controlled Generative Models, which both uses a somewhat similar empirical methodology for concept editing and provides theoretical reasons to expect the linear representation hypothesis to hold. (I’d also interpret the findings here, and those from other recent works like Anthropic’s sleeper probes, as evidence for the linear representation hypothesis more broadly.)