You might enjoy Concept Algebra for (Score-Based) Text-Controlled Generative Models (and probably other papers / videos from Victor Veitch’s group), which tries to give something like a theoretical explanation for the linear representation hypothesis. Some of the discussion in the reviews / rebuttals for that paper is also relevant, e.g.:
‘Causal Separability: The intuitive idea here is that the separability of factors of variation boils down to whether there are “non-ignorable” interactions in the structural equation model that generates the output from the latent factors of variation (hence the name). The formal Definition 3.2 relaxes this causal requirement to distributional assumptions. We have added its causal interpretation in the camera-ready version.
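As a toy illustration of that intuition (my own sketch, not from the paper or rebuttal): two latent factors are causally separable when the structural equation generating the output contains no term coupling them, and a “non-ignorable” interaction is exactly such a coupling.

```python
import numpy as np

rng = np.random.default_rng(0)

def separable_sem(z1, z2):
    # Each latent factor acts through its own additive mechanism;
    # nothing couples z1 and z2, so the two factors are causally separable.
    return np.sin(z1) + 0.5 * z2 + 0.1 * rng.normal()

def non_separable_sem(z1, z2):
    # The z1 * z2 term is a "non-ignorable" interaction: how z2 affects the
    # output now depends on z1, so the two factors cannot be manipulated
    # independently in the distribution this model generates.
    return np.sin(z1) + 0.5 * z2 + 2.0 * z1 * z2 + 0.1 * rng.normal()
```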
Application to Other Generative Models: Ultimately, the results in the paper are about non-parametric representations (indeed, the results are about the structure of probability distributions directly!). The importance of diffusion models is that they non-parametrically model the conditional distribution, so that the score representation directly inherits the properties of the distribution.
To apply the results to other generative models, we must articulate the connection between the natural representations of these models (e.g., the residual stream in transformers) and the (estimated) conditional distributions. For autoregressive models like Parti, it’s not immediately clear how to do this. This is an exciting and important direction for future work!
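To be concrete about why diffusion models inherit this structure (a standard score-matching fact, not specific to this paper): the denoiser of a text-conditioned diffusion model is trained so that

$$ s_\theta(x_t, t, c) \;\approx\; \nabla_{x_t} \log p_t(x_t \mid c), $$

so whatever additive structure the conditional distribution has across concepts shows up directly in the model’s score representation. For an autoregressive model the analogous object would be $\log p(y_{1:T} \mid c) = \sum_t \log p(y_t \mid y_{<t}, c)$, and relating that quantity to the model’s internal representation is the missing link described above.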
(Very speculatively: models with finite-dimensional representations are often trained with objective functions corresponding to log likelihoods of exponential family probability models, such that the natural finite-dimensional representation corresponds to the natural parameter of the exponential family model. In exponential family models, the Stein score is exactly the inner product of the natural parameter with y. This weakly suggests that additive subspace structure may originate in these models following the same Stein score representation arguments!)
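One way to spell out that speculation (my reconstruction, using the standard exponential family form with sufficient statistic $T(y) = y$):

$$ \log p(y \mid x) = \langle \eta(x),\, y \rangle - A(\eta(x)) + \log h(y), \qquad \nabla_y \log p(y \mid x) = \eta(x) + \nabla_y \log h(y). $$

The input enters the log-likelihood only through the inner product $\langle \eta(x), y \rangle$, and the input-dependent part of the Stein score is exactly the natural parameter $\eta(x)$, i.e., the model’s finite-dimensional representation. So any subspace structure present in the conditional distributions would have to be carried by $\eta(x)$ itself.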
Connection to Interpretability: This is a great question! Indeed, a major motivation for starting this line of work is to try to understand whether the “linear subspace hypothesis” in mechanistic interpretability of transformers is true, and why it arises if so. As just discussed, the missing step for precisely connecting our results to this line of work is articulating how the finite-dimensional transformer representation (the residual stream) relates to the log probability of the conditional distributions. Solving this missing step would presumably allow the tool set developed here to be brought to bear on the interpretation of transformers.
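One concrete candidate for that missing step (my sketch, assuming only the standard softmax readout rather than anything established in the paper): the next-token distribution of a transformer is a categorical exponential family whose natural parameter is a linear function of the final residual stream,

$$ \log p(y \mid x) = \langle u_y,\, r(x) \rangle - \log \sum_{y'} \exp \langle u_{y'},\, r(x) \rangle, $$

where $r(x)$ is the final-position residual stream and $u_y$ is the unembedding vector for token $y$. Whether the paper’s subspace arguments survive this finite-dimensional, softmax-normalized parameterization is exactly the open question.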
One exciting observation here is that linear subspace structure appears to be a generic feature of probability distributions! Much mechanistic interpretability work motivates the linear subspace hypothesis by appealing to special structure of the transformer architecture (e.g., this is Anthropic’s usual explanation). In contrast, our results suggest that linear encoding may fundamentally be about the structure of the data generating process.
Limitations: One important thing to note is that the causal separability assumption is required for the concepts to be separable in the conditional distribution itself. This is a fundamental restriction on what concepts can be learned by any method that (approximately) learns a conditional distribution. That is, it is a limitation of the data generating process, not something special to concept algebra or even diffusion models.
Now, it is true that to find the concept subspace using prompts we have to be able to find prompts that elicit causally separable concepts. However, this is not so onerous: because sex and species are not separable, we can’t elicit the sex concept with “buck” and “doe”. But the prompts “a woman” and “a man” work well.’
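To make the prompt-pair recipe concrete, here is a rough sketch of the kind of procedure concept algebra suggests (my paraphrase, built around a hypothetical score_model(x, prompt) interface rather than the authors’ code): estimate the concept subspace from score differences between prompts that vary only the target concept, then change the score only within that subspace at sampling time.

```python
import numpy as np

def estimate_concept_subspace(score_model, xs, prompts, k=1):
    """Estimate a concept subspace from score differences across prompts.

    score_model(x, prompt) is assumed to return the estimated conditional
    score as a 1-D array; prompts differ only in the target concept
    (e.g. "a man", "a woman").  Returns a (d, k) matrix whose orthonormal
    columns span the estimated subspace.  Hypothetical interface, for
    illustration only.
    """
    base = prompts[0]
    diffs = np.stack([
        score_model(x, p) - score_model(x, base)
        for x in xs
        for p in prompts[1:]
    ])
    # Top-k principal directions of the score differences.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k].T

def edited_score(score_model, x, prompt_keep, prompt_new, basis):
    """Keep everything about prompt_keep except the target concept, which is
    taken from prompt_new, by moving the score only inside the subspace."""
    proj = basis @ basis.T  # orthogonal projection onto the concept subspace
    s_keep = score_model(x, prompt_keep)
    s_new = score_model(x, prompt_new)
    return s_keep + proj @ (s_new - s_keep)
```

The “a woman” / “a man” pair would be the prompts argument here; a “buck” / “doe” pair would not isolate the sex concept, for exactly the reason given above.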