I feel a bit behind on everything going on in alignment, so for the next few weeks (or more) I’ll focus on catching up on whatever I find interesting. I’ll be using my shortform to record my thoughts.
I make no promises that reading this is worth anyone’s time.
Linda’s alignment reading adventures part 1
What to focus on?
I do have some opinions on which alignment directions are more or less promising. I’ll probably venture in other directions too, but my main focus is going to be around what I expect an alignment solution to look like.
I think that to have an aligned AI it is necessary (but not sufficient) that we have shared abstractions/ontology/concepts (whatever you want to call them) with the AI.
I think the way to make progress on the above is to understand what ontology/concepts/abstractions our current AIs are using, and the process that shapes these abstractions.
I think the way to do this is through mech-interp, mixed with philosophising and theorising. Currently I think the mech-interp part (i.e. looking at what is actually going on in a network) is the bottleneck, since philosophising without data (i.e. agent foundations) has not made much progress lately.
Conclusion:
I’ll mainly focus on reading up on mech-interp and related areas such as dev-interp. I’ve started on the interp section of Lucius’s alignment reading list.
I should also read some John Wentworth, since his plan is pretty close to the path I think is most promising.
Feel free to throw other recommendations at me.
Some thoughts on things I’ve read so far
I just read
Understanding and controlling a maze-solving policy network
and half of Book Review: Design Principles of Biological Circuits (I’ll finish it soon)
I really liked Understanding and controlling a maze-solving policy network. It’s a good experiment and a good writeup.
But also, how interesting is this, really? Basically, they removed the cheese observation, and it made the agent act as if there were no cheese. This is not some sophisticated steering technique that we can use to align the AI’s motivations.
I discussed this with Lucius, who pointed out that the interesting result is this: the cheese location information is linearly separable from other information in the middle of the network. I.e. it’s not scrambled in a completely opaque way.
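To make “linearly separable” concrete, here is a minimal linear-probe sketch (hypothetical file names; assumes mid-network activations and the true cheese location for each maze have already been extracted):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical pre-extracted data:
# activations: (n_mazes, d) activations from a middle layer of the network
# cheese_cell: (n_mazes,) integer label for the cheese's grid cell
activations = np.load("mid_layer_activations.npy")
cheese_cell = np.load("cheese_grid_cells.npy")

X_train, X_test, y_train, y_test = train_test_split(
    activations, cheese_cell, test_size=0.2, random_state=0
)

# A purely linear read-out: high held-out accuracy means the cheese
# location is linearly decodable at this layer, i.e. not opaquely scrambled.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```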
Which brings me to Book Review: Design Principles of Biological Circuits:

“Alon’s book is the ideal counterargument to the idea that organisms are inherently human-opaque: it directly demonstrates the human-understandable structures which comprise real biological systems.”
Both these posts are evidence for the hypothesis that we should expect evolved networks to be modular, in a way that is possible for us to decode.
By “evolved” I mean things in the same category as natural selection and gradient descent.
You might enjoy Concept Algebra for (Score-Based) Text-Controlled Generative Models (and probably other papers / videos from Victor Veitch’s group), which tries to come up with something like a theoretical explanation for the linear representation hypothesis. Some of the discussion in the reviews / rebuttals for that paper is also relevant, e.g.:
‘Causal Separability: The intuitive idea here is that the separability of factors of variation boils down to whether there are “non-ignorable” interactions in the structural equation model that generates the output from the latent factors of variation—hence the name. The formal definition 3.2 relaxes this causal requirement to distributional assumptions. We have added its causal interpretation in the camera-ready version.
Application to Other Generative Models: Ultimately, the results in the paper are about non-parametric representations (indeed, the results are about the structure of probability distributions directly!) The importance of diffusion models is that they non-parametrically model the conditional distribution, so that the score representation directly inherits the properties of the distribution.
To apply the results to other generative models, we must articulate the connection between the natural representations of these models (e.g., the residual stream in transformers) and the (estimated) conditional distributions. For autoregressive models like Parti, it’s not immediately clear how to do this. This is an exciting and important direction for future work!
(Very speculatively: models with finite dimensional representations are often trained with objective functions corresponding to log likelihoods of exponential family probability models, such that the natural finite dimensional representation corresponds to the natural parameter of the exponential family model. In exponential family models, the Stein score is exactly the inner product of the natural parameter with y. This weakly suggests that additive subspace structure may originate in these models following the same Stein score representation arguments!)
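Spelling out the claim in that parenthetical (my gloss, using the standard exponential family form with sufficient statistic $T(y) = y$):

$$
p(y \mid x) = h(y)\,\exp\!\big(\eta(x)^\top y - A(\eta(x))\big)
\quad\Rightarrow\quad
\nabla_y \log p(y \mid x) = \nabla_y \log h(y) + \eta(x),
$$

so the Stein score is affine in the natural parameter $\eta(x)$: concepts that shift $\eta(x)$ additively show up as additive directions in the score, mirroring the subspace argument for diffusion models.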
Connection to Interpretability: This is a great question! Indeed, a major motivation for starting this line of work is to try to understand if the “linear subspace hypothesis” in mechanistic interpretability of transformers is true, and why it arises if so. As just discussed, the missing step for precisely connecting our results to this line of work is articulating how the finite dimensional transformer representation (the residual stream) relates to the log probability of the conditional distributions. Solving this missing step would presumably allow the tool set developed here to be brought to bear on the interpretation of transformers.
One exciting observation here is that linear subspace structure appears to be a generic feature of probability distributions! Much mechanistic interpretability work motivates the linear subspace hypothesis by appealing to special structure of the transformer architecture (e.g., this is Anthropic’s usual explanation). In contrast, our results suggest that linear encoding may fundamentally be about the structure of the data generating process.
Limitations: One important thing to note: the causal separability assumption is required for the concepts to be separable in the conditional distribution itself. This is a fundamental restriction on what concepts can be learned by any method that (approximately) learns a conditional distribution. I.e., it’s a limitation of the data generating process, not special to concept algebra or even diffusion models.
Now, it is true that to find the concept subspace using prompts we have to be able to find prompts that elicit causally separable concepts. However, this is not so onerous—because sex and species are not separable, we can’t elicit the sex concept with “buck” and “doe”. But the prompts “a woman” and “a man” work well.’
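To make the concept-algebra operation concrete, here is a minimal numpy sketch of how I understand the edit from the paper (hypothetical function names; a real implementation would apply this to the diffusion model’s score at every denoising step):

```python
import numpy as np

def concept_subspace(scores, base_score, rank):
    """Estimate a basis for the concept subspace W from score differences
    across prompts that vary only the target concept
    (e.g. "a man" vs. "a woman")."""
    diffs = np.stack([s - base_score for s in scores])  # (k, d)
    # The top right-singular vectors of the differences span W.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:rank]  # (rank, d) orthonormal basis of W

def edit_score(s_orig, s_target, basis):
    """Concept-algebra edit: keep s_orig except on the concept subspace,
    where it is swapped out: s_edited = s_orig + P_W (s_target - s_orig)."""
    delta = s_target - s_orig
    return s_orig + basis.T @ (basis @ delta)  # add only the W-component
```

If the concepts are causally separable, sampling with the edited score should change only the targeted concept (e.g. sex) while everything outside W stays governed by the original prompt.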