I think the strongest reason to expect features-as-directions is when you have compositionality and decision points for the same stuff in the same layer. Compositionality means (and by “means” here I mean a sort of loose optimality argument, not an iron-clad guarantee) that the things getting composed can each be assigned distinguishable subspaces, and having to make a decision means there’s a single direction perpendicular to the decision boundary.
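To make the decision-boundary half of that concrete, here’s a minimal toy sketch (my construction, synthetic data, all names hypothetical, not anything from a real model): if a binary decision depends linearly on the activations, a linear probe trained on that decision recovers a single weight vector, and that vector is exactly the direction perpendicular to the decision boundary, i.e. the candidate feature direction.

```python
# Toy sketch: a linear readout's weight vector IS the single direction
# perpendicular to its decision boundary. Synthetic data, illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 32                                       # activation dimension
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)         # ground-truth feature direction

acts = rng.normal(size=(1000, d))            # fake "activations"
labels = (acts @ true_dir > 0).astype(int)   # decision uses one direction

probe = LogisticRegression().fit(acts, labels)
learned_dir = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# The probe recovers (up to sign) the direction the decision depends on.
print(abs(learned_dir @ true_dir))           # ~1.0
```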
If you have compositionality without decision boundaries, you can get compositional stuff that lives in higher-dimensional subspaces. If you have decisions without compositionality… hm, actually I’m having trouble imagining this. Maybe you get this if the manifold of the partially-processed data distribution isn’t of the same dimension on both sides of the decision boundary, or is otherwise especially badly behaved? (Or if you’re trying to generalize features across a wide swath of the data distribution, when those features might not actually be applicable across the whole domain.)
I’m not entirely sure I follow here; I’m thinking of compositionality as a feature of the format of a representation (Chris Olah has a good note on this here: https://transformer-circuits.pub/2023/superposition-composition/index.html).
I think whether we should expect one kind of representation or another is an interesting question, but ultimately an empirical one: there are some theoretical arguments for linear representations (basically, that they make it easy for NNs to make decisions based on them), but the biggest reason to believe in them is just that people genuinely have found lots of examples of linear mediators that seem quite robust (e.g. Golden Gate Claude, Neel’s work on refusal directions).
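For what it’s worth, here’s a cartoon of the recipe behind those linear-mediator results (refusal directions, Golden Gate Claude style steering): take the difference of mean activations between two contrastive prompt sets, and then add or project out that direction at inference. This is a toy numpy stand-in, assuming you could grab real residual-stream activations; the names and constants here are hypothetical.

```python
# Cartoon of the "difference of means" recipe behind linear mediators like
# refusal directions. Toy numpy stand-in for real activations; all names
# and magnitudes here are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
d = 64

# Pretend these came from running the model on two contrastive prompt sets.
acts_with_concept = rng.normal(size=(200, d)) + 2.0 * np.eye(d)[0]
acts_without_concept = rng.normal(size=(200, d))

# Candidate linear mediator: difference of the class means.
direction = acts_with_concept.mean(0) - acts_without_concept.mean(0)
direction /= np.linalg.norm(direction)

# "Steering": shift a fresh activation along the direction to turn the
# concept on, or project the direction out to turn it off.
h = rng.normal(size=d)
h_steered = h + 5.0 * direction
h_ablated = h - (h @ direction) * direction
```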
Yeah, I was probably equivocating confusingly between compositionality as a feature of the representation, and compositionality as a feature of the manifold that the data / activation distribution lives near.
If you imagine the manifold, then compositionality is the ability to have a coordinate system / decomposition where you can take two points on the manifold, do some operation like averaging or recombining their coordinates, and get a new point that’s still on the manifold. (I guess this making sense relies on the data / activation distribution not filling up the entire available space.)
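Here’s one toy way to cash that out (purely my own construction, not anything measured from a model): give each composed attribute its own orthogonal subspace, so the “manifold” is a product of per-attribute pieces. Then recombining the attribute-A coordinates of one point with the attribute-B coordinates of another lands you back on the manifold.

```python
# Toy version of "compositionality as recombination on the manifold":
# each attribute owns an orthogonal subspace, so the representation is a
# sum of per-attribute components. Purely illustrative construction.
import numpy as np

rng = np.random.default_rng(2)
d = 16
basis = np.linalg.qr(rng.normal(size=(d, d)))[0]  # random orthonormal basis
sub_a, sub_b = basis[:, :4], basis[:, 4:8]        # disjoint subspaces A, B

def encode(attr_a, attr_b):
    # A point on the "manifold" is a sum of per-attribute components.
    return sub_a @ attr_a + sub_b @ attr_b

x = encode(rng.normal(size=4), rng.normal(size=4))
y = encode(rng.normal(size=4), rng.normal(size=4))

# Recombination: x's A-coordinates with y's B-coordinates. Orthogonality
# lets us read each component off and reassemble, and the result is again
# a valid point of the same form, i.e. still "on the manifold".
z = sub_a @ (sub_a.T @ x) + sub_b @ (sub_b.T @ y)
```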