LLM activation space is spiky. This is not a novel idea but something I believe many mechanistic interpretability researchers are not aware of. Credit to Dmitry Vaintrob for making this idea clear to me, and to Dmitrii Krasheninnikov for inspiring this plot by showing me a similar plot in a setup with categorical features.
Under the superposition hypothesis, activations are linear combinations of a small number of features. This means there are discrete subspaces in activation space that are “allowed” (they can be written as the sum of a small number of features), while the remaining space is “disallowed” (points there would require many more than the typical number of features).[1]
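To make the “allowed” vs. “disallowed” distinction concrete, here is a minimal numerical sketch. The random unit-vector features and all sizes are illustrative assumptions of mine, not taken from the plot’s code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy sizes: 8 feature directions crammed into 3 activation dimensions.
d_vocab, d_embed, k = 8, 3, 2
features = rng.normal(size=(d_vocab, d_embed))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# An "allowed" activation: a linear combination of only k of the features.
active = rng.choice(d_vocab, size=k, replace=False)
allowed = rng.uniform(0.5, 1.5, size=k) @ features[active]

# A generic point is "disallowed": it does not lie on the k-dimensional span of a
# small set of features (here we only check the span of the chosen `active` set).
basis, _ = np.linalg.qr(features[active].T)           # orthonormal basis of that span
residual = lambda v: np.linalg.norm(v - basis @ (basis.T @ v))
print(residual(allowed))                   # ~0: lies exactly on the allowed subspace
print(residual(rng.normal(size=d_embed)))  # clearly nonzero: off the subspace
```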
Here’s a toy model (following TMS, d_vocab = 8 total features in a d_embed = 3-dimensional activation space, with k = 1, 2, 3 features allowed to be active simultaneously). Activation space is made up of discrete k-dimensional (intersecting) subspaces. My favourite image is the middle one (k = 2) showing planes in 3D activation space, because we expect 1 ≪ k ≪ d_embed in realistic settings.
(n_active in the plot corresponds to k here. Code here.)
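The linked code is the original; what follows is only my own rough sketch of how a figure like this can be generated, assuming random unit-vector feature directions and arbitrary coefficient ranges:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Toy setup as in the text: d_vocab = 8 random unit-vector features in d_embed = 3 dims.
d_vocab, d_embed = 8, 3
features = rng.normal(size=(d_vocab, d_embed))
features /= np.linalg.norm(features, axis=1, keepdims=True)

fig = plt.figure(figsize=(12, 4))
for panel, k in enumerate((1, 2, 3), start=1):
    # Sample many activations with exactly k active features each.
    points = []
    for _ in range(3000):
        active = rng.choice(d_vocab, size=k, replace=False)
        coeffs = rng.uniform(-1, 1, size=k)
        points.append(coeffs @ features[active])
    points = np.array(points)

    ax = fig.add_subplot(1, 3, panel, projection="3d")
    ax.scatter(*points.T, s=1, alpha=0.3)
    ax.set_title(f"n_active = {k}")   # lines, then planes, then the full space
plt.show()
```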
This picture predicts that interpolating between two activations should take you out-of-distribution relatively quickly (up to possibly some error correction) unless your interpolation (steering) direction exactly corresponds to a feature; a numerical sketch of this is given after the list below. I think this is relevant because:
it implies that the perturbed activations in my stable region experiment series [where we observe that models are robust to perturbations of their activations: 1, 2, 3, 4] should be quite severely out-of-distribution, which makes me even more confused about our results.
it predicts that activation steering takes the model severely out-of-distribution unless you pick a steering direction that is aligned with (a linear combination of) active feature directions.
it predicts that linear probing shouldn’t give you nice continuous results: probing along a feature direction should yield just interference noise most of the time (when the feature is inactive), and significant values only when the feature is active. In practice, however, we typically observe non-negligible probe scores for most tokens.[2]
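Here is the numerical sketch promised above. It illustrates both the interpolation and the probing predictions in the same toy setting; the sizes and random unit-vector features are illustrative assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(0)

# Larger toy setting: many features superposed into fewer dimensions (illustrative sizes).
d_vocab, d_embed, k = 1000, 100, 5
features = rng.normal(size=(d_vocab, d_embed))
features /= np.linalg.norm(features, axis=1, keepdims=True)

def sample_activation(force_feature=None):
    """Activation built from exactly k active features, optionally forcing one of them."""
    active = rng.choice(np.arange(1, d_vocab), size=k, replace=False)
    if force_feature is not None:
        active[0] = force_feature
    coeffs = rng.uniform(0.5, 1.5, size=k)
    return coeffs @ features[active], set(active.tolist())

# (1) Interpolation: the midpoint of two allowed activations lies in the span of the
#     union of their feature sets, so it generically needs ~2k features rather than k.
a1, s1 = sample_activation()
a2, s2 = sample_activation()
midpoint = 0.5 * (a1 + a2)
print("features needed for the midpoint:", len(s1 | s2))   # ~2k -> off the allowed region

# (2) Probing: projecting onto one feature direction gives only interference noise when
#     that feature is inactive, and a clearly larger value when it is active.
probe = features[0]
on  = [probe @ sample_activation(force_feature=0)[0] for _ in range(1000)]
off = [probe @ sample_activation()[0] for _ in range(1000)]
print("mean |score| with feature active:  ", np.mean(np.abs(on)))
print("mean |score| with feature inactive:", np.mean(np.abs(off)))
```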
[1] In the demo plots I assume exactly k features to be active. In reality we expect this to be a softer limit, for example L0 = 100 ± 20 features active, but I believe the qualitative conclusions still hold. The “allowed region” is just a bit softer, looking more like the union of, say, a bunch of roughly 80- to 120-dimensional subspaces.
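As a quick illustration of this softer limit (all numbers purely illustrative), sampling the number of active features instead of fixing it yields an allowed region made of subspaces of varying dimension:

```python
import numpy as np

rng = np.random.default_rng(0)

# Purely illustrative sizes; the point is only that L0 varies around ~100.
d_vocab, d_embed = 10_000, 768
features = rng.normal(size=(d_vocab, d_embed))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Softer version of "exactly k active": draw L0 from roughly Normal(100, 20), so
# activations lie on subspaces of ~80 to ~120 dimensions rather than exactly k.
l0 = int(np.clip(rng.normal(100, 20), 1, d_vocab))
active = rng.choice(d_vocab, size=l0, replace=False)
activation = rng.uniform(0.5, 1.5, size=l0) @ features[active]
```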
[2] There are various possible explanations of course, e.g. that we’re probing multiple features at once, or that the “deception feature” is just always active in these contexts (though consider these random Alpaca samples).