[Edit: most of the math here is wrong, see comments below. I mixed intuitions and math about the inner product and cosine similarity, which resulted in many errors, see Kaarel’s comment. I edited my comment to only talk about inner products.]
[Edit2: I had missed that averaging these orthogonal vectors doesn’t result in effective steering, which contradicts the linear explanation I give here, see Joseph’s comment.]
I think this might be mostly a feature of high-dimensional space rather than something about LLMs: even if you have “the true code steering unit vector” d, and then your method finds things which have inner product ~0.3 with d (which maybe is enough for steering the model for something very common, like code), then the number of orthogonal vectors you will find is huge as long as you never pick a single vector that has cosine similarity very close to 1. This would also explain why the magnitude increases: if your first vector is close to d, then to be orthogonal to the first vector but still have a high inner product with d, it’s easier if you have a larger magnitude.
More formally, if theta0 = alpha0 d + (1 - alpha0) noise0, where d is a unit vector, and alpha0 = <theta0, d>, then for theta1 to have inner product alpha1 with d while being orthogonal to theta0, you need alpha0*alpha1 + <noise0, noise1>(1-alpha0)(1-alpha1) = 0, which is very easy to achieve if alpha0 = 0.6 and alpha1 = 0.3, especially if noise1 has a big magnitude. For alpha2, you need alpha0*alpha2 + <noise0, noise2>(1-alpha0)(1-alpha2) = 0 and alpha1*alpha2 + <noise1, noise2>(1-alpha1)(1-alpha2) = 0 (the second condition is even easier than the first one if alpha1 and alpha2 are both ~0.3, and both noises are big). And because there is a huge amount of volume in high-dimensional space, it’s not that hard to find a big family of noise vectors.
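To make the inner-product version of this concrete, here is a minimal numpy sketch of the geometric point (nothing here is the post's actual setup: the dimension, family size, and target inner product of 0.3 are made-up stand-ins). It constructs vectors that are exactly pairwise orthogonal and all have the same inner product with d, at the cost of growing magnitudes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, target = 4096, 800, 0.3    # made-up: ambient dimension, family size, inner product with d

d = rng.normal(size=n)
d /= np.linalg.norm(d)           # stand-in for "the true code steering unit vector"

# Orthonormal basis of a random m-dimensional subspace (columns of Q).
Q, _ = np.linalg.qr(rng.normal(size=(n, m)))

# Rescale each column so its inner product with d is exactly `target`.
# Scaling preserves pairwise orthogonality, but the norms have to grow.
thetas = Q * (target / (d @ Q))

print(np.allclose(thetas.T @ d, target))              # True: every vector has inner product 0.3 with d
gram = thetas.T @ thetas
print(np.abs(gram - np.diag(np.diag(gram))).max())    # tiny (float error): exactly pairwise orthogonal
print(np.median(np.linalg.norm(thetas, axis=0)))      # tens, not 1: orthogonality here costs magnitude
```

None of this says anything about whether such vectors actually steer the model; it only illustrates that, if you are allowed to grow the magnitude, exact orthogonality plus a fixed inner product with d is geometrically cheap. If you additionally require unit norm, it stops being cheap, which is what the bound further down the thread quantifies.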
(Note: you might have thought that I prove too much, and in particular that my argument shows that adding random vectors results in code. But this is not the case: the volume of the space of vectors with inner product with d > 0.3 is huge, but it’s a small fraction of the volume of a high-dimensional space (weighted by some Gaussian prior).) [Edit: maybe this proves too much? It depends on the actual magnitude needed to influence the behavior and how big the random vectors you would draw are.]
But there is still a mystery I don’t fully understand: how is it possible to find so many “noise” vectors that don’t influence the output of the network much.
(Note: This is similar to how you can also find a huge amount of “imdb positive sentiment” directions in UQA when applying CCS iteratively (or any classification technique that relies on linear probing and doesn’t find anything close to the “true” mean-difference direction, see also INLP).)
I think most of the quantitative claims in the current version of the above comment are false/nonsense/[using terms non-standardly]. (Caveat: I only skimmed the original post.)
“if your first vector has cosine similarity 0.6 with d, then to be orthogonal to the first vector but still high cosine similarity with d, it’s easier if you have a larger magnitude”
If by ‘cosine similarity’ you mean what’s usually meant, which I take to be the cosine of the angle between two vectors, then the cosine only depends on the directions of vectors, not their magnitudes. (Some parts of your comment look like you meant to say ‘dot product’/‘projection’ when you said ‘cosine similarity’, but I don’t think making this substitution everywhere makes things make sense overall either.)
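(For concreteness, a tiny check of the distinction, with random stand-in vectors: the cosine is unchanged by rescaling, while the dot product is not.)

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=64), rng.normal(size=64)
cos = lambda x, y: x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

print(cos(a, b), cos(10 * a, b))   # equal (up to float rounding): cosine ignores magnitude
print(a @ b, (10 * a) @ b)         # the dot product scales with the magnitude
```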
“then your method finds things which have cosine similarity ~0.3 with d (which maybe is enough for steering the model for something very common, like code), then the number of orthogonal vectors you will find is huge as long as you never pick a single vector that has cosine similarity very close to 1”
For 0.3 in particular, the number of orthogonal vectors with at least that cosine with a given vector d is actually small. Assuming I calculated correctly, the number of e.g. pairwise-dot-prod-less-than-0.01 unit vectors with that cosine with a given vector is at most 23 (the ambient dimension does not show up in this upper bound). I provide the calculation later in my comment.
“More formally, if theta0 = alpha0 d + (1 - alpha0) noise0, where d is a unit vector, and alpha0 = cosine(theta0, d), then for theta1 to have alpha1 cosine similarity while being orthogonal, you need alpha0alpha1 + <noise0, noise1>(1-alpha0)(1-alpha1) = 0, which is very easy to achieve if alpha0 = 0.6 and alpha1 = 0.3, especially if nosie1 has a big magnitude.”
This doesn’t make sense. For alpha1 to be cos(theta1, d), you can’t freely choose the magnitude of noise1.
How many nearly-orthogonal vectors can you fit in a spherical cap?
Proposition. Let $d \in \mathbb{R}^n$ be a unit vector and let $\theta_1, \dots, \theta_m \in \mathbb{R}^n$ also be unit vectors such that they all sorta point in the $d$ direction, i.e., $\theta_i \cdot d \ge \delta$ for a constant $\delta > 0$ (I take you to have taken $\delta = 0.3$), and such that the $\theta_i$ are nearly orthogonal, i.e., $|\theta_i \cdot \theta_j| \le \epsilon$ for all $i \ne j$, for another constant $\epsilon > 0$. Assume also that $\epsilon < \delta^2$. Then $m \le \frac{2(1 - \delta^2)}{\delta^2 - \epsilon} + 1$.
Proof. We can decompose $\theta_i = \alpha_i d + \sqrt{1 - \alpha_i^2}\, u_i$, with $u_i$ a unit vector orthogonal to $d$; then $\alpha_i \ge \delta$. Given $\epsilon < \delta$, it’s a 3d geometry exercise to show that pushing all vectors to the boundary of the spherical cap around $d$ can only decrease each pairwise dot product; doing this gives a new collection of unit vectors $v_i = \delta d + \sqrt{1 - \delta^2}\, u_i$, still with $\epsilon \ge v_i \cdot v_j = \delta^2 + (1 - \delta^2)\, u_i \cdot u_j$. This implies that $u_i \cdot u_j \le -\frac{\delta^2 - \epsilon}{1 - \delta^2}$. Note that since $\epsilon < \delta^2$, the RHS is some negative constant. Consider $\left(\sum_i u_i\right)^2$. On the one hand, it has to be positive. On the other hand, expanding it, we get that it’s at most $m - \binom{m}{2}\frac{\delta^2 - \epsilon}{1 - \delta^2}$. From this, $0 \le 1 - \frac{m - 1}{2}\cdot\frac{\delta^2 - \epsilon}{1 - \delta^2}$, whence $m \le \frac{2(1 - \delta^2)}{\delta^2 - \epsilon} + 1$.
(acknowledgements: I learned this from some combination of Dmitry Vaintrob and https://mathoverflow.net/questions/24864/almost-orthogonal-vectors/24887#24887 )

For example, for $\delta = 0.3$ and $\epsilon = 0.01$, this gives $m \le 23$.
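(A quick numeric check, just plugging numbers into the bound as stated above:)

```python
def max_family_size(delta, eps):
    """Upper bound 2(1 - delta^2)/(delta^2 - eps) + 1 from the proposition above."""
    return 2 * (1 - delta**2) / (delta**2 - eps) + 1

print(max_family_size(0.3, 0.01))   # 23.75, i.e. at most 23 vectors
print(max_family_size(0.05, 0.0))   # 799.0: with exact orthogonality, ~800 vectors needs delta of about 0.05 or less
```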
(I believe this upper bound for the number of almost-orthogonal vectors is actually basically exactly met in sufficiently high dimensions — I can probably provide a proof (sketch) if anyone expresses interest.)
Remark. If $\epsilon > \delta^2$, then one starts to get exponentially many vectors in the dimension again, as one can see by picking a bunch of random vectors on the boundary of the spherical cap.
What about the philosophical point? (low-quality section)
Ok, the math seems to have issues, but does the philosophical point stand up to scrutiny? Idk, maybe — I haven’t really read the post to check relevant numbers or to extract all the pertinent bits to answer this well. It’s possible it goes through with a significantly smaller δ or if the vectors weren’t really that orthogonal or something. (To give a better answer, the first thing I’d try to understand is whether this behavior is basically first-order — more precisely, is there some reasonable loss function on perturbations on the relevant activation space which captures perturbations being coding perturbations, and are all of these vectors first-order perturbations toward coding in this sense? If the answer is yes, then there just has to be such a vector d — it’d just be the gradient of this loss.)
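To spell out the first-order point, here is a minimal sketch with a stand-in differentiable score (the name coding_score, the dimension, and the coefficients are all invented for illustration; the real question is whether an analogous score exists for "this perturbation makes the model write code"):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512                                   # hypothetical activation dimension
w = rng.normal(size=n)                    # direction hidden inside the stand-in score
A = 0.01 * rng.normal(size=(n, n))        # small second-order term

def coding_score(eps):
    """Stand-in for a differentiable 'how much does perturbation eps push toward code' score."""
    return w @ eps + eps @ A @ eps

# d = gradient of the score at zero perturbation (numerical, central differences).
h = 1e-4
d = np.array([(coding_score(h * e) - coding_score(-h * e)) / (2 * h) for e in np.eye(n)])

# To first order, coding_score(eps) ~ <d, eps>: a single direction governs the effect,
# so first-order "coding" perturbations all have non-trivial inner product with d.
probe = 0.1 * rng.normal(size=n)
print(coding_score(probe), d @ probe)     # approximately equal for small perturbations
```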
Hmm, with that we’d need δ≤0.05 to get 800 orthogonal vectors.[1] This seems pretty workable. If we take the MELBO vector magnitude change (7 → 20) as an indication of how much the cosine similarity changes, then this is consistent with δ=0.15 for the original vector. This seems plausible for a steering vector?

[1] Thanks to @Lucius Bushnaq for correcting my earlier wrong number.
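For the record, my reading of the arithmetic here (this is an interpretation, not something spelled out in the comment): the 0.05 comes from solving the bound above with ε≈0, and the 0.15 comes from assuming the component along d stays fixed while the norm grows from 7 to 20.

```python
import math

# Solve 2(1 - delta^2)/delta^2 + 1 >= 800 for delta (epsilon ~ 0, i.e. exactly orthogonal vectors):
print(math.sqrt(2 / 801))      # ~0.0500, hence "delta <= 0.05 to get 800 orthogonal vectors"

# If the component along d is unchanged while the norm grows from 7 to 20,
# the cosine shrinks by the same factor:
print(0.15 * 7 / 20)           # ~0.0525, so an original cosine of ~0.15 ends up near 0.05
```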
You’re right, I mixed intuitions and math about the inner product and cosine similarity, which resulted in many errors. I added a disclaimer at the top of my comment. Sorry for my sloppy math, and thank you for pointing it out.
I think my math is right if only looking at the inner product between d and theta, not at the cosine similarity. So I think my original intuition still holds.
If this were the case, wouldn’t you expect the mean of the code steering vectors to also be a good code steering vector? But in fact, Jacob says that this is not the case. Edit: Actually it does work when scaled—see nostalgebraist’s comment.
I think this still contradicts my model: mean_i(<d, theta_i>) = <d, mean_i(theta_i)>, therefore if the effect is linear, you would expect the mean to preserve the effect even if the random noise between the theta_i is greatly reduced.

Good catch. I had missed that. This suggests some non-linear stuff is happening.
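To see why this is a real tension for the linear story, here is a self-contained toy version of the signal-plus-noise picture from the top comment (all numbers made up): averaging keeps the component along d and cancels most of the noise, so if only the d-component mattered, the plain mean should already steer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, s, R = 4096, 800, 0.3, 20.0     # made-up: dimension, family size, d-component, noise norm

d = rng.normal(size=n)
d /= np.linalg.norm(d)

noise = rng.normal(size=(m, n))
noise -= np.outer(noise @ d, d)                               # noise orthogonal to d
noise *= R / np.linalg.norm(noise, axis=1, keepdims=True)     # fixed (large) noise magnitude
thetas = s * d + noise                                        # "signal + big noise" family

mean = thetas.mean(axis=0)
print(np.allclose(thetas @ d, s))   # True: every vector has inner product 0.3 with d
print(mean @ d)                     # ~0.3: averaging preserves the d-component ...
print(np.linalg.norm(mean))         # ... while the noise mostly cancels (norm ~ sqrt(s^2 + R^2/m) ~ 0.8)
```

So under this picture the unscaled mean should be a cleaner steering vector, not a worse one; the fact that it only works after rescaling suggests the magnitude is doing real work, not just the component along d.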
“But there is still a mystery I don’t fully understand: how is it possible to find so many “noise” vectors that don’t influence the output of the network much.”
In unrelated experiments I found that steering into a (uniform) random direction is much less effective than steering into a random direction sampled with the same covariance as the real activations. This suggests that there might be a lot of directions[1] that don’t influence the output of the network much. This was on GPT2 but I’d expect it to generalize to other Transformers.

[1] Though I don’t know how much space / what the dimensionality of that space is; I’m judging this by the “sensitivity curve” (how much steering is needed for a noticeable change in KL divergence).

Maybe you are right, since averaging and scaling does result in pretty good steering (especially for coding). See here.
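For concreteness, a minimal sketch of one way to sample "a random direction with the same covariance as the real activations" (this is my guess at the recipe, not necessarily the one used; acts is a stand-in for a matrix of collected activations, with GPT2-sized width):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for real data: acts should be a (num_samples, d_model) matrix of residual-stream
# activations collected from the model; here it is just correlated random data so the snippet runs.
acts = rng.normal(size=(10_000, 768)) @ rng.normal(size=(768, 768))

cov = np.cov(acts, rowvar=False)                              # empirical covariance of the activations
L = np.linalg.cholesky(cov + 1e-6 * np.eye(cov.shape[0]))     # small jitter for numerical safety

cov_matched_direction = L @ rng.normal(size=cov.shape[0])     # random direction with covariance ~ cov
uniform_direction = rng.normal(size=cov.shape[0])             # isotropic baseline for comparison
# Either would typically be rescaled to a fixed norm before being added as a steering vector.
```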