[Edit: most of the math here is wrong, see comments below. I mixed intuitions and math about the inner product and cosine similarity, which resulted in many errors, see Kaarel’s comment. I edited my comment to only talk about inner products.]
[Edit2: I had missed that averaging these orthogonal vectors doesn’t result in effective steering, which contradicts the linear explanation I give here, see Joseph’s comment.]
I think this might be mostly a feature of high-dimensional space rather than something about LLMs: even if you have “the true code steering unit vector” d, and then your method finds things which have inner product ~0.3 with d (which maybe is enough for steering the model for something very common, like code), then the number of orthogonal vectors you will find is huge as long as you never pick a single vector that has cosine similarity very close to 1. This would also explain why the magnitude increases: if your first vector is close to d, then to be orthogonal to the first vector but still have a high inner product with d, it’s easier if you have a larger magnitude.
More formally, if theta0 = alpha0 d + (1 - alpha0) noise0, where d is a unit vector, and alpha0 = <theta0, d>, then for theta1 to have inner product alpha1 with d while being orthogonal to theta0, you need alpha0*alpha1 + <noise0, noise1>(1-alpha0)(1-alpha1) = 0, which is very easy to achieve if alpha0 = 0.6 and alpha1 = 0.3, especially if noise1 has a big magnitude. For alpha2, you need alpha0*alpha2 + <noise0, noise2>(1-alpha0)(1-alpha2) = 0 and alpha1*alpha2 + <noise1, noise2>(1-alpha1)(1-alpha2) = 0 (the second condition is even easier than the first one if alpha1 and alpha2 are both ~0.3, and both noises are big). And because there is a huge amount of volume in high-dimensional space, it’s not that hard to find a big family of noise vectors.
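To make the inner-product version of this concrete, here is a minimal numpy sketch of the geometric point (nothing here is the post's actual setup: the dimension, family size, and target inner product of 0.3 are made-up stand-ins). It constructs vectors that are exactly pairwise orthogonal and all have the same inner product with d, at the cost of growing magnitudes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, target = 4096, 800, 0.3    # made-up: ambient dimension, family size, inner product with d

d = rng.normal(size=n)
d /= np.linalg.norm(d)           # stand-in for "the true code steering unit vector"

# Orthonormal basis of a random m-dimensional subspace (columns of Q).
Q, _ = np.linalg.qr(rng.normal(size=(n, m)))

# Rescale each column so its inner product with d is exactly `target`.
# Scaling preserves pairwise orthogonality, but the norms have to grow.
thetas = Q * (target / (d @ Q))

print(np.allclose(thetas.T @ d, target))              # True: every vector has inner product 0.3 with d
gram = thetas.T @ thetas
print(np.abs(gram - np.diag(np.diag(gram))).max())    # tiny (float error): exactly pairwise orthogonal
print(np.median(np.linalg.norm(thetas, axis=0)))      # tens, not 1: orthogonality here costs magnitude
```

None of this says anything about whether such vectors actually steer the model; it only illustrates that, if you are allowed to grow the magnitude, exact orthogonality plus a fixed inner product with d is geometrically cheap. If you additionally require unit norm, it stops being cheap, which is what the bound further down the thread quantifies.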
(Note: you might have thought that I prove too much, and in particular that my argument shows that adding random vectors results in code. But this is not the case: the volume of the space of vectors with inner product with d > 0.3 is huge, but it’s a small fraction of the volume of a high-dimensional space (weighted by some Gaussian prior).) [Edit: maybe this proves too much? It depends on the actual magnitude needed to influence the behavior and how big the random vectors you would draw are.]
But there is still a mystery I don’t fully understand: how is it possible to find so many “noise” vectors that don’t influence the output of the network much.
(Note: This is similar to how you can also find a huge amount of “imdb positive sentiment” directions in UQA when applying CCS iteratively (or any classification technique that relies on linear probing and doesn’t find anything close to the “true” mean-difference direction, see also INLP).)
I think most of the quantitative claims in the current version of the above comment are false/nonsense/[using terms non-standardly]. (Caveat: I only skimmed the original post.)
“if your first vector has cosine similarity 0.6 with d, then to be orthogonal to the first vector but still high cosine similarity with d, it’s easier if you have a larger magnitude”
If by ‘cosine similarity’ you mean what’s usually meant, which I take to be the cosine of the angle between two vectors, then the cosine only depends on the directions of vectors, not their magnitudes. (Some parts of your comment look like you meant to say ‘dot product’/‘projection’ when you said ‘cosine similarity’, but I don’t think making this substitution everywhere makes things make sense overall either.)
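(For concreteness, a tiny check of the distinction, with random stand-in vectors: the cosine is unchanged by rescaling, while the dot product is not.)

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=64), rng.normal(size=64)
cos = lambda x, y: x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

print(cos(a, b), cos(10 * a, b))   # equal (up to float rounding): cosine ignores magnitude
print(a @ b, (10 * a) @ b)         # the dot product scales with the magnitude
```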
“then your method finds things which have cosine similarity ~0.3 with d (which maybe is enough for steering the model for something very common, like code), then the number of orthogonal vectors you will find is huge as long as you never pick a single vector that has cosine similarity very close to 1”
For 0.3 in particular, the number of orthogonal vectors with at least that cosine with a given vector d is actually small. Assuming I calculated correctly, the number of e.g. pairwise-dot-prod-less-than-0.01 unit vectors with that cosine with a given vector is at most 23 (the ambient dimension does not show up in this upper bound). I provide the calculation later in my comment.
“More formally, if theta0 = alpha0 d + (1 - alpha0) noise0, where d is a unit vector, and alpha0 = cosine(theta0, d), then for theta1 to have alpha1 cosine similarity while being orthogonal, you need alpha0alpha1 + <noise0, noise1>(1-alpha0)(1-alpha1) = 0, which is very easy to achieve if alpha0 = 0.6 and alpha1 = 0.3, especially if nosie1 has a big magnitude.”
This doesn’t make sense. For alpha1 to be cos(theta1, d), you can’t freely choose the magnitude of noise1.
How many nearly-orthogonal vectors can you fit in a spherical cap?
Proposition. Let $d \in \mathbb{R}^n$ be a unit vector and let $\theta_1, \dots, \theta_m \in \mathbb{R}^n$ also be unit vectors such that they all sorta point in the $d$ direction, i.e., $\theta_i \cdot d \ge \delta$ for a constant $\delta > 0$ (I take you to have taken $\delta = 0.3$), and such that the $\theta_i$ are nearly orthogonal, i.e., $|\theta_i \cdot \theta_j| \le \epsilon$ for all $i \ne j$, for another constant $\epsilon > 0$. Assume also that $\epsilon < \delta^2$. Then $m \le \frac{2(1 - \delta^2)}{\delta^2 - \epsilon} + 1$.
Proof. We can decompose $\theta_i = \alpha_i d + \sqrt{1 - \alpha_i^2}\, u_i$, with $u_i$ a unit vector orthogonal to $d$; then $\alpha_i \ge \delta$. Given $\epsilon < \delta$, it’s a 3d geometry exercise to show that pushing all vectors to the boundary of the spherical cap around $d$ can only decrease each pairwise dot product; doing this gives a new collection of unit vectors $v_i = \delta d + \sqrt{1 - \delta^2}\, u_i$, still with $\epsilon \ge v_i \cdot v_j = \delta^2 + (1 - \delta^2)\, u_i \cdot u_j$. This implies that $u_i \cdot u_j \le -\frac{\delta^2 - \epsilon}{1 - \delta^2}$. Note that since $\epsilon < \delta^2$, the RHS is some negative constant. Consider $\left(\sum_i u_i\right)^2$. On the one hand, it has to be positive. On the other hand, expanding it, we get that it’s at most $m - \binom{m}{2}\frac{\delta^2 - \epsilon}{1 - \delta^2}$. From this, $0 \le 1 - \frac{m - 1}{2}\cdot\frac{\delta^2 - \epsilon}{1 - \delta^2}$, whence $m \le \frac{2(1 - \delta^2)}{\delta^2 - \epsilon} + 1$.
(acknowledgements: I learned this from some combination of Dmitry Vaintrob and https://mathoverflow.net/questions/24864/almost-orthogonal-vectors/24887#24887 )

For example, for $\delta = 0.3$ and $\epsilon = 0.01$, this gives $m \le 23$.
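(A quick numeric check, just plugging numbers into the bound as stated above:)

```python
def max_family_size(delta, eps):
    """Upper bound 2(1 - delta^2)/(delta^2 - eps) + 1 from the proposition above."""
    return 2 * (1 - delta**2) / (delta**2 - eps) + 1

print(max_family_size(0.3, 0.01))   # 23.75, i.e. at most 23 vectors
print(max_family_size(0.05, 0.0))   # 799.0: with exact orthogonality, ~800 vectors needs delta of about 0.05 or less
```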
(I believe this upper bound for the number of almost-orthogonal vectors is actually basically exactly met in sufficiently high dimensions — I can probably provide a proof (sketch) if anyone expresses interest.)
Remark. If $\epsilon > \delta^2$, then one starts to get exponentially many vectors in the dimension again, as one can see by picking a bunch of random vectors on the boundary of the spherical cap.
What about the philosophical point? (low-quality section)
Ok, the math seems to have issues, but does the philosophical point stand up to scrutiny? Idk, maybe — I haven’t really read the post to check relevant numbers or to extract all the pertinent bits to answer this well. It’s possible it goes through with a significantly smaller δ or if the vectors weren’t really that orthogonal or something. (To give a better answer, the first thing I’d try to understand is whether this behavior is basically first-order — more precisely, is there some reasonable loss function on perturbations on the relevant activation space which captures perturbations being coding perturbations, and are all of these vectors first-order perturbations toward coding in this sense? If the answer is yes, then there just has to be such a vector d — it’d just be the gradient of this loss.)
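To spell out the first-order point, here is a minimal sketch with a stand-in differentiable score (the name coding_score, the dimension, and the coefficients are all invented for illustration; the real question is whether an analogous score exists for "this perturbation makes the model write code"):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512                                   # hypothetical activation dimension
w = rng.normal(size=n)                    # direction hidden inside the stand-in score
A = 0.01 * rng.normal(size=(n, n))        # small second-order term

def coding_score(eps):
    """Stand-in for a differentiable 'how much does perturbation eps push toward code' score."""
    return w @ eps + eps @ A @ eps

# d = gradient of the score at zero perturbation (numerical, central differences).
h = 1e-4
d = np.array([(coding_score(h * e) - coding_score(-h * e)) / (2 * h) for e in np.eye(n)])

# To first order, coding_score(eps) ~ <d, eps>: a single direction governs the effect,
# so first-order "coding" perturbations all have non-trivial inner product with d.
probe = 0.1 * rng.normal(size=n)
print(coding_score(probe), d @ probe)     # approximately equal for small perturbations
```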
Hmm, with that we’d need δ≤0.05 to get 800 orthogonal vectors.[1] This seems pretty workable. If we take the MELBO vector magnitude change (7 → 20) as an indication of how much the cosine similarity changes, then this is consistent with δ=0.15 for the original vector. This seems plausible for a steering vector?

[1] Thanks to @Lucius Bushnaq for correcting my earlier wrong number.
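For the record, my reading of the arithmetic here (this is an interpretation, not something spelled out in the comment): the 0.05 comes from solving the bound above with ε≈0, and the 0.15 comes from assuming the component along d stays fixed while the norm grows from 7 to 20.

```python
import math

# Solve 2(1 - delta^2)/delta^2 + 1 >= 800 for delta (epsilon ~ 0, i.e. exactly orthogonal vectors):
print(math.sqrt(2 / 801))      # ~0.0500, hence "delta <= 0.05 to get 800 orthogonal vectors"

# If the component along d is unchanged while the norm grows from 7 to 20,
# the cosine shrinks by the same factor:
print(0.15 * 7 / 20)           # ~0.0525, so an original cosine of ~0.15 ends up near 0.05
```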
You’re right, I mixed intuitions and math about the inner product and cosine similarity, which resulted in many errors. I added a disclaimer at the top of my comment. Sorry for my sloppy math, and thank you for pointing it out.
I think my math is right if only looking at the inner product between d and theta, not at the cosine similarity. So I think my original intuition still holds.
If this were the case, wouldn’t you expect the mean of the code steering vectors to also be a good code steering vector? But in fact, Jacob says that this is not the case. Edit: Actually it does work when scaled—see nostalgebraist’s comment.
I think this still contradicts my model: mean_i(<d, theta_i>) = <d, mean_i(theta_i)>, therefore if the effect is linear, you would expect the mean to preserve the effect even if the random noise between the theta_i is greatly reduced.

Good catch. I had missed that. This suggests some non-linear stuff is happening.
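To see why this is a real tension for the linear story, here is a self-contained toy version of the signal-plus-noise picture from the top comment (all numbers made up): averaging keeps the component along d and cancels most of the noise, so if only the d-component mattered, the plain mean should already steer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, s, R = 4096, 800, 0.3, 20.0     # made-up: dimension, family size, d-component, noise norm

d = rng.normal(size=n)
d /= np.linalg.norm(d)

noise = rng.normal(size=(m, n))
noise -= np.outer(noise @ d, d)                               # noise orthogonal to d
noise *= R / np.linalg.norm(noise, axis=1, keepdims=True)     # fixed (large) noise magnitude
thetas = s * d + noise                                        # "signal + big noise" family

mean = thetas.mean(axis=0)
print(np.allclose(thetas @ d, s))   # True: every vector has inner product 0.3 with d
print(mean @ d)                     # ~0.3: averaging preserves the d-component ...
print(np.linalg.norm(mean))         # ... while the noise mostly cancels (norm ~ sqrt(s^2 + R^2/m) ~ 0.8)
```

So under this picture the unscaled mean should be a cleaner steering vector, not a worse one; the fact that it only works after rescaling suggests the magnitude is doing real work, not just the component along d.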
“But there is still a mystery I don’t fully understand: how is it possible to find so many “noise” vectors that don’t influence the output of the network much.”
In unrelated experiments I found that steering into a (uniform) random direction is much less effective than steering into a random direction sampled with the same covariance as the real activations. This suggests that there might be a lot of directions[1] that don’t influence the output of the network much. This was on GPT2 but I’d expect it to generalize to other Transformers.

[1] Though I don’t know how much space / what the dimensionality of that space is; I’m judging this by the “sensitivity curve” (how much steering is needed for a noticeable change in KL divergence).

Maybe you are right, since averaging and scaling does result in pretty good steering (especially for coding). See here.
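For concreteness, a minimal sketch of one way to sample "a random direction with the same covariance as the real activations" (this is my guess at the recipe, not necessarily the one used; acts is a stand-in for a matrix of collected activations, with GPT2-sized width):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for real data: acts should be a (num_samples, d_model) matrix of residual-stream
# activations collected from the model; here it is just correlated random data so the snippet runs.
acts = rng.normal(size=(10_000, 768)) @ rng.normal(size=(768, 768))

cov = np.cov(acts, rowvar=False)                              # empirical covariance of the activations
L = np.linalg.cholesky(cov + 1e-6 * np.eye(cov.shape[0]))     # small jitter for numerical safety

cov_matched_direction = L @ rng.normal(size=cov.shape[0])     # random direction with covariance ~ cov
uniform_direction = rng.normal(size=cov.shape[0])             # isotropic baseline for comparison
# Either would typically be rescaled to a fixed norm before being added as a steering vector.
```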