I think most of the quantitative claims in the current version of the above comment are false/nonsense/[using terms non-standardly]. (Caveat: I only skimmed the original post.)
“if your first vector has cosine similarity 0.6 with d, then to be orthogonal to the first vector but still high cosine similarity with d, it’s easier if you have a larger magnitude”
If by ‘cosine similarity’ you mean what’s usually meant, which I take to be the cosine of the angle between two vectors, then the cosine only depends on the directions of vectors, not their magnitudes. (Some parts of your comment look like you meant to say ‘dot product’/‘projection’ when you said ‘cosine similarity’, but I don’t think making this substitution everywhere makes things make sense overall either.)
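To make the point concrete, here is a minimal numerical sketch (just numpy on random placeholder vectors, nothing from the post): scaling a vector changes its dot product with d but leaves its cosine similarity with d unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d = rng.normal(size=512)
d /= np.linalg.norm(d)           # unit "target" direction
theta = rng.normal(size=512)     # some other vector

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for scale in [1.0, 3.0, 10.0]:
    v = scale * theta
    print(f"scale={scale:5.1f}  dot={v @ d:9.3f}  cos={cosine(v, d):.6f}")
# the dot product grows linearly with the magnitude; the cosine column stays constant
```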
“then your method finds things which have cosine similarity ~0.3 with d (which maybe is enough for steering the model for something very common, like code), then the number of orthogonal vectors you will find is huge as long as you never pick a single vector that has cosine similarity very close to 1”
For 0.3 in particular, the number of pairwise nearly-orthogonal vectors with at least that cosine with a given vector d is actually small. Assuming I calculated correctly, the number of e.g. pairwise-dot-prod-less-than-0.01 unit vectors with that cosine with a given vector is at most 23 (the ambient dimension does not show up in this upper bound). I provide the calculation later in my comment.
“More formally, if theta0 = alpha0 d + (1 - alpha0) noise0, where d is a unit vector, and alpha0 = cosine(theta0, d), then for theta1 to have alpha1 cosine similarity while being orthogonal, you need alpha0alpha1 + <noise0, noise1>(1-alpha0)(1-alpha1) = 0, which is very easy to achieve if alpha0 = 0.6 and alpha1 = 0.3, especially if noise1 has a big magnitude.”
This doesn’t make sense. For alpha1 to be cos(theta1, d), you can’t freely choose the magnitude of noise1.
How many nearly-orthogonal vectors can you fit in a spherical cap?
Proposition. Let $d \in \mathbb{R}^n$ be a unit vector and let $\theta_1, \ldots, \theta_m \in \mathbb{R}^n$ also be unit vectors such that they all sorta point in the $d$ direction, i.e., $\theta_i \cdot d \ge \delta$ for a constant $\delta > 0$ (I take you to have taken $\delta = 0.3$), and such that the $\theta_i$ are nearly orthogonal, i.e., $|\theta_i \cdot \theta_j| \le \epsilon$ for all $i \ne j$, for another constant $\epsilon > 0$. Assume also that $\epsilon < \delta^2$. Then $m \le 2\,\frac{1-\delta^2}{\delta^2-\epsilon} + 1$.
Proof. We can decompose $\theta_i = \alpha_i d + \sqrt{1-\alpha_i^2}\, u_i$, with $u_i$ a unit vector orthogonal to $d$; then $\alpha_i \ge \delta$. Given $\epsilon < \delta$, it’s a 3d geometry exercise to show that pushing all vectors to the boundary of the spherical cap around $d$ can only decrease each pairwise dot product; doing this gives a new collection of unit vectors $v_i = \delta d + \sqrt{1-\delta^2}\, u_i$, still with $\epsilon \ge v_i \cdot v_j = \delta^2 + (1-\delta^2)\, u_i \cdot u_j$. This implies that $u_i \cdot u_j \le -\frac{\delta^2-\epsilon}{1-\delta^2}$. Note that since $\epsilon < \delta^2$, the RHS is some negative constant. Consider $\left(\sum_i u_i\right)^2$. On the one hand, it has to be nonnegative. On the other hand, expanding it, we get that it’s at most $m - \binom{m}{2}\frac{\delta^2-\epsilon}{1-\delta^2}$. From this, $0 \le 1 - \frac{m-1}{2}\cdot\frac{\delta^2-\epsilon}{1-\delta^2}$, whence $m \le 2\,\frac{1-\delta^2}{\delta^2-\epsilon} + 1$.

(acknowledgements: I learned this from some combination of Dmitry Vaintrob and https://mathoverflow.net/questions/24864/almost-orthogonal-vectors/24887#24887 )
For example, for $\delta = 0.3$ and $\epsilon = 0.01$, this gives $m \le 23$.
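If it helps, here is the bound as a throwaway snippet (the function name is just for illustration):

```python
def max_nearly_orthogonal(delta, eps):
    # m <= 2 (1 - delta^2) / (delta^2 - eps) + 1, from the proposition above
    assert eps < delta ** 2, "the bound only applies when eps < delta^2"
    return int(2 * (1 - delta ** 2) / (delta ** 2 - eps) + 1)

print(max_nearly_orthogonal(0.3, 0.01))  # -> 23
```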
(I believe this upper bound for the number of almost-orthogonal vectors is actually basically exactly met in sufficiently high dimensions — I can probably provide a proof (sketch) if anyone expresses interest.)
Remark. If $\epsilon > \delta^2$, then one starts to get exponentially many vectors in the dimension again, as one can see by picking a bunch of random vectors on the boundary of the spherical cap.
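A rough numerical sketch of that remark (dimension and sample count are arbitrary): unit vectors sampled on the boundary of the cap have pairwise dot products concentrating around $\delta^2$, so once $\epsilon$ is noticeably above $\delta^2$, random cap-boundary vectors already satisfy the constraint in bulk.

```python
import numpy as np

n, m, delta = 4096, 200, 0.3
rng = np.random.default_rng(0)
d = np.zeros(n)
d[0] = 1.0

u = rng.normal(size=(m, n))
u[:, 0] = 0.0                                  # make the u_i orthogonal to d
u /= np.linalg.norm(u, axis=1, keepdims=True)
v = delta * d + np.sqrt(1 - delta ** 2) * u    # unit vectors on the cap boundary

gram = v @ v.T
off_diag = gram[~np.eye(m, dtype=bool)]
print(off_diag.min(), off_diag.mean(), off_diag.max())  # clustered around delta^2 = 0.09
```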
What about the philosophical point? (low-quality section)
Ok, the math seems to have issues, but does the philosophical point stand up to scrutiny? Idk, maybe — I haven’t really read the post to check relevant numbers or to extract all the pertinent bits to answer this well. It’s possible it goes through with a significantly smaller δ or if the vectors weren’t really that orthogonal or something. (To give a better answer, the first thing I’d try to understand is whether this behavior is basically first-order — more precisely, is there some reasonable loss function on perturbations on the relevant activation space which captures perturbations being coding perturbations, and are all of these vectors first-order perturbations toward coding in this sense? If the answer is yes, then there just has to be such a vector d — it’d just be the gradient of this loss.)
Hmm, with that we’d need δ ≤ 0.05 to get 800 orthogonal vectors.[1] This seems pretty workable. If we take the MELBO vector magnitude change (7 → 20) as an indication of how much the cosine similarity changes, then this is consistent with δ = 0.15 for the original vector. This seems plausible for a steering vector?

[1] Thanks to @Lucius Bushnaq for correcting my earlier wrong number.
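As a quick sanity check on the 800 figure, plugging δ = 0.05 and exact orthogonality (ϵ = 0) into the bound from the parent comment:

```python
delta, eps = 0.05, 0.0
print(2 * (1 - delta ** 2) / (delta ** 2 - eps) + 1)  # -> 799.0, i.e. roughly 800
```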
You’re right, I mixed up intuitions and math about the inner product and cosine similarity, which resulted in many errors. I added a disclaimer at the top of my comment. Sorry for my sloppy math, and thank you for pointing it out.
I think my math is right if one only looks at the inner product between d and theta, not the cosine similarity. So I think my original intuition still holds.