The result of averaging the first 20 generated orthogonal vectors [...]
Have you tried scaling up the resulting vector, after averaging, so that its norm is similar to the norms of the individual vectors that are being averaged?
If you take n orthogonal vectors, all of which have norm a, and average them, the norm of the average is (I think?) a/√n.
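(Quick sanity check of that claim, assuming the vectors are exactly orthogonal: by the Pythagorean theorem,

$$\Bigl\lVert \tfrac{1}{n}\sum_{i=1}^{n} v_i \Bigr\rVert^2 \;=\; \frac{1}{n^2}\sum_{i=1}^{n} \lVert v_i \rVert^2 \;=\; \frac{n a^2}{n^2} \;=\; \frac{a^2}{n},$$

so the norm of the average is indeed a/√n.)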
As you note, the individual vectors don’t work if scaled down from norm 20 to norm 7. The norm will become this small once we are averaging 8 or more vectors, since √8≈20/7, so we shouldn’t expect these averages to “work”—even the individual orthogonal vectors don’t work if they are scaled down this much.
Another way to look at it: suppose that these vectors do compose linearly, in the sense that adding several of them together will “combine” (in some intuitive sense) the effects we observe when steering with the vectors individually. But an average is the sum of n vectors each of which is scaled down by 1/n. Under this hypothesis, once n≥20/7≈3, we should expect the average to fail in the same way the individual vectors fail when scaled to norm 7, since the scaled-down individual vectors all fail, and so the “combination” of their elicited behaviors is also presumably a failure.[1] (This hypothesis also implies that the “right” thing to do is simply summing vectors as opposed to averaging them.)
Both of these hypotheses, but especially the one in the previous paragraph, predict what you observed about averages generally producing similar behavior (refusals) to scaled-to-norm-7 vectors, and small-n averages coming the closest to “working.” In any case, it’d be easy to check whether this is what is going on or not.
[1] Note that here we are supposing that the norms of the individual “scaled summands” in the average are what matters, whereas in the previous paragraph we imagined the norm of the average vector was what mattered. Hence the scaling with n (“scaled summands”) as opposed to √n (“norm of average”). The “scaled summands” perspective makes somewhat more intuitive sense to me.
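To make the numbers behind the two perspectives concrete, here is a rough sketch (numpy; random orthogonal directions of norm 20 stand in for the actual steering vectors, and the dimension is just illustrative):

```python
import numpy as np

d = 1024          # illustrative model dimension
a = 20.0          # norm of each individual steering vector
working_norm = 7  # norm at which individual vectors reportedly stop working

# Random pairwise-orthogonal vectors of norm a, standing in for the real ones.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((d, 20)))  # Q has orthonormal columns
vectors = a * Q.T                                  # shape (20, d): orthogonal rows, each of norm a

for n in [2, 3, 4, 8, 9, 16, 20]:
    avg = vectors[:n].mean(axis=0)
    norm_of_average = np.linalg.norm(avg)  # ~ a / sqrt(n)   ("norm of average" view)
    norm_of_summand = a / n                # each v_i / n     ("scaled summands" view)
    print(f"n={n:2d}  |average| = {norm_of_average:5.2f}   |each summand| = {norm_of_summand:5.2f}")

# |average| falls below 7 around n = 8-9 (20/sqrt(n) < 7 once n > (20/7)^2 ≈ 8.2),
# while |each summand| is already below 7 at n = 3 (20/n < 7 once n > 20/7 ≈ 2.9).
```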
This seems to be right for the coding vectors! When I take the mean of the first n vectors and then scale that by √n, it also produces a coding vector.
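(Concretely, “scaled mean” here means something like the sketch below; `coding_vectors` is a hypothetical (num_vectors, d_model) tensor holding the generated orthogonal vectors, and the names are illustrative rather than the actual code.)

```python
import torch

def scaled_mean(vectors: torch.Tensor, n: int) -> torch.Tensor:
    """Mean of the first n steering vectors, scaled back up by sqrt(n).

    If the vectors are (roughly) orthogonal with norm a, the plain mean has
    norm ~a/sqrt(n), so multiplying by sqrt(n) restores a norm of ~a.
    """
    return vectors[:n].mean(dim=0) * n ** 0.5

# e.g. steer with scaled_mean(coding_vectors, 10) in place of an individual vector,
# where coding_vectors is a (num_vectors, d_model) tensor of the generated vectors.
```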
Here’s some sample output from using the scaled means of the first n coding vectors.
With the scaled means of the alien vectors, the outputs have a similarly pretty vibe to the ones from the original alien vectors, but don’t seem to talk about bombs as much.
The scaled means of the STEM problem vectors sometimes give more STEM problems but sometimes give jailbreaks. The jailbreaks say some pretty nasty stuff, so I’m not going to post those results here.
The scaled means of the jailbreak vectors sometimes give more jailbreaks but also sometimes tell stories in the first or second person. I’m also not going to post the results for this one.