Have you compared this method (finding vectors that, as I understand it, change downstream activations as much as possible) with just using random vectors? (I didn’t see this in the post, but I might have just missed it.)
In particular, does that yield qualitatively similar results?
Naively, I would expect that this would be qualitatively similar for some norm of random vector. So, I’d be interested in some ablations of the technique.
If random vectors work, that would simplify the story somewhat: you can see salient and qualitatively distinct behaviors via randomly perturbing activations.
(Random vectors probably have to have a somewhat higher norm to yield qualitatively as large an effect as vectors which are optimized for changing downstream activations. However, I currently don’t see a particular a priori (non-empirical) reason to think that there doesn’t exist some norm at which the results are similar.)
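To be concrete, the ablation I have in mind is just a norm-matched random baseline: sample a direction uniformly on the sphere and scale it to the same radius as a learned steering vector, then compare continuations. A minimal sketch (the dimension and the "learned" vector below are illustrative stand-ins, not from the post):

```python
import numpy as np

def random_vector_with_norm(d: int, r: float, rng: np.random.Generator) -> np.ndarray:
    """Sample a direction uniformly on the unit sphere in R^d, scaled to radius r."""
    v = rng.standard_normal(d)
    return v * (r / np.linalg.norm(v))

rng = np.random.default_rng(0)
learned = rng.standard_normal(4096)          # stand-in for a learned steering vector
r = np.linalg.norm(learned)                  # match the learned vector's norm
baseline = random_vector_with_norm(4096, r, rng)

# The baseline has exactly the same norm, so any behavioral difference
# comes from direction, not magnitude.
assert np.isclose(np.linalg.norm(baseline), r)
```

One could then sweep a multiplier on r to check whether any scale of random perturbation reproduces the learned vectors' effects.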
It’s a good experiment to run, but the answer is “no, the results are not similar.” From the post (the first bit of emphasis added):
I hypothesize that the reason why the method works is due to the noise-stability of deep nets. In particular, my subjective impression (from experiments) is that for random steering vectors, there is no Goldilocks value of R which leads to meaningfully different continuations. In fact, if we take random vectors with the same radius as “interesting” learned steering vectors, the random vectors typically lead to uninteresting re-phrasings of the model’s unsteered continuation, if they even lead to any changes (a fact previously observed by Turner et al. (2023))[7][8]. Thus, in some sense, learned vectors (or more generally, adapters) at the Goldilocks value of R are very special; the fact that they lead to any downstream changes at all is evidence that they place significant weight on structurally important directions in activation space[9].
I think @wesg’s recent post on pathological SAE reconstruction errors is relevant here. It points out that there are very particular directions such that intervening on activations along these directions significantly impacts downstream model behavior, while this is not the case for most randomly sampled directions.
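A toy way to see the asymmetry that post points at: if the downstream computation amplifies only a few directions and attenuates the rest, then a perturbation along an amplifying direction changes the output a lot, while a random perturbation of the same norm barely registers, because it has ~1/√d overlap with the special direction. The matrix below is a made-up stand-in with one large singular value, not anything fit to a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512

# Toy "downstream computation": singular values mostly 0.1, one amplifying
# direction with singular value 10 (purely illustrative numbers).
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
s = np.full(d, 0.1)
s[0] = 10.0
W = U @ np.diag(s) @ V.T

special = V[:, 0]                        # the amplified input direction (unit norm)
random_dir = rng.standard_normal(d)
random_dir /= np.linalg.norm(random_dir)  # random unit vector, same norm as special

print(np.linalg.norm(W @ special))       # large: the full singular value, 10
print(np.linalg.norm(W @ random_dir))    # much smaller: random overlap with V[:,0] is ~1/sqrt(d)
```

With more layers composed, the gap compounds, which matches the noise-stability intuition in the quoted passage.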
Also see @jake_mendel’s great comment for an intuitive explanation of why (probably) this is the case.
Thanks! I feel dumb for missing that section. Interesting that this is so different from random.