Have you compared this method (finding vectors that, as I understand it, change downstream activations as much as possible) with just using random vectors? (I didn’t see this in the post, but I might have just missed it.)
In particular, does that yield qualitatively similar results?
Naively, I would expect that this would be qualitatively similar for some norm of random vector. So, I’d be interested in some ablations of the technique.
If random vectors work, that would simplify the story somewhat: you can see salient and qualitatively distinct behaviors via randomly perturbing activations.
(Random vectors probably have to have a somewhat higher norm to yield qualitatively as large an effect as vectors which are optimized for changing downstream activations. However, I currently don’t see a particular a priori (non-empirical) reason to think that there doesn’t exist some norm at which the results are similar.)
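To be concrete, the ablation I have in mind is just a norm-matched random baseline: sample a direction uniformly on the sphere and scale it to the same radius as a learned steering vector, then compare continuations. A minimal sketch (the dimension and the "learned" vector below are illustrative stand-ins, not from the post):

```python
import numpy as np

def random_vector_with_norm(d: int, r: float, rng: np.random.Generator) -> np.ndarray:
    """Sample a direction uniformly on the unit sphere in R^d, scaled to radius r."""
    v = rng.standard_normal(d)
    return v * (r / np.linalg.norm(v))

rng = np.random.default_rng(0)
learned = rng.standard_normal(4096)          # stand-in for a learned steering vector
r = np.linalg.norm(learned)                  # match the learned vector's norm
baseline = random_vector_with_norm(4096, r, rng)

# The baseline has exactly the same norm, so any behavioral difference
# comes from direction, not magnitude.
assert np.isclose(np.linalg.norm(baseline), r)
```

One could then sweep a multiplier on r to check whether any scale of random perturbation reproduces the learned vectors' effects.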
It’s a good experiment to run, but the answer is “no, the results are not similar.” From the post (the first bit of emphasis added):
I hypothesize that the reason why the method works is due to the noise-stability of deep nets. In particular, my subjective impression (from experiments) is that for random steering vectors, there is no Goldilocks value of R which leads to meaningfully different continuations. In fact, if we take random vectors with the same radius as “interesting” learned steering vectors, the random vectors typically lead to uninteresting re-phrasings of the model’s unsteered continuation, if they even lead to any changes (a fact previously observed by Turner et al. (2023))[7][8]. Thus, in some sense, learned vectors (or more generally, adapters) at the Goldilocks value of R are very special; the fact that they lead to any downstream changes at all is evidence that they place significant weight on structurally important directions in activation space[9].
I think @wesg’s recent post on pathological SAE reconstruction errors is relevant here. It points out that there are very particular directions such that intervening on activations along these directions significantly impacts downstream model behavior, while this is not the case for most randomly sampled directions.
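A toy way to see the asymmetry that post points at: if the downstream computation amplifies only a few directions and attenuates the rest, then a perturbation along an amplifying direction changes the output a lot, while a random perturbation of the same norm barely registers, because it has ~1/√d overlap with the special direction. The matrix below is a made-up stand-in with one large singular value, not anything fit to a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512

# Toy "downstream computation": singular values mostly 0.1, one amplifying
# direction with singular value 10 (purely illustrative numbers).
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
s = np.full(d, 0.1)
s[0] = 10.0
W = U @ np.diag(s) @ V.T

special = V[:, 0]                        # the amplified input direction (unit norm)
random_dir = rng.standard_normal(d)
random_dir /= np.linalg.norm(random_dir)  # random unit vector, same norm as special

print(np.linalg.norm(W @ special))       # large: the full singular value, 10
print(np.linalg.norm(W @ random_dir))    # much smaller: random overlap with V[:,0] is ~1/sqrt(d)
```

With more layers composed, the gap compounds, which matches the noise-stability intuition in the quoted passage.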
Also see @jake_mendel’s great comment for an intuitive explanation of why (probably) this is the case.
Thanks! I feel dumb for missing that section. Interesting that this is so different from random.