TurnTrout comments on Mechanistically Eliciting Latent Behaviors in Language Models

TurnTrout 30 Apr 2024 19:11 UTC
LW: 27 AF: 20
5
AF
I’m really excited about Andrew’s discovery here. With it, maybe we can get a more complete picture of what these models can do, and how. This feels like the most promising new idea I’ve seen in a while. I expect it to open up a few new affordances and research directions. Time will tell how reliable and scalable this technique is. I sure hope this technique gets the attention and investigation it (IMO) deserves.
More technically, his discovery unlocks the possibility of unsupervised capability elicitation, whereby we can automatically discover a subset of “nearby” abilities and behavioral “modes”, without the intervention itself “teaching” the model the elicited ability or information.
- tailcalled 1 May 2024 8:08 UTC
  LW: 6 AF: 4
  1
  AF
  I think one characteristic of steering vectors constructed this way is that they are allowed to be off-manifold, so they don’t necessarily tell us how the networks currently work, but rather how they can be made to work with adaptations.
  
  For the past few weeks, I’ve been thinking about how to interpret networks on-manifold. The most straightforward approach I could come up with was to restrict oneself to the space of activations that actually occur for a given prompt, e.g. by performing SVD of the token x activation matrix for a given layer, and then restricting oneself to changes that occur along the right singular vectors of this.
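  
  A minimal sketch of that restriction, assuming NumPy, an uncentered activation matrix of shape (tokens, features), and a hypothetical helper name `restrict_to_activation_subspace`. The toy data and the choice of `k` are illustrative only, not taken from the post:

```python
import numpy as np

def restrict_to_activation_subspace(acts, delta, k):
    """Project a perturbation `delta` onto the span of the top-k right
    singular vectors of `acts` (shape: n_tokens x d_model)."""
    # Rows of vt are the right singular vectors, i.e. directions in feature space.
    _, _, vt = np.linalg.svd(acts, full_matrices=False)
    basis = vt[:k]                      # (k, d_model), orthonormal rows
    return basis.T @ (basis @ delta)    # orthogonal projection onto their span

rng = np.random.default_rng(0)
# Toy activations that only occupy a 3-dimensional subspace of a 16-d space.
acts = rng.normal(size=(10, 3)) @ rng.normal(size=(3, 16))
delta = rng.normal(size=16)             # hypothetical steering vector
delta_on = restrict_to_activation_subspace(acts, delta, k=3)
```

  The projected vector `delta_on` is a linear combination of activations that actually occurred for the prompt, which is the sense of "on-manifold" used here.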
  
  My SVD idea might improve things, but I didn’t get around to testing it because I eventually decided that it wasn’t good enough for my purposes because 1) it wouldn’t keep you on-manifold enough because it could still introduce unnatural placement of information and exaggerated features, 2) given that transformers are pretty flexible and you can e.g. swap around layers, it felt unclean to have a method that’s this strongly dependent on the layer structure.
  
  A follow-up idea I’ve been thinking about, but haven’t been able to be satisfied with, is projections. If you pick some vector u, project the activations onto u (or the weights, in my theory, but you work more with activations so let’s consider activations; it should be similar), and then subtract that projection from the original activations, you get “the activations with u removed”, which intuitively seems like it would better focus on “what the network actually does” as opposed to “what the network could do if you added something more to it”.
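  
  As a sketch of "the activations with u removed", assuming NumPy and a hypothetical helper name `remove_direction`; the random data is for illustration only:

```python
import numpy as np

def remove_direction(acts, u):
    """Subtract from each activation row its projection onto u."""
    u_hat = u / np.linalg.norm(u)                 # normalise so u's length is irrelevant
    return acts - np.outer(acts @ u_hat, u_hat)   # a - (a . u_hat) u_hat, row by row

rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 8))    # toy activations: 5 tokens, 8 features
u = rng.normal(size=8)            # direction to remove
cleaned = remove_direction(acts, u)
```

  After this step every row of `cleaned` is orthogonal to `u`, and applying the function a second time changes nothing.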
  
  Unfortunately after thinking for a while, I started thinking this actually wouldn’t work. Let’s say the activations a = x b + y c, where b is a large activation vector that ultimately doesn’t have an effect on the final prediction, and c is a small activation vector that does have an effect. If you have some vector d that the network doesn’t use at all, you could then project away sqrt(1/2) (b-d), which would introduce the d vector into the activations.
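  
  The counterexample can be checked numerically. This sketch assumes b, c and d are orthonormal (taken as the standard basis of a 3-d space) and picks arbitrary values for x and y; neither choice comes from the comment itself:

```python
import numpy as np

b, c, d = np.eye(3)             # assumption: b, c, d are orthonormal
x, y = 10.0, 0.1                # large coefficient on b, small one on c
a = x * b + y * c               # the activation; its d-coefficient is 0

u = np.sqrt(0.5) * (b - d)      # unit vector along b - d
a_proj = a - (a @ u) * u        # project u away from a

d_before = a @ d                # 0.0
d_after = a_proj @ d            # x / 2: d has been introduced
# a_proj is approximately [5, 0.1, 5], i.e. (x/2) b + y c + (x/2) d
```

  Algebraically, a·u = x/sqrt(2), so the subtracted term is (x/2)(b - d), which leaves (x/2) b + y c + (x/2) d.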
  
  Another idea I’ve thought about is this: suppose you do SVD of the activations. You would multiply the feature half of the SVD with the weights used to compute the KQV matrices, and then perform SVD of that, which should get you the independent ways that one layer affects the next layer. One thing I wonder about in particular is whether, if you start doing this from the output layer and proceed backwards, this would have the effect of “sparsifying” the network down to only the dimensions which matter for the final output, which seems like it should assist in interpretability and such. But it’s not clear it interacts nicely with the residual network element.
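  
  One possible reading of the two-stage SVD for a single layer, sketched with NumPy. The shapes, the concatenated `w_qkv` matrix, and the choice to fold the singular values into the "feature half" are all assumptions; the comment does not specify them, and the backwards pass over layers and the residual stream are not modelled:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 12, 16, 4
acts = rng.normal(size=(n_tokens, d_model))      # toy layer activations
w_qkv = rng.normal(size=(d_model, 3 * d_head))   # hypothetical Q, K, V weights, concatenated

# First SVD: acts = U @ diag(s) @ vt. The "feature half" is taken to be diag(s) @ vt.
_, s, vt = np.linalg.svd(acts, full_matrices=False)
feature_half = s[:, None] * vt                   # (rank, d_model)

# Second SVD: how the occupied activation directions feed into Q, K, V.
u2, s2, vt2 = np.linalg.svd(feature_half @ w_qkv, full_matrices=False)
```

  Under this reading, `s2` equals the singular values of `acts @ w_qkv`, because dropping the token-side factor U (which has orthonormal columns) does not change them. Directions with near-zero entries in `s2` are ones this layer does not pass on through Q, K or V.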
  - Jordan Taylor 30 Nov 2024 1:53 UTC
    1 point
    0
    Couldn’t you do something like fit a Gaussian to the model’s activations, then restrict the steered activations to be high likelihood (low Mahalanobis distance)? Or (almost) equivalently, you could just do a whitening transformation to activation space before you constrain the L2 distance of the perturbation.
    
    (If a Gaussian isn’t expressive enough you could model the manifold in some other way, e.g. with a VAE anomaly detector or a mixture of Gaussians or whatever.)
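    
    A minimal sketch of the Gaussian version, assuming NumPy. The helper names (`fit_whitener`, `mahalanobis`, `clip_perturbation`), the `eps` regulariser, and the radius are hypothetical choices for illustration:

```python
import numpy as np

def fit_whitener(acts, eps=1e-6):
    """Fit a Gaussian to activations; return its mean, covariance, and a
    ZCA whitening matrix w satisfying w @ cov @ w.T = I."""
    mu = acts.mean(axis=0)
    cov = np.cov(acts, rowvar=False) + eps * np.eye(acts.shape[1])
    evals, evecs = np.linalg.eigh(cov)
    w = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return mu, cov, w

def mahalanobis(x, mu, w):
    """Mahalanobis distance of x from the fitted Gaussian."""
    return np.linalg.norm(w @ (x - mu))

def clip_perturbation(delta, w, radius):
    """Rescale delta so that its L2 norm in whitened space is at most radius."""
    norm = np.linalg.norm(w @ delta)
    return delta if norm <= radius else delta * (radius / norm)

rng = np.random.default_rng(0)
# Toy activations with very different variances per feature.
acts = rng.normal(size=(500, 8)) * np.linspace(0.1, 3.0, 8)
mu, cov, w = fit_whitener(acts)
delta = 10.0 * rng.normal(size=8)        # hypothetical steering perturbation
clipped = clip_perturbation(delta, w, radius=2.0)
```

    Bounding the whitened norm of the perturbation bounds how far, in Mahalanobis terms, the steered activation moves from the unsteered one. Bounding the steered activation's own distance from the mean, via `mahalanobis`, is the closely related alternative.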
  - 4gate 6 May 2024 18:31 UTC
    1 point
    0
    AF
    Maybe a dumb question but (1) how can we know for sure if we are on manifold, (2) why is it so important to stay on manifold? I’m guessing that you mean that vaguely we want to stay within the space of possible activations induced by inputs from data that is in some sense “real-world.” However, there appear to be a couple complications: (1) measuring distributional properties of later layers from small to medium sized datasets doesn’t seem like a realistic estimate of what should be expected of an on-manifold vector since it’s likely later layers are more semantically/high-level focused and sparse; (2) what people put into the inputs does change in small ways simply due to new things happening in the world, but also there are prompt engineering attacks that people use that are likely in some sense “off-distribution” but still in the real world and I don’t think we should ignore these fully. Is this notion of a manifold a good way to think about the notion of getting indicative information of real world behavior? Probably, but I’m not sure so I thought I might ask. I am new to this field.
    I do think at the end of the day we want indicative information, so I think somewhat artificial environments might at times have a certain usefulness.
    Also one convoluted (perhaps inefficient) idea but which felt kind of fun to stay on manifold is to do the following: (1) train your batch of steering vectors, (2) optimize in token space to elicit those steering vectors (i.e. by regularizing for the vectors to be close to one of the token vectors or by using an algorithm that operates on text), (3) check those tokens to make sure that they continue to elicit the behavior and are not totally wacky. If you cannot generate that steer from something that is close to a prompt, surely it’s not on manifold right? You might be able to automate by looking at perplexity or training a small model to estimate that an input prompt is a “realistic” sentence or whatever.
    Curious to hear thoughts :)
    - tailcalled 7 May 2024 13:59 UTC
      2 points
      0
      I think it’s easier to see the significance if you imagine the neural networks as a human-designed system. In e.g. a computer program, there’s a clear distinction between the code that actually runs and the code that hypothetically could run if you intervened on the state, and in order to explain the output of the program, you only need to concern yourself with the former, rather than also needing to consider the latter.
      
      For neural networks, I sort of assume there’s a similar thing going on, except it’s quite hard to define it precisely. In technical terms, neural networks lack a privileged basis which distinguishes different components of the network, so one cannot pick a discrete component and ask whether it runs and if so how it runs.
      
      This is a somewhat different definition of “on-manifold” than is usually used, as it doesn’t concern itself with the real-world data distribution. Maybe it’s wrong of me to use the term like that, but I feel like the two meanings are likely to be related, since the real-world distribution of data shaped the inner workings of the neural network. (I think this makes most sense in the context of the neural tangent kernel, though ofc YMMV as the NTK doesn’t capture nonlinearities.)
      
      In principle I don’t think it’s always important to stay on-manifold, it’s just what one of my lines of thought has been focused on. E.g. if you want to identify backdoors, going off-manifold in this sense doesn’t work.
      
      I agree with you that it is sketchy to estimate the manifold from wild empiricism. Ideally I’m thinking one could use the structure of the network to identify the relevant components for a single input, but I haven’t found an option I’m happy with.
      
      > Also one convoluted (perhaps inefficient) idea but which felt kind of fun to stay on manifold is to do the following: (1) train your batch of steering vectors, (2) optimize in token space to elicit those steering vectors (i.e. by regularizing for the vectors to be close to one of the token vectors or by using an algorithm that operates on text), (3) check those tokens to make sure that they continue to elicit the behavior and are not totally wacky. If you cannot generate that steer from something that is close to a prompt, surely it’s not on manifold right? You might be able to automate by looking at perplexity or training a small model to estimate that an input prompt is a “realistic” sentence or whatever.
      
      Maybe. But isn’t optimization in token-space pretty flexible, such that this is a relatively weak test?
      
      Realistically, steering vectors can be useful even if they go off-manifold, so I’d hold off on trying to measure how on-manifold things are until a method has been developed specifically to stay on-manifold. Then one can maybe adapt the measurement to the needs of that method.