4gate comments on Mechanistically Eliciting Latent Behaviors in Language Models

4gate 6 May 2024 18:31 UTC
1 point
Maybe a dumb question, but (1) how can we know for sure whether we are on-manifold, and (2) why is it so important to stay on-manifold? I’m guessing you mean, roughly, that we want to stay within the space of activations induced by inputs that are in some sense “real-world.” However, there appear to be a couple of complications: (1) measuring distributional properties of later layers from small- to medium-sized datasets doesn’t seem like a realistic estimate of what to expect of an on-manifold vector, since later layers are likely more semantic/high-level and sparse; (2) what people put into the inputs changes in small ways simply because new things happen in the world, and there are also prompt-engineering attacks that are likely in some sense “off-distribution” but still occur in the real world, and I don’t think we should ignore these fully. Is this notion of a manifold a good way to think about getting indicative information about real-world behavior? Probably, but I’m not sure, so I thought I’d ask. I am new to this field.
I do think that, at the end of the day, we want indicative information, so somewhat artificial environments might at times be useful.
Also one convoluted (perhaps inefficient) idea but which felt kind of fun to stay on manifold is to do the following: (1) train your batch of steering vectors, (2) optimize in token space to elicit those steering vectors (i.e. by regularizing for the vectors to be close to one of the token vectors or by using an algorithm that operates on text), (3) check those tokens to make sure that they continue to elicit the behavior and are not totally wacky. If you cannot generate that steer from something that is close to a prompt, surely it’s not on manifold right? You might be able to automate by looking at perplexity or training a small model to estimate that an input prompt is a “realistic” sentence or whatever.
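To make steps (2) and (3) a bit more concrete, here is a minimal sketch of the two checks in plain Python. Everything in it is an assumption for illustration: the toy embedding table, the helper names (`nearest_token`, `proximity_penalty`, `is_plausible_prompt`), and the perplexity threshold are hypothetical, and in practice the embeddings and per-token log-probabilities would come from the actual model.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_token(vector, embeddings):
    """Return (token, similarity) for the token embedding closest to `vector`."""
    return max(
        ((tok, cosine_similarity(vector, emb)) for tok, emb in embeddings.items()),
        key=lambda pair: pair[1],
    )

def proximity_penalty(vector, embeddings):
    """Regularization term: 0 when `vector` points exactly along some token embedding."""
    _, similarity = nearest_token(vector, embeddings)
    return 1.0 - similarity

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood), using natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def is_plausible_prompt(token_logprobs, max_perplexity=50.0):
    """Crude realism filter; the threshold is a hypothetical placeholder."""
    return perplexity(token_logprobs) <= max_perplexity

# Toy example: a 2-d "embedding table" and a candidate vector.
toy_embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
candidate = [1.0, 0.1]
token, similarity = nearest_token(candidate, toy_embeddings)
penalty = proximity_penalty(candidate, toy_embeddings)
```

The penalty could be added to the steering objective so the optimized vectors stay near token space, and the perplexity filter applied afterwards to discard prompts that no longer read like real text.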
Curious to hear thoughts :)
- tailcalled 7 May 2024 13:59 UTC
  2 points
  I think it’s easier to see the significance if you imagine a neural network as a human-designed system. In e.g. a computer program, there’s a clear distinction between the code that actually runs and the code that hypothetically could run if you intervened on the state, and in order to explain the output of the program, you only need to concern yourself with the former, rather than also needing to consider the latter.
  
  For neural networks, I sort of assume there’s a similar thing going on, except it’s quite hard to define it precisely. In technical terms, neural networks lack a privileged basis which distinguishes different components of the network, so one cannot pick a discrete component and ask whether it runs and if so how it runs.
  
  This is a somewhat different definition of “on-manifold” than is usually used, as it doesn’t concern itself with the real-world data distribution. Maybe it’s wrong of me to use the term like that, but I feel like the two meanings are likely to be related, since the real-world distribution of data shaped the inner workings of the neural network. (I think this makes most sense in the context of the neural tangent kernel, though ofc YMMV as the NTK doesn’t capture nonlinearities.)
  
  In principle I don’t think it’s always important to stay on-manifold; it’s just what one of my lines of thought has been focused on. E.g. if you want to identify backdoors, going off-manifold in this sense doesn’t work.
  
  I agree with you that it is sketchy to estimate the manifold from wild empiricism. Ideally I’m thinking one could use the structure of the network to identify the relevant components for a single input, but I haven’t found an option I’m happy with.
  
  > Also one convoluted (perhaps inefficient) idea but which felt kind of fun to stay on manifold is to do the following: (1) train your batch of steering vectors, (2) optimize in token space to elicit those steering vectors (i.e. by regularizing for the vectors to be close to one of the token vectors or by using an algorithm that operates on text), (3) check those tokens to make sure that they continue to elicit the behavior and are not totally wacky. If you cannot generate that steer from something that is close to a prompt, surely it’s not on manifold right? You might be able to automate by looking at perplexity or training a small model to estimate that an input prompt is a “realistic” sentence or whatever.
  
  Maybe. But isn’t optimization in token-space pretty flexible, such that this is a relatively weak test?
  
  Realistically, steering vectors can be useful even if they go off-manifold, so I’d hold off on trying to measure how on-manifold something is until a method has been developed specifically to stay on-manifold. Then one can maybe adapt the measurement to the needs of that method.