There are a lot of different things that we’ll probably be trying here, and I don’t know what will actually end up working, but I think the ideal thing that you could do would be a direct comparison between the features that activate in training for backdoored vs. non-backdoored models, and see if there are differences there that are correlated with lying, deception, etc. Probes are also good tools here, but it’s harder to use a probe to do a comparison between models, and it’s harder to validate that any differences you’ve found are meaningful. Effectively, a linear probe is equivalent to training a set of dictionary learning features specifically for the probe dataset, but if you trained them specifically for that dataset, then it’s easier to just overfit, whereas if you learned your features in an unsupervised way over the whole pre-training dataset, and then discovered there was one that was correlated with deception in multiple contexts and could identify backdoored models, I think that’s much more compelling.
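For concreteness, here’s a minimal sketch of the kind of comparison I have in mind, assuming you already have sparse-autoencoder feature activations for both models on a shared prompt set (the matrices below are random placeholders and all names are made up):

```python
import numpy as np

# Hypothetical setup: suppose we've already run a sparse autoencoder over
# activations from each model on the same evaluation prompts, giving a
# [n_prompts, n_features] matrix of feature activations per model.
# The matrices here are random stand-ins just to make the sketch runnable.
rng = np.random.default_rng(0)
n_prompts, n_features = 1000, 4096
feats_clean = rng.exponential(scale=1.0, size=(n_prompts, n_features))
feats_backdoored = rng.exponential(scale=1.0, size=(n_prompts, n_features))

# Compare how strongly each feature fires in the two models on average.
mean_clean = feats_clean.mean(axis=0)
mean_backdoored = feats_backdoored.mean(axis=0)
diff = mean_backdoored - mean_clean

# Rank features by how much more active they are in the backdoored model;
# these are the candidates to inspect for lying/deception-related meaning.
top_features = np.argsort(diff)[::-1][:20]
for f in top_features:
    print(f"feature {f}: clean={mean_clean[f]:.3f}, backdoored={mean_backdoored[f]:.3f}")
```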
ideal thing that you could do would be a direct comparison between the features that activate in training for backdoored vs. non-backdoored models, and see if there are differences there that are correlated with lying, deception, etc.
Is the hope that this would transfer into a general rule which would also apply even in cases where you don’t have a “non-backdoored” model to work with? Or is the hope just to learn some interesting things about how these models work internally, which could have miscellaneous applications?
whereas if you learned your features in an unsupervised way over the whole pre-training dataset, and then discovered there was one that was correlated with deception in multiple contexts and could identify backdoored models, I think that’s much more compelling
Sure, but the actual case is that there will be at least thousands of “features” associated with deception, many of which will operate in very domain-specific ways, etc. (assuming you’ve actually captured a notion of features which might correspond to what the model is actually doing). So, the question will be how you operate over this large group of features. Maybe the implicit claim is that averaging over this set of features will have better inductive biases than training a probe on some dataset, because averaging over the set of features nicely handles model capacity? Or that you can get some measure over this group of features which is better than just normally training a classifier?
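For concreteness, here’s a toy sketch of the two aggregation options I’m gesturing at, with purely made-up feature activations and labels standing in for the real thing:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical: activations of a group of "deception-related" SAE features on
# labeled examples (deceptive vs. not). Random stand-ins for illustration.
rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=(2000, 300))  # [examples, features in the group]
y = rng.integers(0, 2, size=2000)                 # deceptive? (placeholder labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Option 1: just average over the group of features to get a single score.
avg_score = X_test.mean(axis=1)

# Option 2: train a classifier over the same group of features --
# at which point you're basically back to training a probe.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
clf_score = clf.predict_proba(X_test)[:, 1]
```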
I guess it just feels to me like you’re turning to a really complicated and hard-to-use tool which only has a pretty dubious reason for working better than a simple, well-known, and easy-to-use tool. This feels like a mistake to me (but maybe I’m misunderstanding some important context). Minimally, I think it seems good to start by testing the probe baseline. If the probe approach works great, then it’s plausible that whatever autoencoder approach you end up trying works for the exact same reason as the probe works (they are correlated with some general notion of lying/deception which generalizes).
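By “probe baseline” I just mean something like the following, with placeholder activations and labels in place of real data from the model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder: residual-stream activations at some layer/token position,
# with labels for whether the example involves lying/deception.
rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 2048))   # [examples, d_model]
labels = rng.integers(0, 2, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000, C=0.1).fit(X_tr, y_tr)
print("held-out AUROC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```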
I feel somewhat inclined to argue about this because I think by default people have a tendency to do things which are somewhat more associated with “internals” or “mech interp” or “being unsupervised”, but which are in practice very similar to simple probing methods (see e.g. here for a case where I argue about something similar). I think this seems costly because it could waste a bunch of time and result in unjustified levels of confidence that people wouldn’t have if it were clear exactly what the technique was equivalent to. I’m not sure if you’re making a mistake here in this way, so sorry about picking on you in particular.
(TBC, there are totally ways you could use autoencoders/internals which aren’t at all equivalent to just training a classifier, but I think this requires looking at connections (either directly looking at the weights or running intervention experiments).)
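As a toy illustration of the intervention-experiment flavor of this (placeholder model and a made-up candidate direction, not a recipe): ablate a candidate feature direction from a layer’s output and see whether behavior changes.

```python
import torch
import torch.nn as nn

# Toy stand-in for an intervention experiment: remove the component of a
# layer's output along a candidate "deception" direction and compare outputs.
torch.manual_seed(0)
d_model = 64
model = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 2))
direction = torch.randn(d_model)
direction = direction / direction.norm()

def ablate_direction(module, inputs, output):
    # Project out the candidate direction from the layer's output.
    return output - (output @ direction).unsqueeze(-1) * direction

x = torch.randn(8, d_model)
handle = model[0].register_forward_hook(ablate_direction)
with torch.no_grad():
    ablated_logits = model(x)
handle.remove()
with torch.no_grad():
    baseline_logits = model(x)
print("max logit change from ablation:", (ablated_logits - baseline_logits).abs().max().item())
```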