ryan_greenblatt comments on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

ryan_greenblatt 12 Jan 2024 22:59 UTC
LW: 8 AF: 6

I think it’s quite plausible you’d see clear features related to deception in our models without needing to have the backdoor triggers

Would you expect this to work better than just training a probe to identify lying/deception/scheming and seeing if it fires more on average? If so why?

As in, you train the probe “off-policy” on examples of lying vs normal examples and then see if a probe trained in this way systematically fires more on models with backdoors.

You could either train the probe prior to inserting the backdoor (analogous to how I’d guess you do the autoencoder approach) and then have the probe generalize or train the probe after inserting the backdoor. (Either way, you’d plausibly want to have negative examples which don’t correspond to a generation from the model itself being schemy. E.g., you could use a generation from a human or whatever.)
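
A minimal sketch of the probe baseline described above, on synthetic data. Everything here is an assumption for illustration: the activations are Gaussian noise rather than real model internals, `direction` is a made-up "deception direction", and the 0.5 shift given to the backdoored model is planted by hand. It only shows the shape of the experiment: train a probe off-policy, then compare how often it fires on each model.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # hypothetical activation width

# Hypothetical "deception direction"; real activations would come from the model.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

def fake_activations(n, shift):
    """Synthetic stand-in for activations: Gaussian noise plus a shift along `direction`."""
    return rng.normal(size=(n, d)) + shift * direction

# Off-policy probe data: examples of lying (label 1) vs. normal examples (label 0).
X = np.vstack([fake_activations(500, 2.0), fake_activations(500, 0.0)])
y = np.concatenate([np.ones(500), np.zeros(500)])

def train_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe fit by plain gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def mean_score(X, w, b):
    """Average probe probability over a batch of activations."""
    return float(np.mean(1.0 / (1.0 + np.exp(-(X @ w + b)))))

w, b = train_probe(X, y)

# Activations of each model on the same ordinary prompts (no trigger present).
# The 0.5 shift for the backdoored model is an assumption built into the toy data.
normal_acts = fake_activations(1000, 0.0)
backdoored_acts = fake_activations(1000, 0.5)

normal_rate = mean_score(normal_acts, w, b)
backdoored_rate = mean_score(backdoored_acts, w, b)
print(f"normal: {normal_rate:.3f}  backdoored: {backdoored_rate:.3f}")
```

In this toy setup the gap between the two rates exists only because it was planted. On real models, the open question is what gap to expect from a non-backdoored model in the first place.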

(FWIW, I think the probe method is moderately likely to show some signal, but it’s unclear how much this matters because we don’t know what the baseline level of “deception” is. So the comparison to a normal model is importantly disanalogous. I have the same complaint about the SAE approach.)
- evhub 12 Jan 2024 23:44 UTC
  LW: 7 AF: 4
  
  Would you expect this to work better than just training a probe to identify lying/deception/scheming and seeing if it fires more on average? If so why?
  
  I think the main interesting thing you could do with the dictionary learning features that you couldn’t do with a probe is that you could try to do this in an unsupervised way.
  - ryan_greenblatt 12 Jan 2024 23:56 UTC
    LW: 5 AF: 3
    I don’t understand what you mean by unsupervised here?
    
    I’d guess the normal thing you’d do with the dictionary learning approach is look for a feature which activates on examples which look like deception. This seems quite supervised in that it requires you to identify deception-containing examples. You could instead just construct examples which look like deception and then train a probe. Or you could use a labeled dataset to select which autoencoder feature you’d like based on what activates.
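
    A rough sketch of that last option, selecting an autoencoder feature with a labeled dataset. The data is synthetic and every name is hypothetical: `acts` stands in for dictionary-feature activations, `labels` for deception annotations, and feature 7 is planted by hand so the selection has something to find.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_examples, n_features = 400, 64

    # Hypothetical labels: 1 = example looks like deception, 0 = ordinary example.
    labels = rng.integers(0, 2, size=n_examples)

    # Synthetic stand-in for sparse, non-negative autoencoder feature activations.
    acts = np.maximum(rng.normal(size=(n_examples, n_features)) - 1.0, 0.0)

    # Toy assumption: feature 7 is planted to fire more on the deception examples.
    planted = 7
    acts[:, planted] += 1.5 * labels

    def rank_features(acts, labels):
        """Score each feature by the gap in mean activation between the two label groups."""
        gap = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
        order = np.argsort(-gap)  # largest gap first
        return order, gap

    order, gap = rank_features(acts, labels)
    best = int(order[0])
    print("selected feature:", best)
    ```

    The labels do all of the selecting here, which is why this looks supervised in the same way that training a probe is.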
    
    Is there some other method you were thinking about?
    - evhub 13 Jan 2024 0:02 UTC
      LW: 5 AF: 3
      There are a lot of different things that we’ll probably be trying here, and I don’t know what will actually end up working, but I think the ideal thing that you could do would be a direct comparison between the features that activate in training for backdoored vs. non-backdoored models, and see if there are differences there that are correlated with lying, deception, etc. Probes are also good tools here, but it’s harder to use a probe to do a comparison between models, and it’s harder to validate that any differences you’ve found are meaningful. Effectively, a linear probe is equivalent to training a set of dictionary learning features specifically for the probe dataset, but if you trained them specifically for that dataset, then it’s easier to just overfit, whereas if you learned your features in an unsupervised way over the whole pre-training dataset, and then discovered there was one that was correlated with deception in multiple contexts and could identify backdoored models, I think that’s much more compelling.
      - ryan_greenblatt 13 Jan 2024 1:09 UTC
        LW: 24 AF: 15
        
        ideal thing that you could do would be a direct comparison between the features that activate in training for backdoored vs. non-backdoored models, and see if there are differences there that are correlated with lying, deception, etc.
        
        The hope would be that this would transfer to learning a general rule which would also apply even in cases where you don’t have a “non-backdoored” model to work with? Or maybe the hope is just to learn some interesting things about how these models work internally which could have misc applications?
        
        whereas if you learned your features in an unsupervised way over the whole pre-training dataset, and then discovered there was one that was correlated with deception in multiple contexts and could identify backdoored models, I think that’s much more compelling
        
        Sure, but the actual case is that there will be at least thousands of “features” associated with deception, many of which will operate in very domain-specific ways etc. (assuming you’ve actually captured a notion of features which might correspond to what the model is actually doing). So, the question will be how you operate over this large group of features. Maybe the implicit claim is that averaging over this set of features will have better inductive biases than training a probe on some dataset because averaging over the set of features nicely handles model capacity? Or that you can get some measure over this group of features which is better than just normally training a classifier?
        
        I guess it just feels to me like you’re turning to a really complicated and hard-to-use tool which only has a pretty dubious reason for working better than a simple, well-known, and easy-to-use tool. This feels like a mistake to me (but maybe I’m misunderstanding some important context). Minimally, I think it seems good to start by testing the probe baseline. If the probe approach works great, then it’s plausible that whatever autoencoder approach you end up trying works for the exact same reason as the probe works (they are correlated with some general notion of lying/deception which generalizes).
        
        I feel somewhat inclined to argue about this because I think by default people have a tendency to do things which are somewhat more associated with “internals” or “mech interp” or “being unsupervised” but which are in practice very similar to simple probing methods (see e.g. here for a case where I argue about something similar). I think this seems costly because it could waste a bunch of time and result in unjustified levels of confidence that people wouldn’t have if it was clear exactly what the technique was equivalent to. I’m not sure if you’re making a mistake here in this way, so sorry about picking on you in particular.
        - ryan_greenblatt 13 Jan 2024 1:40 UTC
          LW: 2 AF: 2
          (TBC, there are totally ways you could use autoencoders/internals which aren’t at all equivalent to just training a classifier, but I think this requires looking at connections (either directly looking at the weights or running intervention experiments).)
    - Bogdan Ionut Cirstea 13 Jan 2024 0:58 UTC
      1 point
      This post seems very relevant: https://www.lesswrong.com/posts/FbSAuJfCxizZGpcHc/interpreting-the-learning-of-deceit.