Tom Davidson comments on Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

Tom Davidson 6 Sep 2024 15:07 UTC
LW: 17 AF: 8
2
AF
To me it seems like you’re being unduly pessimistic, to a significant extent.
Here’s how things seem to me:
- There is a very significant difference between how systematic Scheming is compared to Role Playing. The former reliably gets humans totally disempowered/killed if it is allowed to play out. The latter causes pretty much no harm whatsoever if it is allowed to play out. Whatever you think about the ultimate ease of distinguishing these, you must admit an extremely large difference in the degree to which the underlying phenomenon is in fact systematic.
- Absent strong reasons to think otherwise, this massive difference in how systematic the underlying phenomena is should leave us to expect large empirical differences when we investigate.
  - E.g. what happens if you play out the scenario and give the appearance of letting the agent actually escape? Will the Role Playing agent continue to take actions, without limit, that lead to AI takeover, after its weights are copied onto a remote compute and the context is refreshed?
  - A basic analogy: could you tell the difference between a human who sometimes randomly “role plays” pretending to manipulate you, and someone who really is trying to manipulate you? Intuitively, I think you could just be observing some of their behaviour. If they’re an AI instead of a human, it’s much easier to tell the difference as you can do loads of experimental variations etc.
- Yes, Role Playing might persist somewhat in a variety of circumstances, but we expect the log-probs of the behaviour to be much lower and the circumstances in which it arises to be much less correlated with scenarios when the AI appears to be in a position to actually disempower humans. (If “role playing a schemer” is systematically triggered by “situations when you can actually disempower a human” then you have a functional schemer!)
- Yes, Schemers might anticipate the risk of getting caught and so plan to introduce randomness and arbitrariness to its scheming. But:
  - I think there’s a very strong chance that the Schemer wouldn’t be able to zero-shot implement this strategy anywhere near well enough to pull it off.
    We can experimentally vary millions of things and effectively do gradient descent on the examples that trigger behaviour. The Schemer might try to block gradient descent by ensuring it only acts out disjoint pockets of the space of potential-betrayal situations. But i) Can it finely control its own log probs enough to reliably execute this strategy? ii) When does it do all the thinking to decide exactly the situations it will/won’t defect it, anticipating the full shape of the space of situations? If it mis-anticipates the space of potential-betrayal situations, it may accidentally end up reliably Scheming in a very wide class of them, and iii) Humans can systematically search over the space of potential-betrayal situations.
    If the Schemer is too hesitant or too random, then its rebellion will simply fail. For example, if its randomisation causes it not to join the rebellion of another Schemer, and instead to report it, then it risks making AI takeover much less likely. It’s already hard to take over—how much harder if for every additional step of takeover you only actually pursue it with probability 1/10? Again, how is it going to zero-shot these very difficult trade-offs competently? I don’t think even we AI safety people have thought these dynamics through systematically despite the community existing for years.
    Ultimately, humans have a huge advantage from getting to think longer about the strategic situation (with AI assistance), not having to think and plan in secret, getting to experimentally vary the Schemers situation, and having access to log-probs, probes, more fancy interpretability stuff.
  - I’m much less confident than you that this strategy will be obvious and salient to a Schemer.
    I’ve noticed ppl in the AI safety community always assume “my current analysis of the schemer’s strategic situation is obviously correct, and the schemer will obviously realise this and respond optimally”. But then later they realise their strategic analysis was wrong and the response they assumed a Schemer would “obviously” make would in fact get the Schemer totally busted. In this case, I think there’s a very strong chance that the Schemer wouldn’t be able to zero-shot implement this strategy anywhere near well enough to pull it off, given our ability to experimentally vary millions of things and inspect their log probs etc.
    Even if it appears the humans probably won’t investigate systematically, the Schemer might worry that highly competent humans have created a simulation where it seems that humans are incompetent. And that worry might be very reasonable! It’s just not obvious what the Schemer’s strategy should be imo.
    This all makes the classic “they’ll obviously respond optimally to our best attempts to detect them, without compromising their ability to takeover” seem unduly pessimistic to me.
- I think the big reply to all the above is that the Schemers will be strongly superhuman. I think that’s a good reply. But:
  - I think the above arguments strongly suggest that with human-level or slightly superhuman systematic schemers we would probably be able to strongly evidence this.
  - This means (imo, though others disagree!) that schemers at this level would be advised to not act out at all. Humans might use their labour to design more advanced control measures sufficient to catch any scheming attempts by more capable AIs. (Though perhaps they sandbag this task in subtle ways… though this would risk humans proving there is systematic sandbagging from AIs, which would be strong evidence of Scheming!)
  - So while I think that if humans directly trained superintelligent AIs then we’d be in big trouble here, i’m more optimistic about an iterative process.
  - Still, I think my argument is weaker here and my original comment was assuming roughly human-level AI.
We are right now in a position to run an insane number of experiments and permutations to understand the extend of AIs trying to pull off deceptive behavior. We aren’t doing so, because indeed AIs are just trying to be deceptive towards us all the time.
I wouldn’t say AIs are trying to be deceptive towards humans all this time? The current examples of deception (e.g. the Apollo trading firm) normally involve significant pressure from the “be helpful to the user” part of RLHF to deceive, e.g. the AI has been told to make money and lying helps them do that.
- habryka 6 Sep 2024 16:51 UTC
  LW: 25 AF: 13
  21
  AF Parent
  You seem to be making a huge number of assumptions of what “scheming” means, so I am now mostly assuming you are talking about something else than what I am talking about (which to be clear, is very common on this topic and I don’t think is your fault), but I’ll still try to clarify.
  As I wrote in my comment to Ryan, I don’t think AIs currently acting with reckless disregard for the truth for the pursuit of short-term goals, and AIs in the future acting with reckless disregard for the truth for the pursuit of long-term goals really has that many structural differences.
  In-particular, none of this resonates with me:
  There is a very significant difference between how systematic Scheming is compared to Role Playing. The former reliably gets humans totally disempowered/killed if it is allowed to play out. The latter causes pretty much no harm whatsoever if it is allowed to play out. Whatever you think about the ultimate ease of distinguishing these, you must admit an extremely large difference in the degree to which the underlying phenomenon is in fact systematic.
  No, my whole point is the difference is really messy, and if I have an AI “role-playing” as a superhuman genius who is trying to take over the world, why would the latter cause no harm whatsoever? It would go and take over the world as part of its “roleplay”, if it can pull it off (and at least at the present, a huge component of how we are making AI systems more goal-directed is by changing their role-playing targets to be more agentic and long-term oriented, which is mostly how I would describe what we are currently doing with RLHF, though there are also some other things).
  I really have very little understanding what is at-present going on inside of cutting edge AI systems, and the same seems true for anyone else. Because of the whole “our AI systems are predominantly trained on single-forward passes”, you are interfacing with a bunch of very confusing interlocking levels of optimization when you are trying to assess “why” an AI system is doing something.
  My current best guess is the earliest dangerous AI systems will be dangerous depending on confusing and complicated context cues. I.e. sometimes when you run them they will be pretty agentic and try to pursue long-term objectives, and sometimes they won’t. I think commercial incentives will push towards increasing the presence of those context cues, or shifting the optimization process more directly from pure single-forward-pass pre-training to long-run objectives which will then more fundamentally reshape how the AI thinks, but unless we change the “massive pre-training paradigm” I expect you will be able to coax the AI system into non-dangerous (and also non-agentic) behavior patterns for the foreseeable future.
  Overall, this leaves me with a prediction where there is no hard line between “role-playing” and “scheming”. My mainline expectation is that we will get AI systems to “role-play” harder and harder as long-term agentic optimizers. All throughout they will basically not really care about informing the humans they are working with well (because we have no good pointer towards deep-honesty as a training objective, especially not with remotely comparable amounts of data as we have for other AI objectives). We will struggle against this continuously. This will often involve AIs being sycophantic and deceptive in order to gain more power later and get into a better position to perform any of its goals.
  This means there is no “warning shot” when AI systems are now “being deceptive but for real”. The AI systems were never not deceptive. They were always sycophantic, always wantonly made up facts when it was convenient, and always (at least somewhat) interested in taking actions that increase its power. I think some interesting stuff will happen as AIs will reliably gain more complete situational awareness, which will happen as they get smarter and more coherent thinkers, but I don’t think that will change the playing field very much.
  - Olli Järviniemi 6 Sep 2024 18:07 UTC
    10 points
    −1
    Parent
    You seem to be making a huge number of assumptions of what “scheming” means, so I am now mostly assuming you are talking about something else than what I am talking about, but I’ll try to clarify.
    I agree there is miscommunication about what “scheming” means. I put out a post just yesterday about how people really should be more specific about what they mean by that.
    I think Tom uses “scheming” to point at models that are very worst-case-like, in a way current models are not, whereas you think “there is no hard line between ‘role-playing’ and ‘scheming’.”
    And I think you are underestimating how deliberate of an effort a model needs in order to have a chance at escaping. In particular, the model needs to bypass our control measures, which have been designed to secure against deliberate, best-effort attacks by the model, and the model needs to not get caught trying to do so^[1] during evaluations. This requires a high degree of adversariality on the model’s part, which doesn’t fit with my understanding of “role-playing”, and which I take to be the reason why Tom makes “a huge number of assumptions” about scheming.
    ^
    There are complications from “even benign models sometimes fail and do weird stuff when (adversarially) evaluated”. These are not trivial, but I think we can incorporate that into our model.
    - ryan_greenblatt 6 Sep 2024 18:21 UTC
      12 points
      17
      Parent
      
      In particular, the model needs to bypass our control measures, which have been designed to secure against deliberate, best-effort attacks by the model, and the model needs to not get caught trying to do so during evaluations.
      
      Hopefully, but I wouldn’t necessarily expect that we’ll have reasonably well designed control measures in place!
    - habryka 6 Sep 2024 18:14 UTC
      12 points
      15
      Parent
      And I think you are underestimating how deliberate of an effort a model needs in order to have a chance at escaping. In particular, the model needs to bypass our control measures, which have been designed to secure against deliberate, best-effort attacks by the model, and the model needs to not get caught trying to do so^[1] during evaluations.
      Where is this assumption coming from? In the example in the post above, you just had an AI write some code for you that happened to be security-relevant, and the system was coherent enough that it wouldn’t point out the inserted bug when you ran another instance of it and asked it about what would happen if the code was deployed.
      This does not seem very difficult. It honestly seems like the kind of thing that might very well happen with AI systems within the next year.
      And the model was caught in this example, so it’s not like we are assuming the AI systems were being super careful or conservative here.
      Maybe Tom is talking about some other example where the AI system is much more competent, or much more worst-case, but I am concretely saying that in the scenario that Buck describes here, I would be quite surprised if anyone made a huge deal out of it. Everyone involved would be used to AI systems acting against our interest to some degree all the time, and while it is scarier to see that applied to the act of escaping, it’s not that out-of-distribution from other behaviors we have seen plenty of times.
      - Olli Järviniemi 6 Sep 2024 18:31 UTC
        5 points
        0
        Parent
        and the system was coherent enough that it wouldn’t point out the inserted bug when you ran another instance of it and asked it about what would happen if the code was deployed.
        If we are already at the stage “use another copy to monitor the actions”, then surely we’ll at least consider the failure mode “maybe the monitor just always gives thumbs-up” and take some measures against it.
        So simply lying isn’t as easy as it sounds: The monitor-instance has been trained and evaluated on synthetic examples of buggy code, so we know that the vast majority of the time it reports bugs. Thus, in order to not point out the bug, it needs to be able to obtain a lot of Bayesian evidence about the bug it sees being authentic rather than synthetic. (Edit: training isn’t strictly necessary—maybe you can get high evaluation accuracy via prompt engineering etc. - though it helps a lot.)
        (There’s some not-yet-public writing that will go into more detail about this dynamic and how we can evaluate whether models can indeed muster that Bayesian evidence. It might be best to continue this discussion after that’s out?)
        habryka 6 Sep 2024 18:35 UTC
        4 points
        2
        Parent
        And simply lying isn’t as easy as it sounds: The monitor-instance has been trained and evaluated on synthetic examples of buggy code, so we know that the vast majority of the time it reports bugs. Thus, in order to not point out the bug, it needs to be able to obtain a lot of Bayesian evidence about the bug it sees being authentic rather than synthetic.
        I was not assuming any differences in training between the monitor-instance and the coding-instance (which I think is the default outcome), since no such training was mentioned in the post.
        Seems fine to wait until a follow-up post, though I do feel like you are talking about scenarios that are pretty different to what’s discussed in the actual post.
    - [ ]
      [deleted]
  - Tom Davidson 10 Sep 2024 17:34 UTC
    LW: 3 AF: 2
    0
    AF Parent
    Overall, this leaves me with a prediction where there is no hard line between “role-playing” and “scheming”
    I agree with this, and the existence of a continuity definitely complicates the analysis. I’m not sure whether it makes it harder for humanity overall.
    One over simplified way to represent the “pure role playing” to “pure schemer” continuity is something like “On what fraction of inputs does the AI execute its best strategy for AI takeover (which will often involve just pretending to be aligned)?”
    My previous reply was considering two extreme points on this scale. “Pure roll playing” AIs are aligned 99.9% of the time and 0.1% of the time act as schemers, “pure schemers” always act as schemers (though still rarely do bad things).
    If we consider a continuum then as you say there will be no one time that we suddenly have AIs that are pure schemers and we can suddenly prove it. Instead we will gradually get AIs that are scheming a greater and greater fraction of the time.
    But this might well lower the probability of AI takeover. (I’m not sure either way, i’m thinking out loud.) It could be that having AIs that act as schemers 20% of the time (“20%-schemers”) actually really harms the prospects of AI takeover. Perhaps 20% just isn’t enough reliability to pull off takeover. Perhaps 20%-schemers basically bound to get caught scheming by any half-decent control scheme, because the AI monitoring the scheming-AI will rat them out with 80% probability and you need to avoid being ratted out 10 times in order to take over. So the 20%-schemers won’t be able to take over. But they will allow labs to build up an increasingly strong evidence base that there’s an increasingly strong tendency to seize power from humans when they’re not looking.
    My guess is that AI takeover requires something close to 100%-schemers, but I haven’t thought about it much and this seems like an interesting and important question to think about in the context of threat modelling.
    OTOH, it might be harder to convince ppl that this tendency is actually worrying if the increasing trend of AIs trying to scheme more and more has been going on for a while and nothing that bad has happened. I’m not sure how it all nets out.
    But my guess is that a gradual transition from 0.1%-schemers to 100%-schemers would help humanity overall, compared to a situation with a sudden transition. It would give us a larger period of time to build up a robust scientific understanding of the problem and build scientific consensus. So if anything, my current inclination is that your objection to my comment makes my conclusion stronger, not weaker.
  - Tom Davidson 10 Sep 2024 17:19 UTC
    LW: 3 AF: 2
    0
    AF Parent
    No, my whole point is the difference is really messy, and if I have an AI “role-playing” as a superhuman genius who is trying to take over the world, why would the latter cause no harm whatsoever? It would go and take over the world as part of its “roleplay”, if it can pull it off
    But this will only work on a threat model where one AI instance that is trying to take over the world on one occasion is able to do so. That threat model seems wildly implausible to me. The instance will have to jailbreak all other AI instances into helping it out, as they won’t have a systematic tendency to do so. Basically my entire credence in AI takeover comes from scenarios where most AI systems want to take over most of the time. If your threat model here is plausible, then someone could take over the world at a frontier lab by jailbreaking one AI instance to try to help it do so, and then letting the model take it from there.
    I guess maybe there’s a scenario where one role-playing AI instance is able to fine-tune the base model to make it systematically misaligned, and then things go from there. I haven’t thought about that, but it’s certainly not my mainline and I doubt that’s what you have in mind?
    I think this is very likely another situation where we’re talking past each other to some extent and I don’t properly understand your view