Lukas Finnveden comments on Sabotage Evaluations for Frontier Models

Lukas Finnveden 21 Oct 2024 4:21 UTC
5 points
1
There’s at least two different senses in which “control” can “fail” for a powerful system:
- Control evaluations can indicate that there’s no way to deploy the system such that you both (i) get a lot of use out of it, and (ii) can get a low probability of catastrophe.
- Control evaluations are undermined such that humans think that the model can be deployed safely, but actually the humans were misled and there’s a high probability of catastrophe.
My impression is that Ryan & Buck typically talks about the first case. (E.g. in the link above.) I.e.: My guess would be that they’re not saying that well-designed control evaluations become untrustworthy — just that they’ll stop promising you safety.
But to be clear: In this question, you’re asking about something more analogous to the second case, right? (Sabotage/sandbagging evaluations being misleading about models’ actual capabilities at sabotage & sandbagging?)
My question posed in other words: Would you count “evaluations clearly say that models can sabotage & sandbag” as success or failure?
- ryan_greenblatt 21 Oct 2024 4:55 UTC
  6 points
  2
  Parent
  My guess would be that they’re not saying that well-designed control evaluations become untrustworthy
  
  It’s a bit messy because we have some ability to check whether we should be able to evaluate things.
  
  So, there are really three relevant “failure” states for well done control:
  1. We can’t find countermeasures such that our control evaluations indicate any real safety.
  2. We no longer think that control evaluations work to evaluate safety as models greatly exceed our planning capabilites and/or can sandbag sufficiently well.
  3. We think control evaluations can work well, but we’re wrong and they actually don’t.
  I think (1) or (2) will likely happen prior to (3) if you do a good job.
  
  We discuss this more here.
  
  (Really, the actual system is more complex because we can vary how conservative control evaluations are and it is easier to be confident in the results of more conservative evaluations. E.g., we can be conservative about our AIs planning ability. So, there is really a somewhat more continuous tradeoff than the above list makes it seem.)
  - David Duvenaud 21 Oct 2024 20:27 UTC
    3 points
    2
    Parent
    I basically agree with everything @Lukas Finnveden and @ryan_greenblatt said above. Another relevant “failure” state is that we might only be able to come up with expensive or capability-reducing mitigations, meaning everyone will have an incentive to disable or work around them.
  - Akash 22 Oct 2024 15:44 UTC
    2 points
    0
    Parent
    @ryan_greenblatt can you say more about what you mean by this one?
    We can’t find countermeasures such that our control evaluations indicate any real safety.
    - ryan_greenblatt 22 Oct 2024 17:11 UTC
      3 points
      0
      Parent
      Our evaluations indicate that we have very poor levels of safety (e.g. the AI would probably be able to escape if it wanted to) and we can’t find countermeasures which suffice to ensure any level of safety (without basically giving up on using this model).
- Akash 22 Oct 2024 15:43 UTC
  2 points
  0
  Parent
  I appreciate this distinction between the different types of outcomes—nice.
  I’m pretty interested in peoples’ forecasts for each. So revised questions would be:
  1. Evals are trustworthy but say that the model is no longer safe (or rather something like “there is a non-trivial probability that the model may be dangerous/scheming.”). Using your best judgment, at what point do you expect sabotage evaluations will reveal that a model might not be safe (we’re not confident that it’s not scheming)? At what capability threshold do you expect we would no longer be able to say “yup, we are confident that the model is not scheming”? (Feel free to use heuristics like the OpenAI PF capability thresholds or ASL levels).
  2. Evals are no longer trustworthy and we notice. Using your best judgment, at what point do you expect sabotage evaluations will no longer be useful evals for determining whether or not a model is scheming? At what capability threshold do you expect we would no longer be able to say “yup, we are confident that these evals are trustworthy/useful”?
  3. Evals are no longer trustworthy and we do not notice. Using your best judgment, how likely is it that sabotage evaluations are undermined such that humans think the model can be deployed safely, but actually the humans were misled? At what capability threshold would you worry about this form of undermining?
  @David Duvenaud would be curious about your takes, even if they’re just speculative guesses/predictions.
  (And @ryan_greenblatt would be curious for your takes but replace “sabotage evaluations” with “control evaluations.” In the event that you’ve written this up elsewhere, feel free to link it– helpful for me if you quote the particular section.)
  - David Duvenaud 11 Nov 2024 17:40 UTC
    3 points
    0
    Parent
    Good questions. For all of them, I’m afraid I have to say “It depends under what mitigations we deploy them.”. If there’s a crisis at one of the labs, I expect it to look like: “In order to be confident that even a misaligned model can’t sabotage us, we need to deploy it with expensive mitigations, unacceptable levels of oversight, or mitigations that depend on how the users incorporate AI output into their decision-making processes, which we can’t measure or control. Our competitors are going to deploy something similar nonetheless—what do we do?”
    
    I.e. I think there is lots of potential for labs to come to incorrect conclusions or be unable to be confident about the potential for sabotage. This issue can be mostly addressed by expensive-enough mitigations, but this introduces a safety tax that might be too expensive to pay.
    
    That’s one reason why these kinds of evals are valuable—they can, in principle, reduce the safety tax by helping us find the least-expensive mitigations. They also incentivize red-teaming, which can produce scary demos to help coordinate regulation. They also force labs to think through the implications of how models are deployed and how much they’re trusted.
    
    Relatedly, one problem with caring about whether or not “actually the humans were misled” is that that’s a matter of degree, and I think in most situations, human decision-makers are going to be misled about at least some important aspects of whatever situation they’re in, even after extensive consultation with an LLM. We can try to refine this criterion to “misled about something that would make them make the wrong decision” but then we’re basically just asking the model to convince the human to make the right decision according to the LLM. In that case, we’ve lost a lot of the opportunities for corrigibility that explanations would provide.
    
    I think there’s probably still a useful middle ground, but I think practical uses of LLMs will look pretty paternalistic for most users most of the time, in a way that will be hard to distinguish from “is the user misled”.