I appreciate this distinction between the different types of outcomes—nice.
I’m pretty interested in people’s forecasts for each. So the revised questions would be:
Evals are trustworthy but say that the model is no longer safe (or rather something like “there is a non-trivial probability that the model may be dangerous/scheming”). Using your best judgment, at what point do you expect sabotage evaluations will reveal that a model might not be safe (i.e., we’re not confident that it’s not scheming)? At what capability threshold do you expect we would no longer be able to say “yup, we are confident that the model is not scheming”? (Feel free to use heuristics like the OpenAI PF capability thresholds or ASL levels.)
Evals are no longer trustworthy and we notice. Using your best judgment, at what point do you expect sabotage evaluations will no longer be useful evals for determining whether or not a model is scheming? At what capability threshold do you expect we would no longer be able to say “yup, we are confident that these evals are trustworthy/useful”?
Evals are no longer trustworthy and we do not notice. Using your best judgment, how likely is it that sabotage evaluations are undermined such that humans think the model can be deployed safely, but actually the humans were misled? At what capability threshold would you worry about this form of undermining?
@David Duvenaud, I would be curious about your takes, even if they’re just speculative guesses/predictions.
(And @ryan_greenblatt, I would be curious about your takes as well, but replace “sabotage evaluations” with “control evaluations.” If you’ve written this up elsewhere, feel free to link it; it’s helpful for me if you quote the particular section.)
Good questions. For all of them, I’m afraid I have to say “it depends on what mitigations we deploy them under.” If there’s a crisis at one of the labs, I expect it to look like: “In order to be confident that even a misaligned model can’t sabotage us, we need to deploy it with expensive mitigations, unacceptable levels of oversight, or mitigations that depend on how users incorporate AI output into their decision-making processes, which we can’t measure or control. Our competitors are going to deploy something similar nonetheless—what do we do?”
That is, I think there is a lot of potential for labs to come to incorrect conclusions, or to be unable to become confident, about the potential for sabotage. This issue can mostly be addressed by sufficiently expensive mitigations, but that introduces a safety tax that might be too expensive to pay.
That’s one reason why these kinds of evals are valuable—they can, in principle, reduce the safety tax by helping us find the least-expensive mitigations. They also incentivize red-teaming, which can produce scary demos to help coordinate regulation. They also force labs to think through the implications of how models are deployed and how much they’re trusted.
Relatedly, one problem with caring about whether or not “actually the humans were misled” is that being misled is a matter of degree, and I think in most situations, human decision-makers are going to be misled about at least some important aspects of whatever situation they’re in, even after extensive consultation with an LLM. We can try to refine this criterion to “misled about something that would make them make the wrong decision,” but then we’re basically just asking the model to convince the human to make the right decision according to the LLM. In that case, we’ve lost a lot of the opportunities for corrigibility that explanations would provide.
I think there’s probably still a useful middle ground, but I expect practical uses of LLMs to look pretty paternalistic for most users most of the time, in a way that will be hard to distinguish from the user being misled.