I would say that the “internal use restrictions” are just an ad hoc control argument, and I’d prefer to explicitly think about them that way. But it is worth noting that if you are sufficiently conservative in the deployment (as Evan describes), then I think it would be fine to make a very minimal safety case.
Here is an example of what this could look like:
- The model isn’t quite superhuman in any important way. Supporting claims:
  - We’re confident the model isn’t wildly more powerful than it appears. Evidence: [you have to fill out]
  - The model appears not to be superhuman in any important way. Evidence: We ran some capability evals.
- Our deployment is so conservative that, as long as the model isn’t quite superhuman, it couldn’t do anything even if it wanted to. Supporting claims:
  - We thought about it some, and the deployment seems conservative enough that it’s unlikely we’re wrong, even though we didn’t spend that long on threat modeling.
  - We’ve actually implemented the relevant infrastructure (sandboxing, etc.), and it is used in all the relevant places (including in RL and in evaluations of the model).
  - We’ll enforce this level of conservativeness robustly enough that we’re confident employee error won’t be an issue. Evidence: [you have to fill out]
Note that this safety case doesn’t require running any control evals, and it probably requires no ML work beyond evals you were likely already doing, plus supporting evidence that the model isn’t wildly more capable than it seems. It probably does require some non-trivial infrastructure to set up and enforce a deployment this conservative.
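To make the “non-trivial infra” point a bit more concrete, here is a minimal sketch of one small piece of that kind of enforcement: a wrapper that only executes model-proposed actions drawn from a fixed allowlist, with a hard timeout. The `ALLOWED_COMMANDS` set and the `run_model_action` helper are illustrative assumptions on my part, not anything from the safety case above; a real deployment would also need OS-level sandboxing (containers, no network, etc.) and would have to apply the same checks in RL and in evals, not just in interactive use.

```python
import shlex
import subprocess

# Hypothetical allowlist: the only actions this deployment permits the model
# to take. Anything not listed is rejected, regardless of what the model
# (or an employee wiring it up) asks for.
ALLOWED_COMMANDS = {"ls", "cat", "grep", "python3"}


def run_model_action(command: str, timeout_s: int = 30) -> str:
    """Run a model-proposed shell command under conservative restrictions.

    Sketch only: real enforcement would also isolate the process at the OS
    level and log every allowed/denied action for review.
    """
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        raise PermissionError(f"action not in allowlist: {command!r}")
    result = subprocess.run(
        argv,
        capture_output=True,
        text=True,
        timeout=timeout_s,  # hard cap so a runaway action can't stall things
        check=False,
    )
    return result.stdout
```

The point of routing everything through one wrapper like this is that the conservativeness is enforced in code rather than by asking employees to remember the rules, which is what the “employee error won’t be an issue” claim is leaning on.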