Thanks for writing this up. Some miscellaneous thoughts:
My guess is that the safety property you’ll want to ensure here is something like “the model is unlikely to succeed at a problematically large quantity of sabotage without us first getting a big (20:1) update that it’s engaging in intentional sabotage”.
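To spell out the arithmetic behind a “20:1 update” (the prior below is purely illustrative, not a number from the post): if the evidence we collect (call it $e$) has a likelihood ratio of 20 in favor of intentional sabotage, then

$$\frac{P(\text{sabotage}\mid e)}{P(\text{benign}\mid e)} \;=\; \underbrace{\frac{P(e\mid \text{sabotage})}{P(e\mid \text{benign})}}_{=\,20} \times \underbrace{\frac{P(\text{sabotage})}{P(\text{benign})}}_{\text{e.g. } 1{:}100} \;=\; \frac{20}{100} \;=\; \frac{1}{5},$$

i.e. a prior of roughly 1% becomes a posterior of about 17%, which is the kind of shift that should arrive before a problematically large quantity of sabotage succeeds.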
I think your internal usage restrictions are wildly unaffordable; it seems extremely unlikely that a sensible AI lab would agree not to deploy this model as a coding agent, or not to use it for day-to-day tasks. My guess is that companies should treat the untrusted model similarly to how they treat external contractors (or how they treat employees at organizations that take insider threat seriously).
We’re doing some research related to control measures for this kind of threat model, some of which should be out soon.
> I think your internal usage restrictions are wildly unaffordable
Sure, but note that Evan says “Lifting those internal usage restrictions requires an affirmative safety case (as specified below) for safe internal usage.”
I don’t think it is (clearly) wildly unaffordable to require an affirmative safety case prior to proceeding with (substantially) using the model internally? (Note that control could be part of such a case.)
I think Evan is just trying to say “using the model some (but not for XYZ) wouldn’t (shouldn’t?) even require making an affirmative safety case (at least for the first point at which this triggers)”.
I think this can be consistent with treating models like external contractors.
(Though I’d note that “treat the model like external contractors” is maybe too weak and I’d replace this with something like “treat the model like external contractors who have a 25% chance of all being spies of a foreign adversary”.)
I do think that making an affirmative safety case for a “very low” level of total lifetime takeover risk (perhaps 0.25%) is very likely to be totally unaffordable at some point (at least in short timelines and with my priors).
Oh yeah sorry, I misunderstood that.
I would say that the “internal use restrictions” are just an ad hoc control argument and I’d prefer to explicitly think about it like that. But, it is worth noting that if you are sufficiently conservative in the deployment (as Evan describes), then I think it would be fine to make a very minimal safety case.
Here is an example of what this could look like (with a schematic version after the list):

1. The model isn’t quite superhuman in any important way. Supporting claims:
    - We’re confident the model isn’t wildly more powerful than it appears. Evidence: [you have to fill out]
    - The model appears not to be superhuman in any important way. Evidence: we ran some capability evals.
2. Our deployment is so conservative that, as long as the model isn’t quite superhuman, it couldn’t do any serious harm even if it wanted to. Supporting claims:
    - We thought about it some, and the deployment seems so conservative that it’s unlikely we’re wrong even though we didn’t spend that long on threat modeling.
    - We’ve actually implemented the relevant infrastructure (for sandboxing etc.) and it is used in all the relevant places (including in RL and in evaluations of the model).
    - We’ll enforce this level of conservativeness robustly enough that we’re confident employee error won’t be an issue. Evidence: [you have to fill out]
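To make the claim tree above concrete, here is a minimal sketch of how it could be written down as data. The `Claim` dataclass, its field names, and the evidence strings are hypothetical placeholders of mine, not an existing tool or anything from Evan’s post:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Claim:
    """One node of the safety case: a claim, optional evidence, and sub-claims."""
    statement: str
    evidence: Optional[str] = None  # None marks a "[you have to fill out]" gap
    subclaims: List["Claim"] = field(default_factory=list)

    def unfilled(self) -> List[str]:
        """Return the leaf claims that still lack supporting evidence."""
        gaps = [self.statement] if self.evidence is None and not self.subclaims else []
        for sub in self.subclaims:
            gaps.extend(sub.unfilled())
        return gaps


minimal_case = Claim(
    "Minimal safety case for a very conservative internal deployment",
    subclaims=[
        Claim(
            "The model isn't quite superhuman in any important way",
            subclaims=[
                Claim("The model isn't wildly more powerful than it appears"),  # evidence still to fill out
                Claim("The model appears not to be superhuman in any important way",
                      evidence="capability evals"),
            ],
        ),
        Claim(
            "The deployment is conservative enough that a non-superhuman model can't do serious harm",
            subclaims=[
                Claim("The restrictions look conservative enough that we're unlikely to be wrong",
                      evidence="informal threat modeling"),
                Claim("Sandboxing etc. is implemented and used everywhere, including in RL and evals",
                      evidence="infrastructure review"),
                Claim("The restrictions are enforced robustly enough that employee error isn't an issue"),  # still to fill out
            ],
        ),
    ],
)

print(minimal_case.unfilled())
```

Running it just prints the two claims whose evidence the sketch above leaves as “[you have to fill out]”.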
Note that this safety case doesn’t require running any control evals, and it probably requires no ML work other than evals you were likely already doing plus supporting evidence that the model isn’t wildly more capable than it seems. It does, however, probably require some non-trivial infrastructure to set up and enforce a deployment this conservative.
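And on the “non-trivial infra” point, here is one hedged sketch of what the enforced restrictions might look like if written down as a policy config. Every field and value is a hypothetical illustration; nothing here names real infrastructure:

```python
# Hypothetical policy for a very conservative internal deployment.
# The keys and values are illustrative; the point is only that the restrictions
# are concrete enough to be enforced mechanically rather than by convention.
CONSERVATIVE_DEPLOYMENT_POLICY = {
    "network_access": "none",                     # no internet or internal-service access from the sandbox
    "filesystem": "ephemeral, per-task",          # no persistent state carried between tasks
    "code_execution": "inside the sandbox only",  # model-written code never runs on production systems
    "outputs": "human-reviewed before anything leaves the sandbox",
    "applies_to": ["inference", "RL rollouts", "capability evaluations"],
    "overrides": "blocked without an explicit sign-off process",  # guards against casual employee exceptions
}
```

Whether a lab’s actual infrastructure can enforce something like this everywhere, including in RL and evals, is exactly the part of the case that still has to be filled out.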