But it is unlikely that any lab will be able to make safety cases via alignment for the first AIs with dangerous capabilities.
Hmm, I think the situation is a bit more complex than this makes it sound. In particular, we might be able to weakly update against scheming based on a variety of empirical evidence even without being able to make a strong affirmative case.
As a central example of making a weak argument via alignment, consider doing a bunch of model-organisms-style experiments. If these fail to turn up any compelling demonstrations of scheming, that could plausibly update you against scheming. (Of course, this could also end up being a risk case rather than a safety case if you do find compelling evidence for scheming!)
Another intervention like this would be to offer the model a deal in which you compensate it in exchange for giving you evidence of its own misalignment.
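To make the “weak update” framing concrete, here is a minimal Bayesian sketch of how a couple of pieces of weak evidence like this might combine. All of the numbers (the prior and the likelihood ratios) are made-up assumptions for illustration, not estimates from this post.

```python
# Minimal sketch of combining weak evidence against scheming via Bayes' rule.
# All numbers are illustrative assumptions, not estimates from the post.

def update_p_scheming(prior_p, likelihood_ratios):
    """Update P(scheming) given likelihood ratios
    P(evidence | scheming) / P(evidence | not scheming),
    treating the pieces of evidence as roughly independent."""
    odds = prior_p / (1 - prior_p)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

prior = 0.25  # hypothetical prior on scheming

likelihood_ratios = [
    # Model-organisms experiments turn up no compelling demonstrations of
    # scheming: somewhat more likely if scheming isn't happening, so LR < 1.
    0.7,
    # The model is offered a compensation deal and produces no evidence of its
    # own misalignment: a weaker update still.
    0.9,
]

posterior = update_p_scheming(prior, likelihood_ratios)
print(f"P(scheming): {prior:.2f} -> {posterior:.2f}")  # roughly 0.25 -> 0.17
```

Each likelihood ratio below 1 nudges the posterior down a little, but none of them alone gets anywhere near a strong affirmative case.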
It’s also somewhat unclear exactly what should be considered part of a control safety case vs an alignment safety case.
For instance, consider an intervention like:
Perform extensive (plausible-to-the-model) sting operations (aka red-teaming focused on scheming). Do this at various points throughout post-training to reduce the risk that the model has learned to be more conservative due to other red-teaming efforts. Also, ideally ensure that all data which discusses this sort of sting operation is removed from the initial training data. (You can potentially train the AI on this data at the end, after running the sting operations.)
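Here is a rough sketch of what that schedule could look like. Every function and data structure below (filter_sting_docs, post_train_step, run_sting_ops, and so on) is a hypothetical placeholder, not a real training API.

```python
# Illustrative sketch of the sting-operation schedule described above. Every
# function here is a hypothetical placeholder, not a real training API.

from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    mentions_sting_ops: bool = False

def filter_sting_docs(data):
    """Hold out documents that discuss this sort of sting operation."""
    held_out = [d for d in data if d.mentions_sting_ops]
    clean = [d for d in data if not d.mentions_sting_ops]
    return clean, held_out

def post_train_step(model, data, stage):
    # Placeholder for one stage of post-training on `data`.
    return model

def run_sting_ops(model, scenarios, stage):
    # Placeholder: present plausible-to-the-model sting scenarios and record
    # whether the model takes the bait (here nothing is ever flagged).
    return [{"stage": stage, "scenario": s, "looks_like_scheming": False}
            for s in scenarios]

def train_with_sting_ops(model, data, scenarios, n_stages=4):
    # Remove sting-related docs from the initial training data so the model
    # can't simply learn to expect these operations.
    clean_data, held_out = filter_sting_docs(data)

    observations = []
    for stage in range(n_stages):
        model = post_train_step(model, clean_data, stage)
        # Run sting ops at several points throughout post-training, so a model
        # that only became more conservative partway through is still tested.
        observations += run_sting_ops(model, scenarios, stage)

    # Optionally train on the held-out sting-related data only at the very end,
    # after the sting operations have been run.
    model = post_train_step(model, held_out, stage="final")
    return model, observations
```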
The point of this intervention is to detect whether your model is scheming. You could think about this from a control-type perspective (a scheming model would have had some chance of failing to evade our sting operations, so finding nothing gives us some (possibly small) update towards safety) or an alignment perspective (we now have evidence for alignment).
I personally like thinking about this from a control perspective.
And, we need to make sure that we don’t double count interventions like this in both control and alignment. (See here for more discussion of double counting and related matters.)
That said, if we’re going to think about interventions like this sting operation from a control perspective, we should probably adopt a slightly more expansive notion of control:
Ensure that even if your models are scheming, you’ll be safe, because the models have a sufficiently high chance of failing to subvert your safety measures.
Framing it as “is it sufficiently likely that it fails to subvert” makes it clearer that we’re allowing part of the control case to include interventions which would probably only catch schemers if they make serious errors (errors which, from some perspective, they were “capable” of avoiding). This means you need some distribution over aspects of competence which are hard to measure (e.g., how good a plan will the schemer follow?).
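One way to picture this is a small Monte Carlo estimate of P(catastrophe | scheming) that averages over a distribution of the schemer’s hard-to-measure competence. The safety layers, the distributions, and all the numbers below are illustrative assumptions only.

```python
# Small Monte Carlo sketch of the "expansive" control estimate: average the
# chance that a schemer subverts every safety measure over a distribution of
# hard-to-measure competence. The layers, distributions, and numbers are all
# made up for illustration.

import random

def p_catastrophe_given_scheming(n_samples=100_000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Hard-to-measure aspects of schemer competence, e.g. how good a plan
        # it follows and how reliably it avoids serious errors.
        plan_quality = rng.betavariate(2, 2)     # 0 = terrible plan, 1 = near-optimal
        error_avoidance = rng.betavariate(2, 3)  # tendency to avoid serious slip-ups

        # Chance of evading each (hypothetical) layer, increasing in competence.
        p_evade_sting_ops = 0.3 + 0.6 * error_avoidance
        p_evade_monitoring = 0.2 + 0.7 * plan_quality

        # Catastrophe requires subverting every layer.
        total += p_evade_sting_ops * p_evade_monitoring
    return total / n_samples

print(f"P(catastrophe | scheming) ~= {p_catastrophe_given_scheming():.2f}")
```

The point is not the particular number but that the estimate is an average over competence you can’t directly measure, including worlds where the schemer makes errors it was in some sense capable of avoiding.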