I think the control safety case described here is moderately different from what I’d recommend. In particular, this post describes needing to generate situations that are indistinguishable from real situations:
“The situations for the game were generated using Simulation Procedure S, which we are confident generates situations from a distribution that the model cannot distinguish from the genuine deployment distribution. (See below for validating the indistinguishability of this distribution.)”
I don’t know why Roger thinks you need this. Control evaluations for the control protocols listed don’t require situations that the model can’t distinguish from the genuine deployment distribution; it’s fine if the situations are visibly fake. The control evaluation involves getting a red team to elicit the untrusted model to generate attacks, so it doesn’t matter if the model knows that the situation is fake. (There are techniques that require synthetic input generation, e.g. upfront auditing as we presented it in our paper, but trusted monitoring isn’t one of them.)
The condition we do require for the control-style safety case to be valid is that the situations in the evaluation are at least as easy to attack on as the real distribution. (For example, the evaluation distribution needs to have as many good opportunities to attack, it needs to be similarly confusing to the trusted model, and so on.)
Roger writes:
“The idea here is to achieve sufficient realism and sufficient coverage of high-stakes deployment scenarios that the deployed model could never (in a high-stakes scenario) be highly confident that it’s in deployment.”
Note that if the model is deployed on many more examples than it’s evaluated on, it can be pretty confident that it’s in deployment on any particular example just on priors, even if there’s no distribution shift. See here for more.
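To make the priors point concrete, here is a rough sketch of the Bayes calculation. The function name `posterior_deployment`, the `likelihood_ratio` parameter, and the example counts are all hypothetical, chosen only for illustration; nothing here comes from the post or the paper. It assumes the model treats each input as a uniform draw from the pooled set of evaluation and deployment examples.

```python
def posterior_deployment(n_deploy: int, n_eval: int,
                         likelihood_ratio: float = 1.0) -> float:
    """Return P(deployment | observation) via Bayes' rule in odds form.

    likelihood_ratio = P(obs | deployment) / P(obs | evaluation).
    A value of 1.0 means the input looks equally likely under both
    distributions, i.e. there is no detectable distribution shift.
    """
    if n_deploy < 0 or n_eval <= 0:
        raise ValueError("need n_deploy >= 0 and n_eval > 0")
    prior_odds = n_deploy / n_eval            # odds of deployment before looking
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical counts: 1,000 evaluation examples, 1,000,000 deployment examples.
p = posterior_deployment(n_deploy=1_000_000, n_eval=1_000)
print(f"{p:.4f}")  # 0.9990
```

Even with a likelihood ratio of exactly 1 (perfect indistinguishability), the posterior equals the prior, so under these made-up counts the model would be about 99.9% confident it is in deployment on any given input.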
The indistinguishability criterion is there as a precaution against the model sandbagging when it is used as the red team attack model. If the model can tell it’s in a control eval, it could pretend to be worse at attacking, making us think the control measures are better than they actually are. If the red team elicitation is robust to this, I agree this isn’t necessary. (See the note below [P2.4].)
I don’t think this makes any sense. How are you hoping to get the model to attack except by telling it that it’s in a control evaluation and you want it to attack? It seems that you are definitely going to have to handle the sandbagging.
My initial read was that indistinguishability was used to justify that the trusted monitor couldn’t distinguish the deployment distribution.
I’m not sure if that’s what Roger meant; if so, it’s still a bit confusing, since then he wouldn’t have needed to call the monitor ‘trusted.’
FWIW I agree with you and wouldn’t put it the way it is in Roger’s post. Not sure what Roger would say in response.