Otherwise, we could easily release a future model that is actually High in (say) Cybersecurity or Model Autonomy, or much stronger at assisting with AI R&D, perhaps after only modest adjustments, without realizing we had done so. That could be a large or even fatal mistake, especially if circumstances did not allow the mistake to be taken back. We need to fix this.
[..]
This is a lower bound, not an upper bound. But what you need, when determining whether a model is safe, is an upper bound! So what do we do?
Part of the problem is the classic problem with model evaluations: elicitation efforts, by default, only ever provide existence proofs and rarely if ever provide completeness proofs. A prompt that causes the model to achieve a task provides strong evidence of model capability, but the space of reasonable prompts is far too vast to search exhaustively to truly demonstrate model incapability. Model incapability arguments generally rely on an implicit “we’ve tried as hard at elicitation as would be feasible post-deployment,” but this is almost certainly not going to be the case, given the scale of pre-deployment evaluations versus post-deployment use cases.
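To make the asymmetry concrete, here is a minimal sketch (all names and interfaces hypothetical, not any lab’s actual harness) of what a best-of-k elicitation loop actually establishes:

```python
# Hypothetical sketch: why best-of-k elicitation only yields a lower bound.
# One success is an existence proof of capability; k failures cover a tiny
# slice of the prompt space that end-users will eventually explore.
from typing import Callable, Iterable


def elicit(
    run_attempt: Callable[[str], bool],   # sends one prompt, returns True on task success
    prompts: Iterable[str],               # the prompt variants the eval team thought to try
    budget_k: int,                        # pre-deployment attempt budget
) -> bool:
    tried = 0
    for prompt in prompts:
        if tried >= budget_k:
            break
        tried += 1
        if run_attempt(prompt):
            return True   # existence proof: the capability was demonstrated
    # Returning False only proves these particular k prompts failed, not that
    # the model lacks the capability under some prompt nobody tried.
    return False
```

A True result here is strong evidence of capability; a False result tells you almost nothing about incapability, which is exactly the gap the argument above is pointing at.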
The way you get a reasonable upper bound pre-deployment is by giving pre-deployment evaluators some advantage over end-users, for example by using a model that’s not refusal-trained or by allowing small amounts of fine-tuning. OpenAI did do this in their original Preparedness team bio evals; specifically, they provided experts with non-refusal fine-tuned models. But it’s quite rare to see substantial advantages given to pre-deployment evaluators, for a variety of practical and economic reasons, and in-house usage likely predates pre-deployment capability/safety evaluations anyway.
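As a rough illustration of what that evaluator advantage buys you (again a hypothetical sketch, not OpenAI’s actual Preparedness pipeline), the capability estimate you want for threshold decisions is the best score across the strongest access levels you can grant, not the score of the refusal-trained model under plain prompting:

```python
# Hypothetical sketch: estimate capability as the max over access levels,
# e.g. {"refusal-trained, prompt-only": ..., "helpful-only": ..., "lightly fine-tuned": ...}.
from typing import Callable, Dict, List


def pass_rate(run_task: Callable[[str], bool], tasks: List[str]) -> float:
    return sum(run_task(t) for t in tasks) / len(tasks)


def capability_estimate(
    access_levels: Dict[str, Callable[[str], bool]],  # name -> task runner under that access level
    tasks: List[str],
) -> float:
    # Scoring only the refusal-trained, prompt-only setting understates what
    # determined end-users (or leaked weights) could extract; taking the max
    # over stronger access levels pushes the estimate toward an upper bound.
    return max(pass_rate(run, tasks) for run in access_levels.values())
```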