Run dangerous capability evals at least once for every 2x increase in effective training compute. This involves fine-tuning for dangerous capabilities, then doing evals on pre-mitigation and post-mitigation versions of the fine-tuned model. Score the models as Low, Medium, High, or Critical in each of several categories.
Initial categories: cybersecurity, CBRN (chemical, biological, radiological, nuclear threats), persuasion, and model autonomy.
If the post-mitigation model scores High in any category, don’t deploy it until implementing mitigations such that it drops to Medium.
If the post-mitigation model scores Critical in any category, stop developing it until implementing mitigations such that it drops to High.
If the pre-mitigation model scores High in any category, harden security to prevent exfiltration of model weights. (Details basically unspecified for now.)
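To make the threshold logic concrete, here is a minimal sketch in Python. The ordinal `Risk` scale, the category names, and the `gate` function are illustrative stand-ins for the rules summarized above, not anyone's actual implementation.

```python
from enum import IntEnum

class Risk(IntEnum):
    """Ordinal scores used on the scorecard."""
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3

# Initial tracked categories (names abbreviated here).
CATEGORIES = ["cybersecurity", "CBRN", "persuasion", "model_autonomy"]

def gate(pre_mitigation: dict[str, Risk], post_mitigation: dict[str, Risk]) -> dict[str, bool]:
    """Return which gates the current scorecard trips."""
    worst_pre = max(pre_mitigation.values())
    worst_post = max(post_mitigation.values())
    return {
        # Post-mitigation High in any category: hold deployment until it drops to Medium.
        "block_deployment": worst_post >= Risk.HIGH,
        # Post-mitigation Critical in any category: pause development until it drops to High.
        "block_development": worst_post >= Risk.CRITICAL,
        # Pre-mitigation High in any category: harden security against weight exfiltration.
        "harden_security": worst_pre >= Risk.HIGH,
    }

# Example: pre-mitigation High on CBRN, mitigated down to Medium.
pre = {c: Risk.LOW for c in CATEGORIES} | {"CBRN": Risk.HIGH}
post = {c: Risk.LOW for c in CATEGORIES} | {"CBRN": Risk.MEDIUM}
print(gate(pre, post))
# -> {'block_deployment': False, 'block_development': False, 'harden_security': True}
```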
This outlines a very iterative procedure. If models started hitting the thresholds, and this logic was applied repeatedly, the effect could be to sweep the problems under the rug. At levels of ability sufficient to trigger those thresholds, I would be worried about staying on the bleeding edge of danger, tweaking the model until problems don’t show up in evals.
I guess this strategy is intended to be complemented by the superalignment effort, and not to be pushed on indefinitely as the main alignment strategy.
Some personal takes in response:
Yeah, largely the letter of the law isn’t sufficient.
Some evals are hard to goodhart. E.g. “can red-teamers demonstrate problems (given our mitigations)” is pretty robust — if red-teamers can’t demonstrate problems, that’s good evidence of safety (for deployment with those mitigations), even if the mitigations feel jury-rigged.
Yeah, this is intended to be complemented by superalignment.
Some mitigation strategies are more thorough than others. For example, if they hit a High on bioweapons and responded by filtering a lot of the relevant biological literature out of the pretraining set until it dropped to Medium, I wouldn’t be particularly concerned about that information somehow sneaking back in as the model later got stronger. Though if you’re trying to censor specific knowledge, that might become more challenging for much more capable models that could have some ability to recreate the missing bits by logical inference from nearby material that wasn’t censored out. This is probably more of a concern for theory than for practical knowledge.
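As a toy illustration of that kind of dataset-level mitigation (the keyword heuristic below is made up; a real filtering pipeline would presumably use trained classifiers, expert-curated term lists, and human review rather than naive matching):

```python
# Placeholder keyword list, purely for illustration.
HAZARD_TERMS = ("placeholder hazardous phrase 1", "placeholder hazardous phrase 2")

def looks_hazardous(doc: str) -> bool:
    """Crude stand-in for a hazard classifier."""
    text = doc.lower()
    return any(term in text for term in HAZARD_TERMS)

def filter_pretraining_corpus(documents: list[str]) -> list[str]:
    """Drop flagged documents so the information never enters the pretraining set at all."""
    kept = [doc for doc in documents if not looks_hazardous(doc)]
    print(f"Removed {len(documents) - len(kept)} of {len(documents)} documents.")
    return kept
```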