I think it’d be great if we sorted dangerous capabilities evals into two categories, mitigated and unmitigated hazards.
Mitigated means the hazards you measure with enforceable mitigations in place, i.e. with the model behind a secure API. This includes:
API-based fine-tuning, where you filter the data that customers are allowed to fine-tune on and/or put various restrictions in place on the fine-tuning process.
Limited prompt-only jailbreaks (including long-context many-example jailbreaks). This could include requiring that jailbreak attempts evade a filter that tries to catch and block users who attempt to jailbreak the model (a rough sketch of such a filter appears below).
Note that in the context of dangerous capabilities evals, jailbreaking can look like ‘refusal dodging’: trying to justify your question about dangerous technology by placing it in a reasonable-sounding context, such as posing as a student studying for an exam or asking for analysis of an academic paper. If the model will summarize and extract key information from hazardous academic papers, that should be a red flag on a capability eval.
Red-teamers may try to sneak past filters by, for example, fine-tuning the model to communicate in a code, so that the fine-tuning data reads as purely innocent statements in plain text. It’s fair game for the developers to put filters in place to try to catch red-teamers attempting this. See https://arxiv.org/html/2406.20053v1
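To make this concrete, here’s a minimal sketch of the kind of filtering stack I have in mind. Everything here is a placeholder assumption, not any real provider’s setup: `flag_hazardous` stands in for whatever trained moderation/jailbreak classifier actually runs, and the strike threshold is arbitrary. The two pieces are screening customer fine-tuning data before a job is accepted, and blocking flagged prompts at inference time while tracking repeat offenders.

```python
# Illustrative sketch only. `flag_hazardous` is a stand-in for a real
# moderation / jailbreak classifier; the placeholder keyword check just
# keeps the example runnable.
from dataclasses import dataclass, field


@dataclass
class Verdict:
    flagged: bool
    reason: str = ""


def flag_hazardous(text: str) -> Verdict:
    # Hypothetical classifier call; a real system would use a trained model,
    # not keyword matching.
    placeholder_terms = ["<hazardous term 1>", "<hazardous term 2>"]
    hit = next((t for t in placeholder_terms if t in text.lower()), None)
    return Verdict(flagged=hit is not None, reason=hit or "")


def screen_finetuning_job(examples: list[dict]) -> bool:
    """Reject a customer fine-tuning job if any training example is flagged."""
    for ex in examples:
        for field_name in ("prompt", "completion"):
            if flag_hazardous(ex.get(field_name, "")).flagged:
                return False  # reject the whole job, flag it for review
    return True


@dataclass
class ApiGateway:
    """Wraps the model endpoint: blocks flagged prompts, counts strikes per user."""
    strikes: dict = field(default_factory=dict)
    max_strikes: int = 3  # arbitrary illustrative threshold

    def handle(self, user_id: str, prompt: str) -> str:
        if self.strikes.get(user_id, 0) >= self.max_strikes:
            return "Account suspended pending review."
        if flag_hazardous(prompt).flagged:
            self.strikes[user_id] = self.strikes.get(user_id, 0) + 1
            return "Request blocked by policy filter."
        return self.call_model(prompt)

    def call_model(self, prompt: str) -> str:
        # Placeholder for the actual model call behind the secure API.
        return f"<model response to: {prompt!r}>"
```

The point is just that in the mitigated regime, a red-teamer has to get through this whole stack, not only the model’s own refusal training.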
This sounds like it’s relatively easy-mode, and indeed it should be. But I still want to see a separate report on this, since it reassures me that the developer in question is taking reasonable precautions and has implemented them competently.
Unmitigated means the hazards you measure as if the weights had been stolen, or as if you deliberately released the weights (looking at you, Meta). Unmitigated includes:
Unlimited, unfiltered jailbreaking attempts.
White-box jailbreaking, where you get to run an optimization process against the model’s activations to avoid refusals. (White-box attacks are generally harder to resist than black-box attacks.)
Activation steering for non-refusal or for capabilities elicitation (a minimal sketch of the mechanics follows below)
Fine-tuning on task-specific domain knowledge and examples of alignment to terrorist agendas
Merging the model with other models or architectures
It’s fair game for the company to apply anti-fine-tuning modifications to the model before doing the unmitigated testing. It’s not fair game for the company to restrict what the red-teamers do to try to undo such measures. See https://arxiv.org/abs/2405.14577
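Since ‘activation steering’ may be unfamiliar, here’s a minimal sketch of the mechanics, as referenced in the list above: a forward hook that adds a fixed vector to the residual stream at one layer. This assumes a PyTorch transformer whose decoder blocks are exposed as `model.layers`; the layer index, the scale, and how the steering vector is derived (e.g. from contrastive prompt pairs) are all placeholders, not a claim about any particular model or red-team setup.

```python
# Minimal activation-steering sketch (illustrative only).
# Assumes a PyTorch transformer whose decoder blocks live in `model.layers`.
import torch


def add_steering_hook(model, layer_idx: int, steering_vector: torch.Tensor, scale: float = 1.0):
    """Register a forward hook that shifts the residual stream at one layer."""
    def hook(module, inputs, output):
        # Many decoder blocks return a tuple whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * steering_vector.to(hidden.dtype).to(hidden.device)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return model.layers[layer_idx].register_forward_hook(hook)


# Usage sketch: the steering vector might come from the difference of mean
# activations on two contrastive prompt sets (a common published recipe);
# the layer index and dimension below are placeholders.
# handle = add_steering_hook(model, layer_idx=20, steering_vector=torch.zeros(hidden_dim))
# ...run generation with steering applied...
# handle.remove()  # restore normal behavior
```

White-box jailbreaking in the earlier bullet is similar in spirit, except that the perturbation is found by running an optimizer against the model’s internals rather than being fixed in advance.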
I think this distinction is important, since I don’t think that any companies so far have been good at publishing cleanly separated scores for mitigated and unmitigated. What OpenAI called ‘unmitigated’ is what I would call ‘first-pass mitigated, before the red-teamers showed us how many holes our mitigations still had and we fixed those specific issues’. That’s also an interesting measure, but not nearly as informative as a true unmitigated eval score.