One question about the threat model presented here. If we consider a given sabotage evaluation, does the threat model include the possibility of that sabotage evaluation itself being subject to sabotage (or sandbagging, “deceptive alignment” etc.)? “Underperforming on dangerous-capability evaluations” would arguably include this, but the paper introduces the term “sabotage evaluations”. So depending on whether the authors consider sabotage evaluations a subset vs a distinct set from dangerous-capabilities evaluations I could see this going either way based on the text of the paper.
To put my cards on the table here, I’m very skeptical of the ability of any evaluation that works strictly on inputs and outputs (not looking “inside” the model) to address the threat model where those same evals are subject to sabotage without some relevant assumptions. In my subjective opinion I believe the conclusion that current models aren’t capable of sophisticated sabotage is correct, but to arrive there I am implicitly relying on the assumption that current models aren’t powerful enough for that level of behavior. Can that idea, “not powerful enough” be demonstrated based on evals in a non-circular way (without relying on evals that could be sabotaged within the treat model)? Its not clear to me whether that is possible.
does the threat model include the possibility of that sabotage evaluation itself being subject to sabotage … “Underperforming on dangerous-capability evaluations” would arguably include this
Yes, we want to address the possibility of sandbagging on sabotage evaluations too. Thanks for noting that our phrasing is ambiguous on this point.
I’m very skeptical of the ability of any evaluation that works strictly on inputs and outputs (not looking “inside” the model) to address the threat model where those same evals are subject to sabotage without some relevant assumptions.
We can easily incorporate white-box monitoring into any of these evaluations. This version only included black-box monitoring for simplicity, but all of the evals except for sandbagging involve a monitor producing a suspiciousness score. That score could be informed by a white-box / internals-based side channel without changing anything else.
Can that idea, “not powerful enough” be demonstrated based on evals in a non-circular way (without relying on evals that could be sabotaged within the treat model)? Its not clear to me whether that is possible.
Great question. We didn’t discuss it in the paper, because it was long enough already, but we’ve certainly thought about this problem when designing the evals and related safety cases. A rough sketch of the argument looks something like this: “We’re confident that our anti-refusal training is effective because it uses supervised fine-tuning. To double-check this empirically, our sandbagging capabilities eval indicates that even a model trained to sandbag can’t effectively do so after our anti-refusal / capabilities elicitation procedure.”.
One problem with this argument is that our sandbagging eval didn’t include the exact sabotage tasks that we’d like to make claims about—they couldn’t yet because the evals in the paper were developed in parallel. Probably we’ll want future versions of the sandbagging eval to either cover these skills and formats more closely, or include tests to demonstrate consistent generalization across domains.
One question about the threat model presented here. If we consider a given sabotage evaluation, does the threat model include the possibility of that sabotage evaluation itself being subject to sabotage (or sandbagging, “deceptive alignment” etc.)? “Underperforming on dangerous-capability evaluations” would arguably include this, but the paper introduces the term “sabotage evaluations”. So depending on whether the authors consider sabotage evaluations a subset vs a distinct set from dangerous-capabilities evaluations I could see this going either way based on the text of the paper.
To put my cards on the table here, I’m very skeptical of the ability of any evaluation that works strictly on inputs and outputs (not looking “inside” the model) to address the threat model where those same evals are subject to sabotage without some relevant assumptions. In my subjective opinion I believe the conclusion that current models aren’t capable of sophisticated sabotage is correct, but to arrive there I am implicitly relying on the assumption that current models aren’t powerful enough for that level of behavior. Can that idea, “not powerful enough” be demonstrated based on evals in a non-circular way (without relying on evals that could be sabotaged within the treat model)? Its not clear to me whether that is possible.
*edited to fix spelling errors
Yes, we want to address the possibility of sandbagging on sabotage evaluations too. Thanks for noting that our phrasing is ambiguous on this point.
We can easily incorporate white-box monitoring into any of these evaluations. This version only included black-box monitoring for simplicity, but all of the evals except for sandbagging involve a monitor producing a suspiciousness score. That score could be informed by a white-box / internals-based side channel without changing anything else.
Great question. We didn’t discuss it in the paper, because it was long enough already, but we’ve certainly thought about this problem when designing the evals and related safety cases. A rough sketch of the argument looks something like this: “We’re confident that our anti-refusal training is effective because it uses supervised fine-tuning. To double-check this empirically, our sandbagging capabilities eval indicates that even a model trained to sandbag can’t effectively do so after our anti-refusal / capabilities elicitation procedure.”.
One problem with this argument is that our sandbagging eval didn’t include the exact sabotage tasks that we’d like to make claims about—they couldn’t yet because the evals in the paper were developed in parallel. Probably we’ll want future versions of the sandbagging eval to either cover these skills and formats more closely, or include tests to demonstrate consistent generalization across domains.