One question about the threat model presented here: if we consider a given sabotage evaluation, does the threat model include the possibility of that sabotage evaluation itself being subject to sabotage (or sandbagging, “deceptive alignment”, etc.)? “Underperforming on dangerous-capability evaluations” would arguably include this, but the paper introduces “sabotage evaluations” as its own term. So depending on whether the authors consider sabotage evaluations a subset of, or distinct from, dangerous-capability evaluations, I could see this going either way based on the text of the paper.
To put my cards on the table here: I’m very skeptical that any evaluation working strictly on inputs and outputs (not looking “inside” the model) can address the threat model in which those same evals are themselves subject to sabotage, absent some additional assumptions. In my subjective opinion the conclusion that current models aren’t capable of sophisticated sabotage is correct, but to arrive there I am implicitly relying on the assumption that current models aren’t powerful enough for that level of behavior. Can that idea, “not powerful enough”, be demonstrated from evals in a non-circular way, i.e. without relying on evals that could themselves be sabotaged under the threat model? It’s not clear to me whether that is possible.
*edited to fix spelling errors