The indistinguishability criterion is meant as a precaution against a model sandbagging when it is used as the red team attack model. If the model can tell it’s in a control eval, it could pretend to be worse at attacking, making us think the control measures are better than they actually are. If the red team elicitation is robust to this, I agree the criterion isn’t necessary. (See the note below [P2.4].)
I don’t think this makes any sense. How are you hoping to get the model to attack except by telling it that it’s in a control evaluation and you want it to attack? It seems that you are definitely going to have to handle the sandbagging.
FWIW I agree with you and wouldn’t put it the way it is in Roger’s post. Not sure what Roger would say in response.