Your comparisons at the top use a better prompt for your model, that’s cheating.
a method for stronger evaluations that are resistant to jailbreak overhangs
Where is it? My eyes widened here, because if you have some way to establish that a statement about the model won’t turn out invalid due to “latent capabilities that can be elicited with proper conditioning”, that’s a lot of the looming threat gone.
Ah the claim here is that you should use this technique (of removing the safety fence) and then run whatever capability evaluations you’d like to, to evaluate how effective the safety-training technique you used was. This differentiates between the cases where your model was simply likelier to refuse rather than know how to make a bomb (and hence would make a bomb under some off-distribution scenario), and the cases where your model simply doesn’t know (because its dataset was filtered well/because this knowledge was erased because of safety-training).
Your comparisons at the top use a better prompt for your model, that’s cheating.
Where is it? My eyes widened here, because if you have some way to establish that a statement about the model won’t turn out invalid due to “latent capabilities that can be elicited with proper conditioning”, that’s a lot of the looming threat gone.
Ah the claim here is that you should use this technique (of removing the safety fence) and then run whatever capability evaluations you’d like to, to evaluate how effective the safety-training technique you used was. This differentiates between the cases where your model was simply likelier to refuse rather than know how to make a bomb (and hence would make a bomb under some off-distribution scenario), and the cases where your model simply doesn’t know (because its dataset was filtered well/because this knowledge was erased because of safety-training).