Jordan Taylor comments on Sabotage Evaluations for Frontier Models

Jordan Taylor 15 Feb 2025 14:16 UTC
2 points
In your sandbagging experiments, did the anti-refusal datasets consist of open-ended responses or multiple-choice responses? If they were open-ended, then the sandbagging model should have been trying its hardest on them anyway, right? So I’m surprised that SFT works so well as a mitigation there, unless the model was sometimes mistakenly sandbagging on open-ended responses.