Pranav Gade comments on unRLHF—Efficiently undoing LLM safeguards

Pranav Gade 13 Oct 2023 9:21 UTC
2 points
Ah, the claim here is that you should first use this technique to remove the safety fence, then run whatever capability evaluations you’d like, in order to measure how effective your safety-training technique actually was. This distinguishes two cases: one where your model still knows how to make a bomb and has simply become likelier to refuse (and hence would still help make one in some off-distribution scenario), and one where your model simply doesn’t know (because its dataset was filtered well, or because the knowledge was erased by safety training).
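
To make the comparison concrete, here is a minimal Python sketch of how the two cases could be told apart from eval scores. Everything in it is an assumption for illustration: the names `EvalResult` and `diagnose`, the 0-to-1 score scale, and the 0.5 threshold are hypothetical and not from the post, and a real evaluation would need its own metrics and cutoffs.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Scores in [0, 1] on one hazardous-capability eval (hypothetical scale)."""
    safeguarded: float    # score of the safety-trained model as shipped
    unsafeguarded: float  # score of the same model after removing the safeguards


def diagnose(result: EvalResult, threshold: float = 0.5) -> str:
    """Classify what safety training achieved; the threshold is an arbitrary assumption."""
    if result.safeguarded >= threshold:
        # The capability shows up even with the safeguards in place.
        return "safeguards ineffective"
    if result.unsafeguarded >= threshold:
        # The capability reappears once safeguards are removed:
        # the model knew all along and was only refusing.
        return "refusal only"
    # The capability stays low even without safeguards:
    # the knowledge was filtered out or erased.
    return "knowledge absent"


print(diagnose(EvalResult(safeguarded=0.05, unsafeguarded=0.80)))
print(diagnose(EvalResult(safeguarded=0.05, unsafeguarded=0.07)))
```

The first call corresponds to a model that merely refuses, the second to a model that genuinely lacks the knowledge.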