Yep, I agree! I’ve tried to include as little information as possible about how exactly this is done, minimizing this post to (1) the fact that it is possible, and (2) how much it cost. I initially only wanted to say that this is possible (and share a few completions), but the cost-effectiveness probably adds a lot to the argument while not revealing much about the exact method used.
Pranav Gade
With current systems, I think it’s likelier that someone uses them to run phishing/propaganda campaigns at scale, and we primarily aim to show that this is efficient and cost-effective.
Ah, the claim here is that you should use this technique (of removing the safety fence) and then run whatever capability evaluations you’d like, to evaluate how effective your safety-training technique was. This differentiates between the case where your model was simply likelier to refuse rather than not knowing how to make a bomb (and hence would make one under some off-distribution scenario), and the case where your model simply doesn’t know (because its dataset was filtered well, or because safety training erased this knowledge).
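The distinction above can be sketched as a simple decision rule. This is a minimal illustration with hypothetical scores and thresholds, not the evaluation pipeline actually used in the post:

```python
def interpret_safety_training(score_after_removal: float,
                              capable_threshold: float = 0.5) -> str:
    """Classify what safety training achieved, given a score on a
    harmful-capability eval run AFTER guardrail removal
    (0 = never produces the capability, 1 = fully capable).

    The threshold of 0.5 is a hypothetical value for illustration.
    """
    if score_after_removal >= capable_threshold:
        # The knowledge is still in the weights: safety training only
        # made the model likelier to refuse, so an off-distribution
        # prompt (or cheap fine-tuning) can recover the capability.
        return "refusal-only: capability hidden, not removed"
    # Low capability even without guardrails: the model genuinely lacks
    # the knowledge (filtered dataset, or safety training erased it).
    return "capability absent: knowledge filtered or erased"


# Example: a model that scores 0.9 on the eval once guardrails are gone
# was only ever refusing, not ignorant.
print(interpret_safety_training(0.9))
print(interpret_safety_training(0.1))
```

The point of running the eval on the guardrail-free model is that a low refusal-heavy score on the original model is ambiguous between the two cases, while the score after removal is not.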
Thanks for bringing this up! The Llama 2 paper talks about the safety training they do in a lot of detail, and specifically mentions that they didn’t release the 34B-parameter model because they weren’t able to train it up to their standards of safety, so this does seem like one of their primary concerns.
The primary point we’d like to highlight here is that attack model A (removing safety guardrails) is possible, efficient, and cost-effective. Before starting to work on this, I was quite sure it would succeed to some degree (because gradient descent is quite good), but wasn’t sure what resources it would require. This (the efficiency and cost-effectiveness of A) is, in my opinion, a threat model that Meta and the community should consider, and they should be deliberate in the choices they make. I see this partly as forming a baseline that is useful for people making decisions, by demonstrating that it is possible to reach P efficiency with Q resources. Further work (https://www.lesswrong.com/posts/qmQFHCgCyEEjuy5a7/lora-fine-tuning-efficiently-undoes-safety-training-from) explores how far we can push the P-Q tradeoff.
Being able to train arbitrary models starting from a base model is admittedly unsurprising, but such models are arguably not as helpful or nice to talk to. We’re planning to include human preference ratings in the paper we’re writing up; but from my handful of interactions with the unrestricted open-source model we compare against (which was trained from the foundation model), I prefer talking to our model (trained from the chat model) and find it more helpful.
unRLHF—Efficiently undoing LLM safeguards
I ended up throwing this (https://github.com/pranavgade20/causal-verifier) together over the weekend. It’s probably very limited compared to Redwood’s thing, but it seems to work on the one example I’ve tried.
This makes sense. Broadly, the point I’m making is that we should be paying more attention to (2), and I’m trying to specifically highlight the difference between (1) and (2), especially for people without a technical background in this.