Thanks for bringing this up! The Llama 2 paper talks about the safety training they do in a lot of detail, and specifically mentions that they don’t release the 34B model because they weren’t able to train it up to their standards of safety, so it does seem like one of the primary concerns.
The primary point we’d like to highlight here is that attack model A (removing safety guardrails) is possible, and is both efficient and cost-effective. Before starting to work on this, I was quite sure that it would succeed to some degree (because gradient descent is quite good), but I wasn’t sure what resources it would require. The efficiency and cost-effectiveness of A is, in my opinion, a threat model that Meta and the community should consider, and be deliberate about in the choices they make. I see this work partly as establishing a baseline that is useful for people making decisions, by demonstrating that it is possible to reach performance P with resources Q. Further work (https://www.lesswrong.com/posts/qmQFHCgCyEEjuy5a7/lora-fine-tuning-efficiently-undoes-safety-training-from) explores how far the P–Q tradeoff can be pushed.
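For readers less familiar with parameter-efficient fine-tuning, here is a minimal sketch of why the resource requirements (Q) can be so low, using the Hugging Face peft library. The model name and LoRA hyperparameters below are illustrative assumptions, not necessarily the configuration used in our work, and only the adapter setup is shown (no training loop or data):

```python
# Minimal sketch: a LoRA adapter trains only a tiny fraction of the model's weights,
# which is what makes fine-tuning a large chat model cheap.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Chat-tuned model to adapt (illustrative choice, not a prescribed setup).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-chat-hf")

# Low-rank adapters on the attention projections; all other weights stay frozen.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

# Prints the number of trainable parameters as a fraction of the total;
# for a setup like this it is typically well under 1%.
model.print_trainable_parameters()
```

The point is only the order of magnitude: because just the adapter weights are updated, the compute and cost side of the P–Q tradeoff discussed above stays small.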
Being able to train arbitrary models starting from a base model is admittedly unsurprising, but models trained that way are arguably not as helpful or nice to talk to. We plan to include human preference ratings in the paper we’re writing up; from my handful of interactions with the unrestricted open-source model we compare to (which was trained from the foundation model), I prefer talking to our model (trained from the chat model) and find it more helpful.
The primary point we’d like to highlight here is that attack model A (removing safety guardrails) is possible, and is both efficient and cost-effective.
Definitely. Despite my frustrations, I still upvoted your post because I think exploring cost-effective methods to steer AI systems is a good thing.
The Llama 2 paper talks about the safety training they do in a lot of detail, and specifically mentions that they don’t release the 34B model because they weren’t able to train it up to their standards of safety, so it does seem like one of the primary concerns.
I understand you as saying (1) “[whether their safety guardrails can be removed] does seem like one of the primary concerns”. But IMO that isn’t the right way to interpret their concerns, and we should instead think (2) “[whether their models exhibit safe chat behavior out of the box] does seem like one of their primary concerns”. Interpretation 2 explains the decisions made by the Llama 2 authors, including why they put safety guardrails on the chat-tuned models but not the base models, as well as why they withheld the 34B one (since they could not get it to exhibit safe chat behavior out of the box). But under interpretation 1, a bunch of observations are left unexplained, such as the fact that they also released model weights without any safety guardrails, and that they didn’t even try to evaluate whether their safety guardrails can be removed (e.g., by fine-tuning the weights). In light of this, I think the Llama 2 authors were deliberate in the choices that they made; they just did so with a different weighting of considerations than you.
This makes sense. Broadly, the point I’m making is that we should be paying more attention to (1), and I’m trying to specifically highlight the difference between (1) and (2), especially for people without a technical background in this area.