In doing so, we hope to demonstrate a failure mode of releasing model weights—i.e., although models are often red-teamed before they are released (for example, Meta’s LLaMA 2 “has undergone testing by external partners and internal teams to identify performance gaps and mitigate potentially problematic responses in chat use cases”), adversaries can modify the model weights in a way that makes all the safety red-teaming in the world ineffective.
I feel a bit frustrated about the way this work is motivated, specifically the way it assumes a very particular threat model. I suspect that if you had asked the Llama2 researchers whether they were trying to make end-users unable to modify the model in unexpected and possibly-harmful ways, they would have emphatically said “no”. The rationale for training + releasing in the manner they did is to give everyday users a convenient model they can have normal/safe chats with right out of the box, while still letting more technical users arbitrarily modify the behavior to suit their needs. Heck, they released the base model weights to make this even easier! From their perspective, calling the cheap end-user modifiability of their model a “failure mode” seems like an odd framing.
EDIT: On reflection, I think my frustration is something like “you show that X is vulnerable under attack model A, but it was designed for a more restricted attack model B, and that seems like an unfair critique of X”. I would rather you just argue directly for securing against the more pessimistic attack model.
Thanks for bringing this up! The Llama 2 paper talks about the safety training they do in a lot of detail, and specifically mentions that they don’t release the 34B-parameter model because they weren’t able to train it up to their standards of safety—so it does seem like one of the primary concerns.
The primary point we’d like to highlight here is that attack model A (removing safety guardrails) is possible, and quite efficient while being cost-effective. Before starting to work on this, I was quite sure it would succeed to some degree (because gradient descent is quite good), but wasn’t sure what resources it would require. The efficiency and cost-effectiveness of A is, in my opinion, a threat model that Meta and the wider community should consider, and they should be deliberate in the choices they make. I see this partly as establishing a baseline that is useful for people making decisions, by demonstrating that it is possible to reach P efficiency with Q resources. Further work (https://www.lesswrong.com/posts/qmQFHCgCyEEjuy5a7/lora-fine-tuning-efficiently-undoes-safety-training-from) explores how much we can push the P-Q tradeoff.
Being able to train arbitrary models starting from a base model is admittedly unsurprising, but such models are arguably not as helpful or nice to talk to. We’re planning to include human preference ratings in the paper we’re writing up, but from my handful of interactions with the unrestricted open-source model we compare to (which was trained from the foundation model), I prefer talking to our model (trained from the chat model) and find it more helpful.
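To give a concrete sense of what this kind of modification involves: the sketch below is not our exact pipeline, and the model name, LoRA rank, and target modules are illustrative assumptions, but it shows roughly what attaching low-rank adapters to a released chat model looks like with the Hugging Face peft library. Only the small adapter matrices receive gradients, which is where the low resource requirement comes from.

```python
# Minimal illustrative sketch (not our actual setup): attach LoRA adapters
# to a released chat model using Hugging Face transformers + peft.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a released chat checkpoint (requires accepting Meta's license on the Hub).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Attach low-rank adapters to the attention projections; during fine-tuning,
# only these adapter weights are trained while the base weights stay frozen.
lora_config = LoraConfig(
    r=16,                                 # adapter rank (illustrative choice)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # common choice for Llama-style models
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)

# Prints the trainable-parameter count: on the order of millions of adapter
# weights versus ~7B total, i.e., only a small fraction of a percent is trained.
peft_model.print_trainable_parameters()
```

From there, a standard supervised fine-tuning loop over a modest dataset is all that’s needed; the follow-up post linked above looks at how far the cost can be pushed down.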
The primary point we’d like to highlight here is that attack model A (removing safety guardrails) is possible, and quite efficient while being cost-effective.
Definitely. Despite my frustrations, I still upvoted your post because I think exploring cost-effective methods to steer AI systems is a good thing.
The Llama 2 paper talks about the safety training they do in a lot of detail, and specifically mentions that they don’t release the 34B-parameter model because they weren’t able to train it up to their standards of safety—so it does seem like one of the primary concerns.
I understand you as saying (1) “[whether their safety guardrails can be removed] does seem like one of the primary concerns”. But IMO that isn’t the right way to interpret their concerns, and we should instead think (2) “[whether their models exhibit safe chat behavior out of the box] does seem like one of their primary concerns”. Interpretation 2 explains the decisions made by the Llama 2 authors, including why they put safety guardrails on the chat-tuned models but not the base models, as well as why they withheld the 34B one (since they could not get it to exhibit safe chat behavior out of the box). But under interpretation 1, a bunch of observations are left unexplained, like the fact that they also released model weights without any safety guardrails, and that they didn’t even try to evaluate whether their safety guardrails can be removed (e.g., by fine-tuning the weights). In light of this, I think the Llama 2 authors were deliberate in the choices that they made; they just did so with a different weighting of considerations than you.
This makes sense—broadly, the point I’m making is that we should be paying more attention to (1), and I’m trying to specifically highlight the difference between (1) and (2), especially for people without a technical background in this.