The primary point we’d like to highlight here is that attack model A (removing safety guardrails) is feasible, and can be carried out both cheaply and efficiently.
Definitely. Despite my frustrations, I still upvoted your post because I think exploring cost-effective methods to steer AI systems is a good thing.
The Llama 2 paper describes the safety training they do in a lot of detail, and specifically mentions that they didn’t release the 34B-parameter model because they weren’t able to train it up to their standards of safety, so it does seem like one of the primary concerns.
I understand you as saying (1) “[whether their safety guardrails can be removed] does seem like one of the primary concerns”. But IMO that isn’t the right way to interpret their concerns; we should instead think (2) “[whether their models exhibit safe chat behavior out of the box] does seem like one of their primary concerns”. Interpretation 2 explains the decisions made by the Llama 2 authors, including why they put safety guardrails on the chat-tuned models but not the base models, as well as why they withheld the 34B one (since they could not get it to exhibit safe chat behavior out of the box). Under interpretation 1, by contrast, a number of observations are left unexplained, such as the fact that they also released model weights without any safety guardrails, and that they didn’t even try to evaluate whether their safety guardrails can be removed (e.g. by fine-tuning the weights). In light of this, I think the Llama 2 authors were deliberate in the choices they made; they just weighted the considerations differently than you do.
This makes sense. Broadly, the point I’m making is that we should be paying more attention to (2), and I’m trying to specifically highlight the difference between (1) and (2), especially for people without a technical background in this area.