Nice work! Since you cite our LEACE paper, I was wondering if you’ve tried burning LEACE into the weights of a model just like you burn an orthogonal projection into the weights here? It should work at least as well, if not better, since LEACE will perturb the activations less.
Nitpick: I wish you would use a word other than “orthogonalization” since it sounds like you’re saying that you’re making the weight matrix an orthogonal matrix. Why not LoRACS (Low Rank Adaptation Concept Erasure)?
Thanks!
We haven’t tried comparing to LEACE yet. You’re right that it should theoretically be more surgical. That said, from our preliminary analysis, our naive intervention already seems pretty surgical: it has minimal impact on CE loss and MMLU. (I also like that our methodology is dead simple and doesn’t require estimating a covariance matrix.)
I agree that “orthogonalization” is a bit overloaded. Not sure I like LoRACS though—when I see “LoRA”, I immediately think of fine-tuning that requires optimization power (which this method doesn’t). I do think that “orthogonalizing the weight matrices with respect to direction r̂” is the clearest way of describing this method.
I do respectfully disagree here. I think the verb “orthogonalize” is just confusing. I also don’t think the distinction between optimization and no optimization is very important. What you’re actually doing is orthogonally projecting the weight matrices onto the orthogonal complement of the direction.
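For concreteness, here is a minimal sketch (in PyTorch, with illustrative names, not the authors’ actual code) of the operation being described: left-multiplying a weight matrix that writes into the residual stream by the projector onto the orthogonal complement of the direction, i.e. W ← (I − r̂r̂ᵀ)W.

```python
import torch

def project_out_direction(W: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Project a weight matrix onto the orthogonal complement of direction r.

    Assumes W has shape (d_model, d_in) and writes into the residual stream
    along its first dimension, so left-multiplying by (I - r̂ r̂ᵀ) removes any
    component of W's output that lies along r. Names are hypothetical.
    """
    r_hat = r / r.norm()                     # unit direction r̂
    # (I - r̂ r̂ᵀ) W, computed without materializing the d_model x d_model projector
    return W - torch.outer(r_hat, r_hat @ W)
```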