Buck comments on Refusal in LLMs is mediated by a single direction

Buck 30 Apr 2024 0:28 UTC
LW: 3 AF: 2
And I feel like it’s pretty obvious that addressing issues with current harmlessness training, if they improve on state of the art, is “more grounded” than “we found a cool SAE feature that correlates with X and Y!”?
Yeah, I definitely agree with the implication. I was confused because I don’t think that these techniques do improve on the state of the art.
- TurnTrout 2 May 2024 16:26 UTC
  LW: 10 AF: 8
  If that were true, I’d expect the reactions to a subsequent LLaMA3 weight orthogonalization jailbreak to be more like “yawn, we already have better stuff” and not “oh cool, this is quite effective!” Judging from the reception, it seems to me that this is letting people either do new things or do existing things faster, but maybe you have a concrete counter-consideration here?
  - Buck 11 May 2024 19:30 UTC
    LW: 13 AF: 11
    This is a very reasonable criticism. I don’t know, I’ll think about it. Thanks.
    - Neel Nanda 11 May 2024 19:45 UTC
      LW: 3 AF: 3
      Thanks, I’d be very curious to hear if this meets your bar for being impressed, or what else it would take! Further evidence:
      
      Passing the Twitter test (for at least one user)
      Being used by Simon Lerman, an author on Bad LLaMA (admittedly with the help of Andy Arditi, our first author), to jailbreak LLaMA3 70B to help create data for some red-teaming research (EDIT: rather than Simon choosing to fine-tune it, which he clearly knows how to do, being a Bad LLaMA author).