TurnTrout comments on Refusal in LLMs is mediated by a single direction

TurnTrout 2 May 2024 16:26 UTC
LW: 10 AF: 8
If that were true, I’d expect the reactions to a subsequent LLAMA3 weight orthogonalization jailbreak to be more like “yawn, we already have better stuff” and not “oh cool, this is quite effective!” Judging from the reception, it seems to me that this is letting people either do new things or do existing things faster, but maybe you have a concrete counter-consideration here?
- Buck 11 May 2024 19:30 UTC
  LW: 13 AF: 11
  This is a very reasonable criticism. I don’t know, I’ll think about it. Thanks.
  - Neel Nanda 11 May 2024 19:45 UTC
    LW: 3 AF: 3
    Thanks, I’d be very curious to hear if this meets your bar for being impressed, or what else it would take! Further evidence:
    
    Passing the Twitter test (for at least one user)
    Being used by Simon Lerman, an author on Bad LLaMA (admittedly with the help of Andy Arditi, our first author), to jailbreak LLaMA3 70B to help create data for some red-teaming research (EDIT: rather than Simon choosing to fine-tune it, which he clearly knows how to do, being a Bad LLaMA author).