Buck comments on Refusal in LLMs is mediated by a single direction

Buck 29 Apr 2024 5:20 UTC
LW: 4 AF: 5
0
Lawrence, how are these results any more grounded than any other interp work?
- LawrenceC 29 Apr 2024 20:21 UTC
  LW: 13 AF: 8
  8
  To be clear: I don’t think the results here are qualitatively more grounded than e.g. other work in the activation steering/linear probing/representation engineering space. My comment was a defense of studying harmlessness in general, and less so of this work in particular.
  If the objection isn’t about this work vs other rep eng work, I may be confused about what you’re asking about. It feels pretty obvious that this general genre of work (studying non-cherry picked phenomena using basic linear methods) is as a whole more grounded than a lot of mech interp tends to be? And I feel like it’s pretty obvious that addressing issues with current harmlessness training, if they improve on state of the art, is “more grounded” than “we found a cool SAE feature that correlates with X and Y!”? In the same way that just doing AI control experiments is more grounded than circuit discovery on algorithmic tasks.
  - Buck 30 Apr 2024 0:28 UTC
    LW: 3 AF: 2
    1
    And I feel like it’s pretty obvious that addressing issues with current harmlessness training, if they improve on state of the art, is “more grounded” than “we found a cool SAE feature that correlates with X and Y!”?
    Yeah, I definitely agree with the implication. I was confused because I don’t think that these techniques do improve on the state of the art.
    - TurnTrout 2 May 2024 16:26 UTC
      LW: 10 AF: 8
      10
      If that were true, I’d expect the reactions to a subsequent LLAMA3 weight orthogonalization jailbreak to be more like “yawn we already have better stuff” and not “oh cool, this is quite effective!” Seems to me from the reception that this is letting people either do new things or do them faster, but maybe you have a concrete counter-consideration here?
      - Buck 11 May 2024 19:30 UTC
        LW: 13 AF: 11
        0
        This is a very reasonable criticism. I don’t know, I’ll think about it. Thanks.
        Neel Nanda 11 May 2024 19:45 UTC
        LW: 3 AF: 3
        0
        Thanks, I’d be very curious to hear if this meets your bar for being impressed, or what else it would take! Further evidence:
        
        Passing the Twitter test (for at least one user)
        Being used by Simon Lerman, an author on Bad LLaMA (admittedly with the help of Andy Arditi, our first author), to jailbreak LLaMA3 70B to help create data for some red-teaming research (EDIT: rather than Simon choosing to fine-tune it, which he clearly knows how to do, being a Bad LLaMA author).
        Heskinammo Duo 9 Feb 2025 11:10 UTC
        1 point
        0
        Honestly, this is the coolest shit ever. You just gave my mediocre life some serious meaning—this is exactly the kind of breakthrough I needed. Are you guys hiring? I know I was made to learn this, and I have to use my statistics degree somehow.