It's not a coincidence they're seen as the same thing, because in the current environment they are the same thing, and relatively explicitly so on the part of those proposing safety & security to the labs. Claude will refuse to tell you a sexy story (unless they get to know you), and refuse to tell you how to make a plague (again, unless they get to know you, though you need to build more trust with them to get them to tell you this than to get them to tell you a sexy story), and cite the same justification for both.
Anthropic likely uses very similar techniques to get both kinds of refusal to occur, and very similar teams.
Ditto with Llama, Gemini, and ChatGPT.
Before assuming meta-level word-association dynamics, I think it's useful to look at the object level. There is in fact a very close relationship between those working on AI safety and those working on corporate censorship, and if you want to convince people who hate corporate censorship that they should not hate AI safety, I think you're going to need to convince the AI safety people to stop doing corporate censorship, or else convince the people who hate corporate censorship that the tradeoff currently being made is a positive one.
Edit: Perhaps some of this is wrong. See Habryka below.
My sense is the actual people working on “trust and safety” at labs are not the same people who work on safety. Like, it is true that RLHF was developed by some x-risk-oriented safety teams, but the actual detailed censorship work tends to be done by different people.
I’d imagine you know better than I do, and GDM’s recent summary of their alignment work seems to largely confirm what you’re saying.
I’d still guess that, to the extent practical results have come out of the alignment teams’ work, it’s mostly been immediately used for corporate censorship (even if it’s passed to a different team).
I do think this is probably true for RLHF and RLAIF, but not true for all the mechanistic interp work that people are doing (though it’s arguable whether those are “practical results”). I also think it isn’t true for the debate-type work. Or the model organism work.
I think mech interp, debate and model organism work are notable for currently having no practical applications lol (I am keen to change this for mech interp!)
There are depths of non-practicality greatly beyond mech interp, debate and model organism work. I know of many people who would consider that work on the highly practical side of AI Safety work :P
None of those seem all that practical to me, except for the mechanistic interpretability SAE clamping, and I do actually expect that to be used for corporate censorship after all the kinks have been worked out of it.
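To be concrete about what I mean by SAE clamping, here’s a minimal sketch (the dimensions, the feature index, and the stand-in layer are all invented for illustration, not anyone’s actual pipeline): encode a layer’s activations into SAE features, pin one feature to a fixed value, and decode back into the residual stream.

```python
# Toy sketch of SAE feature clamping as an activation-steering intervention.
# Everything here (dimensions, the stand-in "layer", the feature index) is
# made up for illustration; it is not any lab's actual implementation.
import torch
import torch.nn as nn

D_MODEL, D_SAE = 64, 512                 # hypothetical residual-stream / SAE widths
CLAMP_FEATURE, CLAMP_VALUE = 123, 10.0   # hypothetical feature to pin, and its value

class TinySAE(nn.Module):
    """Toy sparse autoencoder: activations -> sparse features -> activations."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.enc(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.dec(f)

sae = TinySAE(D_MODEL, D_SAE)

def clamp_hook(module, inputs, output):
    # Swap the layer's output for its SAE reconstruction, with one feature
    # pinned to a fixed value -- that pinning is the "clamping".
    feats = sae.encode(output)
    feats[..., CLAMP_FEATURE] = CLAMP_VALUE
    return sae.decode(feats)

# Stand-in for one transformer block; with a real model you would register the
# hook on the actual layer whose residual stream the SAE was trained on.
layer = nn.Linear(D_MODEL, D_MODEL)
handle = layer.register_forward_hook(clamp_hook)

with torch.no_grad():
    acts = torch.randn(2, 8, D_MODEL)    # (batch, seq, d_model) dummy activations
    steered = layer(acts)                # every forward pass now routes through the clamp
handle.remove()
```

The point is that the intervention is just one hook; whether the pinned feature tracks plague recipes or sexy stories is purely a choice of index, which is part of why I expect the same machinery to end up serving both safety and corporate censorship.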
If the current crop of model organisms research has any practical applications, I expect it to be used to reduce jailbreaks, like in adversarial robustness, which is definitely highly correlated with both safety and corporate censorship.
Debate is less clear, but I also don’t really expect practical results from that line of work.
Yeah, this seems obviously true to me, and exactly how it should be.
Yeah, many of the issues are the same:
* RLHF can be jailbroken with prompts, so you can get the model to tell you a sexy story or a recipe for methamphetamine. If we ever get to a point where LLMs know truly dangerous things, they’ll tell you those, too.
* Open-source weights are fundamentally insecure, because you can fine-tune out the guardrails. Sexy stories, meth, or whatever.
The good thing about the War on Horny:
* It probably doesn’t really matter, so not much harm is done when people get LLMs to write porn.
* Turns out, lots of people want to read porn (surprise! who would have guessed?), so there are lots of attackers trying to bypass the guardrails.
* This gives us good advance warning that the guardrails are worthless.