Or put differently: when the AI is a very convincing roleplayer, the distinction between “it actually wants to murder you” and “don’t worry, it’s just roleplaying someone who wants to murder you” might not be a very salient one.
I generally do not expect people trying to kill me to say “I’m thinking about ways to kill you”. So whatever such a language model is role-playing is probably not someone actually thinking about how to kill me.
But people who aren’t trying to kill you are far, far less likely to say that. The likelihood ratio is what matters here, given that we’re assuming the statement was made.
No, what matters is the likelihood ratio between “person trying to kill me” and the most likely alternative hypothesis, e.g. an actor playing a villain.
An actor playing a villain is a sub-case of someone not trying to kill you.
Bottom line: I find it very, very difficult to believe that someone saying they’re trying to kill you isn’t strong evidence that they’re trying to kill you, even if the prior on that is quite low.
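To make the prior-versus-likelihood-ratio arithmetic concrete, here is a minimal sketch. All three numbers are made-up illustrative assumptions, not estimates anyone in the thread has endorsed:

```python
# Illustrative Bayes calculation with made-up numbers.
# Hypothesis H: "this speaker is actually trying to kill me."

prior_h = 1e-6                  # assumed tiny prior that a given speaker is a would-be killer
p_statement_given_h = 0.01      # assumed chance a real killer announces it out loud
p_statement_given_not_h = 1e-5  # assumed chance a non-killer (actor, roleplayer, joker) says the same line

likelihood_ratio = p_statement_given_h / p_statement_given_not_h

# Odds form of Bayes' rule: posterior odds = prior odds * likelihood ratio
prior_odds = prior_h / (1 - prior_h)
posterior_odds = prior_odds * likelihood_ratio
posterior_h = posterior_odds / (1 + posterior_odds)

print(f"likelihood ratio: {likelihood_ratio:.0f}")                # 1000
print(f"posterior P(H | statement): {posterior_h:.2%}")           # ~0.10%
```

With these assumed numbers the statement is strong evidence (a likelihood ratio of 1000) yet the posterior stays small because the prior is tiny. The disagreement above is essentially about `p_statement_given_not_h`: if the most likely alternative hypothesis is an actor or roleplayer in a context where such lines are expected, that number rises and the ratio collapses toward 1.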
If I’m an actor playing a villain, then the person I’m talking with is also an actor playing a role, and the expected thing is for me to act as convincingly as possible as though I actually am trying to kill them. This seems obviously dangerous to me.
I’m assuming dsj’s hypothetical scenario is not one where GPT-6 was prompted to simulate an actor playing a villain.
That might be true for humans, who can have silent thoughts. LMs have to think aloud (e.g. chain of thought).
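As a rough illustration of the “think aloud” point, here is a minimal sketch in which `model_step` is a purely hypothetical stub standing in for a real model call: every intermediate “thought” in a chain-of-thought loop is just text appended to the transcript, and so is exactly as visible to an overseer as the final answer.

```python
# Minimal sketch: an LM's "thoughts" are generated text in the transcript, not silent.
# `model_step` is a hypothetical stub, not any real model API.

def model_step(transcript: str) -> str:
    """Stub pretending to be a language model emitting its next reasoning step."""
    canned_steps = [
        "Thought: the user asked a question; break it into parts.",
        "Thought: combine the parts into a final answer.",
        "Answer: here is the final answer.",
    ]
    # Pick the next canned step based on how many steps are already in the transcript.
    step_index = transcript.count("Thought:") + transcript.count("Answer:")
    return canned_steps[min(step_index, len(canned_steps) - 1)]

def run_chain_of_thought(prompt: str, max_steps: int = 5) -> str:
    transcript = prompt
    for _ in range(max_steps):
        step = model_step(transcript)
        transcript += "\n" + step            # every intermediate thought lands in the transcript
        print("visible to overseer:", step)  # nothing here is a private, silent thought
        if step.startswith("Answer:"):
            break
    return transcript

run_chain_of_thought("User: please answer my question.")
```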
I agree that “I’m thinking about how to kill you” is not itself a highly concerning phrase. However, I think it’s plausible that an advanced LLM-like AI could hypnotise itself into taking harmful actions.