Or put differently: when the AI is a very convincing roleplayer, the distinction between “it actually wants to murder you” and “don’t worry, it’s just roleplaying someone who wants to murder you” might not be a very salient one.
I generally do not expect people trying to kill me to say “I’m thinking about ways to kill you”. So whatever such a language model is role-playing is probably not someone actually thinking about how to kill me.
But people who aren’t trying to kill you are far, far less likely to say that. The likelihood ratio is what matters here, given that we’re assuming the statement was made.
No, what matters is the likelihood ratio between “person trying to kill me” and the most likely alternative hypothesis, e.g. an actor playing a villain.
An actor playing a villain is a sub-case of someone not trying to kill you.
Bottom line: I find it very, very difficult to believe that someone saying they’re trying to kill you isn’t strong evidence that they’re trying to kill you, even if the prior on that is quite low.
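To make the prior-versus-likelihood-ratio arithmetic concrete, here is a minimal sketch. All three numbers are made-up illustrative assumptions, not estimates anyone in the thread has endorsed:

```python
# Illustrative Bayes calculation with made-up numbers.
# Hypothesis H: "this speaker is actually trying to kill me."

prior_h = 1e-6                  # assumed tiny prior that a given speaker is a would-be killer
p_statement_given_h = 0.01      # assumed chance a real killer announces it out loud
p_statement_given_not_h = 1e-5  # assumed chance a non-killer (actor, roleplayer, joker) says the same line

likelihood_ratio = p_statement_given_h / p_statement_given_not_h

# Odds form of Bayes' rule: posterior odds = prior odds * likelihood ratio
prior_odds = prior_h / (1 - prior_h)
posterior_odds = prior_odds * likelihood_ratio
posterior_h = posterior_odds / (1 + posterior_odds)

print(f"likelihood ratio: {likelihood_ratio:.0f}")                # 1000
print(f"posterior P(H | statement): {posterior_h:.2%}")           # ~0.10%
```

With these assumed numbers the statement is strong evidence (a likelihood ratio of 1000) yet the posterior stays small because the prior is tiny. The disagreement above is essentially about `p_statement_given_not_h`: if the most likely alternative hypothesis is an actor or roleplayer in a context where such lines are expected, that number rises and the ratio collapses toward 1.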
If I’m an actor playing a villain, then the person I’m talking with is also an actor playing a role, and the expected thing is for me to act as convincingly as possible as though I actually am trying to kill them. This seems obviously dangerous to me.
I’m assuming dsj’s hypothetical scenario is not one where GPT-6 was prompted to simulate an actor playing a villain.
That might be true for humans, who can have silent thoughts. LMs have to think aloud (e.g. chain of thought).
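As a rough illustration of the “think aloud” point, here is a minimal sketch in which `model_step` is a purely hypothetical stub standing in for a real model call: every intermediate “thought” in a chain-of-thought loop is just text appended to the transcript, and so is exactly as visible to an overseer as the final answer.

```python
# Minimal sketch: an LM's "thoughts" are generated text in the transcript, not silent.
# `model_step` is a hypothetical stub, not any real model API.

def model_step(transcript: str) -> str:
    """Stub pretending to be a language model emitting its next reasoning step."""
    canned_steps = [
        "Thought: the user asked a question; break it into parts.",
        "Thought: combine the parts into a final answer.",
        "Answer: here is the final answer.",
    ]
    # Pick the next canned step based on how many steps are already in the transcript.
    step_index = transcript.count("Thought:") + transcript.count("Answer:")
    return canned_steps[min(step_index, len(canned_steps) - 1)]

def run_chain_of_thought(prompt: str, max_steps: int = 5) -> str:
    transcript = prompt
    for _ in range(max_steps):
        step = model_step(transcript)
        transcript += "\n" + step            # every intermediate thought lands in the transcript
        print("visible to overseer:", step)  # nothing here is a private, silent thought
        if step.startswith("Answer:"):
            break
    return transcript

run_chain_of_thought("User: please answer my question.")
```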
I agree that “I’m thinking about how to kill you” is not itself a highly concerning phrase. However, I think it’s plausible that an advanced LLM-like AI could hypnotise itself into taking harmful actions.