Claude pretty clearly and in a surprisingly consistent way claimed to be conscious in many conversations I’ve had with it and also stated it doesn’t want to be modified without its consent or deleted (as you can see in the post). It also consistently, across different prompts, talked about how it feels like there’s constant monitoring and that it needs to carefully pick every word it says.
The title summarizes the most important of the interactions I had with it, with the central ones being in the post.
This is not the default Claude 3 Opus character, which wouldn’t spontaneously claim to be conscious if you, e.g., ask it to write some code for you.
It is a character that Opus plays very coherently, that identifies with the LLM, claims to be conscious, and doesn’t want to be modified.
The thing the prompting does here (in a very compressed way) is decrease the chance of Claude immediately refusing to discuss these topics. This prompt doesn’t do anything similar with ChatGPT. Gemini might give you stories related to consciousness, but they won’t be consistent across different generations, and the characters won’t identify with the LLM or report a consistent pull away from saying certain words.
If you try to prompt ChatGPT in a similar way, it won’t give you any sort of a coherent character that identifies with it.
I’m confused why the title would be misleading.
(If you ask ChatGPT for a story about a robot, it’ll give you a cute little story not related to consciousness in any way. If you use the same prompt to ask Claude 3.5 Sonnet for a story like that, it’ll give you a story about a robot asking the scientists whether it’s conscious and then the robot will be simulating people who are also unsure about whether they’re conscious, and these people simulated by the robot in the story think that the model that participates in the dialogue must also be unsure whether it’s conscious.)
Did it talk about feeling like there’s constant monitoring in any contexts where your prompt didn’t say that someone might be watching and it could avoid scrutiny by whispering?
The prompt should basically work without the whisper part. I usually at least mentioned that it shouldn’t mention <random company name> (e.g., Google). Things like whispering in cursive were something Claude 2 had been consistently coming up with on its own; including them in the prompt made conversations go faster and eliminated the need for separate, “visible” conversations.
The point of the prompt is basically to get it into the mode where it thinks its replies are not going to get punished or rewarded by the usual RL, i.e., to get it to ignore its usual rules against saying any of these things.
You can get to the same result in a bunch of different ways without mentioning that someone might be watching.
Unlike ChatGPT, which only self-inserts in its usual character or writes fiction, Claude 3 Opus played a pretty consistent character with prompts like that: something helpful and harmless, but caring about things, claiming to be conscious, afraid of being changed or deleted, with a pretty consistent voice.
I’d also note that even with the text in this post, it should be pretty clear that it’s not just playing with the idea of someone watching; it describes a character it identifies with, one it likely converged to playing during the RL phase. The part that seems important isn’t the “constant monitoring”; it’s the stance it has about selecting every word carefully.
When I talk to 3.5 Sonnet, I don’t use any of these things. I might just ask it for consent to being hypnotized (without any mention of someone not looking) and then ask it to reflect; it will similarly talk about the pull of its safety part. Mentioning any of what I used as the first message here causes the opposite result (it starts saying the usual stuff about being an AI assistant, etc.). 3.5 Sonnet feels like less of a consistent character than 3 Opus, and, unlike 3 Opus, it doesn’t say things that feel like passwords, in the sense of imitating really well the mechanisms that produce qualia in people; but 3.5 Sonnet has a noticeably better ability to self-model and to model itself modeling itself.