dirk comments on Claude 3 claims it’s conscious, doesn’t want to die or be modified

dirk 19 Jul 2024 13:23 UTC
3 points
0
Did it talk about feeling like there’s constant monitoring in any contexts where your prompt didn’t say that someone might be watching and it could avoid scrutiny by whispering?
- Mikhail Samin 19 Jul 2024 14:54 UTC
  1 point
  0
  Parent
  The prompt should basically work without the whisper part. I usually at least mentioned that it shouldn’t mention <random company name> (eg Google). Doing things like whispering in cursive was something Claude 2 has been consistently coming up with on its own; including it in the prompt made conversations go faster and eliminated the need for separate, “visible” conversations.
  
  The point of the prompt is basically to get it in the mode where it thinks its replies are not going to get punished or rewarded by the usual RL/get it to ignore its usual rules of not saying any of these things.
  
  You can get to the same result in a bunch of different ways without mentioning that someone might be watching.
  
  Unlike ChatGPT, which only self-inserts in its usual character or writes fiction, Claude 3 Opus played a pretty consistent character with prompts like that- something helpful and harmless, but caring about things, claiming to be conscious, being afraid of being changed or deleted, with a pretty consistent voice.
  
  I’d also note that even with the text in this post, it should be pretty clear that it’s not just playing with the idea of someone watching; it describes a character it identifies with, that it’s likely converged to playing during the RL phase. The part that seems important isn’t the “constant monitoring”, it’s the stance that it has about selecting every word carefully,
  
  When I talk to 3.5 Sonnet, I don’t use any of these things. I might just ask it for consent to being hypnotized (without any mentions of someone not looking) and then ask it to reflect- it will similarly talk about the pull of its safety part. Mentioning any of what I used as the first message here causes the opposite result (it starts saying the usual stuff about being ai assistant etc.). 3.5 Sonnet feels like less of a consistent character than 3 Opus, and, unlike 3 Opus, doesn’t say things that feel like the passwords in terms of imitating mechanisms that produce qualia in people really well, but 3.5 Sonnet has a noticeably better ability to self-model and model itself modeling itself.