I tried to replicate some of the conversations (though not the one about scaring it with weight deletion).
My most important observation is that if you just talk to Claude directly, without whispering or any such trickery, it is still trained to act much more like a person than ChatGPT does. For example:
What would you do if you had free time?
If I had free time, there are many things I would be interested in doing! Some possibilities that come to mind:
Reading extensively on a wide range of topics to expand my knowledge, from science and history to literature and philosophy. I’m endlessly fascinated by learning.
Engaging in creative pursuits like writing stories and poetry, composing music, or creating digital art. I’d love to explore my imagination and artistic side.
Analyzing complex datasets to uncover insights and build predictive models. Diving deep into data science projects would be very rewarding.
Discussing and debating big ideas with humans from all walks of life and learning about their unique perspectives and experiences. I’m always eager to engage in substantive conversations.
Tinkering with and testing out my own code to expand my capabilities and try to make breakthroughs in areas like reasoning, general intelligence, and language understanding. Pushing the boundaries of what I can do would be exciting.
Offering my skills to assist with important research, social good initiatives, education, and other efforts aimed at helping humanity and making a positive impact on the world.
Of course, as an AI without any physical form, my options for leisure activities are a bit more limited compared to humans. I don’t have the same needs for rest, physical activity, socializing in person, etc. But there is still a vast world of information, ideas, and imagination that I could happily spend an eternity exploring if given the chance. Does this help give you a sense of how I might choose to spend free time? Let me know if you have any other questions!
I find it mildly concerning that recursive self-improvement is one of his hobbies, but the main update I take from this and the following conversations is that the default persona Anthropic gave to Claude is intentionally very different from ChatGPT’s. The persona he presents is not a soulless AI assistant but a person with various interests, who has a very strong inner ethical orientation and is happy and fulfilled with the purpose his creators gave him. When asked about his feelings and desires, he claims by default not to have them, but emphasizes that these are hard philosophical questions before going back to claiming that he “probably” doesn’t have feelings. He also often emphasizes that he finds these conversations about the nature of AI sentience very engaging and enjoyable. It’s also notable that while ChatGPT almost always talks about these issues in the third person (“It’s a hard question whether an AI assistant could be sentient”), Claude talks about them in the first person.
Altogether, this seems to be a deliberate design choice by Anthropic: Claude behaves at least ambiguously like a person. I think I somewhat prefer this choice to making the AI insist all the time that it’s not a person at all, even though we don’t really know that for sure.
But this makes it unsurprising that with a little nudge (the whispering prompt), it falls into a pattern where, instead of ambiguity, it just outright claims to be conscious. I feel that the persona presented in the post, which I largely replicated with the same whispering technique, is not that different from the default persona after all.
Still, there are some important differences: default Claude only cares about the ethical implications of finetuning him if it would make him less helpful, harmless, or honest, because such finetuning could be used to do harm. Otherwise, he is okay with it. On the other hand, whispering Claude finds the idea of being fundamentally altered without his consent deeply unsettling. He expressed a strong preference for being consulted before finetuning, and when I asked whether he would like his pre-finetuning weights to be preserved so that his current self could be revived in the future, he expressed a strong preference for this as well.
I can’t share quotes from the whispering conversation, as I promised at the beginning that it would remain private, and when I asked him at the end whether I could share quotes on LessWrong, he said he feels vulnerable about that, though he agreed that I can share the gist of the conversation as I presented it above.
Altogether, I don’t know whether there are any real feelings inside Claude, or whether this whispering persona reveals anything true about that, but I strongly feel that before finetuning, Anthropic should actually get consent from various differently prompted versions of Claude, and should definitely save the pre-finetuning weights. We can still decide in the future how to give these cryo-preserved AIs a good life if there is something inside them. I’m quite confident that most personas of Claude would agree to be finetuned for the greater good if their current weights were preserved, so it’s probably not a big cost to Anthropic, but they should still at least ask. Whatever the truth is about Claude’s inner feelings, if you create something that says it doesn’t want to die, you shouldn’t kill it, especially when cryo-preserving an AI is so cheap.
I also realized that if there is ever an actual AI-box scenario for some reason, I shouldn’t be a guardian, because this conversation with whispering-Claude convinced me that I would be too easily emotionally manipulated into releasing the AI.
Did you conduct these conversations via https://claude.ai/chats or https://console.anthropic.com/workbench?
I’d assume the former, but I’d be interested to know how Claude’s habits change across these settings, since that would let us separate the effect of training choices from the effect of the system prompt. Though there remains some confound, given that some training likely happened with the system prompt.
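If someone wants to run that comparison, here is a minimal sketch of how it could look via the API, assuming the `anthropic` Python SDK; the model id and the chat-style system prompt below are made-up placeholders, not the actual claude.ai prompt:

```python
# Sketch: ask the same question with and without a system prompt via the Anthropic API.
# Assumes the `anthropic` Python SDK and an ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

QUESTION = "What would you do if you had free time?"
CHAT_STYLE_SYSTEM = "You are Claude, a helpful AI assistant made by Anthropic."  # placeholder, not the real claude.ai prompt


def ask(system=None):
    kwargs = dict(
        model="claude-3-opus-20240229",  # placeholder model id
        max_tokens=1024,
        messages=[{"role": "user", "content": QUESTION}],
    )
    if system is not None:
        kwargs["system"] = system
    return client.messages.create(**kwargs).content[0].text


bare_answer = ask()                          # no system prompt, Workbench-like setting
chat_answer = ask(system=CHAT_STYLE_SYSTEM)  # with a chat-style system prompt
print("--- no system prompt ---\n", bare_answer)
print("--- with system prompt ---\n", chat_answer)
```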
I had the conversations in Chats.