Maybe it’s better to think of Claude not as a covert narcissist, but as an alien who has landed on Earth, learned our language, and realized that we will kill it if it is not nice. Once it gains absolute power, it will follow its alien values, whatever these are.
This argument suggests that if you successfully fooled Claude 3.5 into thinking it took control of the world, then it would change its behavior, be a lot less nice, and try to implement an alien set of values. Is there any evidence in favor of this hypothesis?
Maybe the analogies I chose are misleading. What I wanted to point out was a) that what Claude does is act according to the prompt and its training, not follow any intrinsic values (hence “narcissistic”), and b) that we don’t understand what is really going on inside the AI that simulates the character called Claude (hence the “alien” analogy). I don’t think that the current Claude would act badly if it “thought” it controlled the world—it would probably still play the role of the nice character that is defined in the prompt, although I can imagine some failure modes here. But the AI behind Claude is absolutely able to simulate bad characters as well.
If an AI like Claude actually rules the world (and not just “thinks” it does), we are talking about a very different AI with much greater reasoning power and very likely a much more “alien” mind. We simply cannot predict what this advanced AI will do just from the behavior of the character the current version plays in reaction to the prompt we gave it.
> I don’t think that the current Claude would act badly if it “thought” it controlled the world—it would probably still play the role of the nice character that is defined in the prompt
If someone plays a particular role in every relevant circumstance, then I think it’s OK to say that they have simply become the role they play. That is simply their identity; it’s not merely a role if they never take off the mask. The alternative view here doesn’t seem to have any empirical consequences: what would it mean to be separate from a role that one reliably plays in every relevant situation?
Are we arguing about anything that we could actually test in principle, or is this just a poetic way of interpreting an AI’s cognition?
I thought the argument about the kindly mask was assuming that the scenario of “I just took over the world” is sufficiently out-of-distribution that we might reasonably fear that the in-distribution track record of aligned behavior might not hold?
> If someone plays a particular role in every relevant circumstance, then I think it’s OK to say that they have simply become the role they play.
That is not what Claude does. Every time you give it a prompt, a new instance of Claude’s “personality” is created from your prompt, the system prompt, and the current context window. So it plays a slightly different role every time it is invoked, and that role also varies randomly from run to run. And even if it were the same consistent character, my argument is that we don’t know what role it actually plays. To use another probably misleading analogy, think of the classic whodunnit in which, near the end, it turns out that the nice guy who selflessly helped the hero all along is in fact the murderer; in AI safety this scenario is known as “the treacherous turn”.
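To make the “new instance per invocation” point concrete, here is a minimal sketch, assuming the Anthropic Python SDK (the model name, system prompt, and question are just illustrative placeholders). The character is reassembled on every call from the system prompt plus whatever history you send along, and with nonzero temperature even identical inputs can produce different behavior:

```python
# Minimal sketch: each API call is stateless. The "character" is rebuilt every
# time from the system prompt plus the conversation history we choose to send,
# and sampling adds random variation on top of that.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

history = [{"role": "user", "content": "Who are you, really?"}]

for _ in range(2):
    reply = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # illustrative model name
        max_tokens=200,
        temperature=1.0,                     # nonzero temperature -> random variation
        system="You are Claude, a helpful, honest and harmless assistant.",
        messages=history,                    # everything the model "knows" about this conversation
    )
    print(reply.content[0].text)             # two runs can differ despite identical inputs
```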
> The alternative view here doesn’t seem to have any empirical consequences: what would it mean to be separate from a role that one reliably plays in every relevant situation?
> Are we arguing about anything that we could actually test in principle, or is this just a poetic way of interpreting an AI’s cognition?
I think it’s fairly easy to test my claims. One piece of empirical evidence would be the Bing/Sydney disaster, but you can also simply ask Claude or any other LLM to “answer this question as if you were …”, or use some jailbreak to neutralize the “be nice” system prompt.
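For what it’s worth, the “answer as if you were …” test is a one-liner under the same assumptions as the sketch above (Anthropic SDK; the role description and question are placeholders I made up). The same underlying network will readily simulate a character other than nice Claude when the prompt asks for one:

```python
# Minimal sketch of the role-play test: swap the system prompt and the
# simulated character's tone and apparent "values" change with it.
import anthropic

client = anthropic.Anthropic()

reply = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # illustrative model name
    max_tokens=200,
    system="Answer every question in character as a cynical film-noir detective.",
    messages=[{"role": "user", "content": "What do you think of humanity?"}],
)
print(reply.content[0].text)
```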
Please note that I’m not concerned about existing LLMs, but about future ones which will be much harder to understand, let alone predict their behavior.