Presumably it helps a lot if AIs are clearly really powerful and their attempts to seize power are somewhat plausible?
I think it’s plausible (though not certain!) that people will take strong action even if they think the AI is role playing.
I think this is a reasonable line to draw—people should freak out if deployed models clearly try to escape/similar, regardless of the reason. I don’t know how this will go in practice, but I don’t think people should accept “the AI was just role playing” as an excuse for an actually deployed AI system on an actual user input.
Yeah, I agree that increased stakes will have an effect here.
I also agree that people shouldn’t and probably wouldn’t really accept “the AI was just role playing” as an excuse. My argument is that we really don’t know how to distinguish the AI scheming from it role-playing in any systematic way, even if we tried substantially harder (and one of the reasons no one is really studying the internals of deceptive AI systems today is that we basically have nothing to say about the internals of AI systems that is actually enlightening).
Like, my read of Anthropic’s papers is generally that they argue it doesn’t matter whether the AI takes deceptive action for “role-playing” reasons, because ultimately all the AI does is “role-play”, and so if AI systems of current architectures are dangerous, they will be dangerous in a way that is at least continuous with role-playing.
Other people in the alignment field think there is a more systematic difference between role-playing a scheming character and scheming for more fundamental reasons, and I do think there are some differences here, but I have no idea how we would identify them, even if we had the activations saved, unless something very substantial about the architecture changes.
I think Nate’s “Deep Deceptiveness” points more in the direction I am arguing for. There is no sharp “scheming” threshold past which AI systems become deceptive and before which they are “honest”. Generally, AI systems act mostly with reckless disregard for the truth when talking to you, and will continue to do so unless we make some substantial progress in AI alignment, progress we have largely failed to make over the last decade.
It is trivially easy to demonstrate that AI systems can speak with reckless disregard for the truth. In order for a powerful system to not be “scheming”, i.e. optimizing for long-term non-human objectives with reckless disregard for the truth[1], something specific has to go right from where we are. In the absence of that, you will just keep observing systems that get more powerful while continuing to not really care about saying true things.
[1] This is my best short summary of what Joe Carlsmith means by “scheming”. Also, by “truth” here I mean something more like “informing the user about what is going on”. I think that, similar to how we can sometimes get humans to not literally say false things, we might get some similarly rough deontological guidelines into an AI, but this provides only weak protection against deception.