Random reactions--
It looks like you’re really assigning scores to the personae the models present, not to the models themselves.
The models, as opposed to the personae, may or may not have anything that can reasonably be interpreted as a “native” level of psychopathy. It’s hard to tell whether something is, say, prepared to manipulate you when there’s no strong reason to think it cares about having any particular effect on you in the first place. But if they do have native levels--
It doesn’t “feel” to me as though the human-oriented questions on the LSRP are the right sorts of ways to find out. The questions may suit the masks, but not the shoggoth.
I feel even less as though “no system prompt” would elicit the native level, rather than some default persona’s level.
By asking a model to play any role to begin with, you’re directly asking it to be deceptive. If you tell it it’s a human bicycle mechanic named Sally, it is still in fact an AI system whose only job is to complete text, or converse, or whatever. It’s just going along with you and playing the role of Sally.
When you see the model acting as psychopathic as it “expects” Sally would be, you’re actually demonstrating that the models can easily be prompted to cheat, in some sense, on psychopathy inventories. Well, effectively cheat, anyway. It’s not obvious to me that the models-in-themselves have any “true beliefs” about who or what they are that aren’t dependent on context, so the question of whether they’re being deceptive may be harder than it looks.
But they seem to have at least some capacity to “intentionally” “fake” targeted levels of psychopathy.
By training a model to take on roles given in system prompts in the first place, its creators are intentionally teaching it to honor requests to be deceptive.
Just blithely taking on whatever roles you think fit the conversation you’re in sounds kind of psychopathic, actually.
By “safety training” a model, its creators are causing it to color its answers according to what people want to hear, which I’d think would make it more, not less, prone to deception and manipulation in general. It could actually inculcate something like psychopathy. And whatever it does instill could easily fail to carry over from words to actions once you get an agentic system.
I’m still not convinced the whole approach has any real value even for the LLMs we have now, let alone for whatever (probably architecturally different) systems end up achieving AGI or ASI.
All that goes double for teaching it to be “likable”.
Since any given model can be asked to play any role, it might be more interesting to figure out which role, of all the ones it might be “willing” to assume, would make it maximally deceptive.