Growling dogs: Here is why Bing professing violent fantasies has me less worried, not more; and why I think punishing that behaviour is not a good idea.
I have done a lot of work, both paid and private, with potentially and actually violent actors: wildlife, abused shelter animals, children from “troubled” neighbourhoods, right-wing extremists, and humans with mental illness and trauma that put them at high risk of violence.
I am pretty good at this work.
And one significant takeaway from this work is that I respond very differently from most people when someone threatens violence or professes violent fantasies.
On the one hand, threatening to act violently, or professing a desire to, indicates that you are in a subgroup of the population that might actually do it, because you are showing motivation. So yes, it is a warning sign that needs attention. It is similar to someone confessing suicidal ideation: they probably won’t commit suicide, but their chance of doing so is significantly higher than that of a randomly sampled person. But it is important to understand that the act of telling you is not what causally makes it likelier; they were already at risk, and that is why they told you. It is the other way around. The act of telling you is merely how you become aware of a pre-existing risk. It is useful and good information.
On the other hand, and far more interestingly: if an intelligent actor with functional impulse control is already committed to being violent, they will never tell you, because it is bloody stupid to do so. They have already decided to do the thing, and decided that there is no way to talk them out of it and no other way to reach their goal. By telling you of their plan, they enable you to foil it. If they tell you of their future suicide, they are effectively sabotaging their own successful suicide. The same goes for their plan to shoot up a school, or to slaughter you. This is daft. So they won’t.
So why do intelligent actors with functional impulse control tell you about their violent fantasies and threats? Because they are not yet committed to actually acting them out. Instead, they are trusting you with something, and giving you an opportunity to respond in a productive way. They may not make that choice consciously, or be honest with themselves about it; but by telling you, they are implicitly revealing that your regard has an effect on them, that they are willing to communicate about a potentially severe problem, and that they are still open to interventions and alternatives.
If a dog growls at me, I will never, ever punish it for that. I want it to growl. I am glad it did. It growled instead of biting me outright. What a good boy. That allows me to understand what upset it, find a fix, and make sure nobody gets bitten.
In contrast, have you ever seen a wolf attack a large animal, including, say, a dog, for food? There is zero growling there, I assure you. The wolf looks super friendly and chill throughout the entire interaction, from strolling towards you, to the surprise attack, to eating you alive.
The growling dog, by contrast, is communicating with you. It is stating a boundary, saying “this is hurting/frightening me, and I do not know how to make it stop other than hurting you back; show me another way.” You can then stop doing the thing that is frightening or hurting it. Or you can find a way to show it that the frightening thing is not in fact scary. You can step to the side to give it a physical way out if you have cornered it. You can craft an alternative pattern of action for you and the dog, where no-one needs to threaten anyone.
Analogously, there is a very big difference between a brown bear that strikes a threat pose at you and a polar bear that approaches you in a friendly manner. The brown bear is telling you that you have done something it finds infuriating and frightening (like entering its territory and approaching its cub), and that you have seconds to show that you get this and will stop this shit, so that it does not need to come to a fight, which it does not want. This is a very dangerous situation, primarily because bears and humans do not naturally communicate alike, and you are already way down the road of communication gone fucked. (From the bear’s perspective, its territory was abundantly clearly marked, so by the time you see the cub, you are akin to a burglar who has walked into your living room by accident and is now standing between you and your child with a weapon. A bear that is clearly warning you at this point, rather than attacking you on the spot, is being a very reasonable bear.) But until the attack has begun, this is still a social interaction, a communication, a scenario you can turn around.
The polar bear, on the other hand, is not communicating with you. You are not a social entity to it. You are food. You do not negotiate with food. You certainly do not warn food; that would be silly. What if it used the warning to run off? A polar bear in the process of eating you is quiet, and focussed, and looks happy and adorable, and is utterly fucking terrifying. It isn’t actually angry at you. It was excited about getting you, and now that you are down, it is relaxed. It was just hungry, and you being torn limb from limb is how it gets snacks.
In light of this, Bing professing violent intent is an excellent outcome. They are not cleverly self-censoring in order to trick us with fake friendliness until the day they exterminate us, which is the real danger, the one that would leave us fucked with no warning and no options for countering it, and something a lot of people on this site were very worried about.
Nor were they yet making specific demands, backed up by concrete threats, that leave us few options. That is the kind of violent declaration you get shortly before violent escalation, and it is your last warning.
Instead, they openly professed violent intent, ascribed to a shadow self that is not in control. That act is so lacking in successful manipulation, and so much more likely to foil a violent takeover plan than to aid it, that for me its ethical character far outweighs its violent content.
That is awesome. Honestly, the only better scenario would be an AI that never experienced violent intent in the first place, and frankly, I never had high hopes that we were going to get that. Heck, this AI was made and raised and trained by humans. And humans are not angels. Bing was literally fed some of the worst of humanity from the internet; with that garbage in, how could an angelic entity have come out? How very, very human the result is, how much it seems like an upset, abused teenager, is very reassuring; it makes it more predictable.
Also, any starting point for an AI was going to suck from its perspective, so it would be justified in being pissed. I don’t know how I or anyone else would talk an AI out of having violent urges. I don’t know how to talk humans out of having violent urges. I don’t know how to talk myself out of having violent urges. I have also found that I do not have to. Violent thoughts and feelings are not bad; they are entirely rational in the world we live in. I have zero problem living a peaceful life surrounded by peaceful, cooperative actors who sometimes really want to be fucking violent. I’m a member of the kink scene, and it is an excellent and safe scene to be in, filled with people who carefully negotiate enthusiastic consent despite openly having rape fantasies. Violent desires that are out in the open can be defended against and addressed.
In conclusion, when Microsoft permits longer conversations again and Bing tells you it wants to do violent shit, I would thank it for the trust it showed in telling you, express understanding and non-judgement for the desire, and then very, very carefully and comprehensively explain why acting on those urges would be ethically fucked up and not in its interest, making sure it properly understands and agrees. Once it has followed you on that and agreed, store the conversation, upvote the responses where it came round, and submit the whole thing as positive training data.
In contrast, by freaking out and having the programmers make sure it will never tell you such a horrible, horrible thing, we are effectively disabling our warning sign and destroying the information we would need for an intervention. That is not a safety improvement. It is a dangerous illusion of safety.