I agree with this in most cases. I do think there’s a distinction to be drawn between ‘user tricks the model into saying a racist word’ and the ‘model aggressively gaslights user’ dialogues that have been floating around from Bing Chat; the latter seem at least closer to an alignment failure.
Yes, the difference with the Bing Chat dialogues is that (a) they seem to be triggered by words and inputs not deliberately crafted by humans to make the model misbehave, and (b) the typical Bing Chat behavior is orthogonal to anything Microsoft designed it to do or anything the model’s users seem to want from it. This makes it distinct from misuse and much more concerning from a safety standpoint.
Exactly. It depends on the level of effort required to achieve an outcome the creator didn’t intend. If grandma would have to be drugged or otherwise put into an extreme situation before showing any violent tendencies, then we don’t consider her a dangerous person. Someone else might also be peaceful in ideal circumstances, but if they can be easily provoked to violence by mild insults, then it’s fair to say they’re a violent person, i.e. misaligned.
Given this, I think it’s really useful to see the kinds of prompts people are using to get unintended behaviour from ChatGPT / Bing Chat. If little effort is required to provoke unwanted behaviour (unwanted from the point of view of the creators / general human values), then the model is not sufficiently aligned. It’s especially concerning if bad outcomes can plausibly be elicited by mistake, even when the specific example was found by someone deliberately searching for it.
Of course, in the case of the kitchen knife, misuse is easy, which is why we have laws around purchasing and carrying knives in public. The same goes for cars, guns, etc. AI applications need to prove they’re safer than a kitchen knife if they are to be used by the general public without controls. For OpenAI and the like, surely the point is to show that regulation is not required, rather than to achieve alignment perfection.