Presumably Microsoft does not want its chatbot to be hostile and threatening to its users? Pretty much all the examples have that property.
If the examples were selected primarily to demonstrate that the chatbot does lots of things Microsoft does not want, then I would have expected the majority of examples to be things Microsoft does not want besides being hostile and threatening to users. For instance, I would guess that in practice most instances of BingGPT doing things Microsoft doesn’t want are simply hallucinating sources.
Yet instances of hostile/threatening behavior toward users are wildly overrepresented in the post (relative to the frequency I’d expect among the broader category of behaviors-Microsoft-does-not-want), which is why I thought that the examples in the post were not primarily selected just on the basis of being things Microsoft does not want.
Hostile/threatening behavior is surely a far more serious misalignment from Microsoft’s perspective than anything else, no? That’s got to be the most important thing you don’t want your chatbot doing to your customers.
The surprising thing here is not that Bing Chat is misaligned at all (e.g. that it hallucinates sources). ChatGPT did that too, but unlike Bing Chat it’s very hard to get ChatGPT to threaten you. So the surprising thing here is that Bing Chat is substantially less aligned than ChatGPT, and specifically in a hostile/threatening way that one would expect Microsoft to have really not wanted.
No. I’d expect the most serious misalignment from Microsoft’s perspective is a hallucination which someone believes, which causes material damage as a result, and which Microsoft can then be sued over. Hostile language from the LLM is arguably a bad look in terms of PR, but not obviously particularly bad for the bottom line.
That said, if this was your reasoning behind including so many examples of hostile/threatening behavior, then from my perspective that at least explains away the high proportion of examples which I think are easily misinterpreted.
Obviously we can always play the game of inventing new possible failure modes that would be worse and worse. The point, though, is that the hostile/threatening failure mode is quite bad and new relative to previous models like ChatGPT.
Why do you think PR and the bottom line aren’t tightly correlated? I think PR is pretty important to the bottom line for a product in the rollout phase.