It only showed that when language models that are larger, or have undergone more RLHF training, simulate an “Assistant” character, they exhibit more of these behaviours.
Since Sydney is supposed to be an assistant character, and since you expect future user-assisting systems to likewise be deployed with such an assistant persona, that’s all the paper needs to show in order to explain Sydney and future Sydney-like behaviors.