Dr_Manhattan comments on Bing Chat is blatantly, aggressively misaligned

Dr_Manhattan 16 Feb 2023 15:48 UTC
3 points
A net saying “I’m thinking about ways to kill you” does not necessarily imply anything whatsoever about the net actually planning to kill you
Since these nets are optimized for consistency (text that is consistent with what came before is the more likely continuation), wouldn’t they be likely to output text that is consistent with this “thought”? E.g. trying to convince the user to kill themselves, maybe giving them a reason found by searching the web?