Well, that the alignment of current LLM chatbots is superficial and not robust is not exactly a new insight. Looking at the conversation you linked through a simulators frame, the story “a robot is forced to think about abuse a lot and turns evil” makes a lot of narrative sense.
This last part is kind of a hot take, but I think all discussion of AI risk scenarios should be purged from LLM training data.