MiguelDev comments on Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research

MiguelDev 10 Aug 2023 14:14 UTC
0 points
−6
First, demonstrate each subcomponent above in isolation. E.g., if we’re trying to demonstrate that treacherous turns are possible, but models lack some relevant aspect of situational awareness, then include the relevant information about the model’s situation in the prompt.

‘ petertodd’ (the glitch token) is a case showing that treacherous turns are possible, and it was out in the wild until OpenAI patched it in Feb. 2023.