Adam Jermyn comments on Smoke without fire is scary

Adam Jermyn 5 Oct 2022 20:42 UTC
3 points
1
Sorry I mean a prompt that incentivizes employing capabilities relevant for deceptive alignment. For instance, you might have a prompt that suggests to the model that it act non-myopically, or that tries to get at how much it understands about its own architecture, both of which seem relevant to implementing deceptive alignment.