Sorry I mean a prompt that incentivizes employing capabilities relevant for deceptive alignment. For instance, you might have a prompt that suggests to the model that it act non-myopically, or that tries to get at how much it understands about its own architecture, both of which seem relevant to implementing deceptive alignment.
Sorry I mean a prompt that incentivizes employing capabilities relevant for deceptive alignment. For instance, you might have a prompt that suggests to the model that it act non-myopically, or that tries to get at how much it understands about its own architecture, both of which seem relevant to implementing deceptive alignment.