The basic problem is that this hypothetical language model will not, in fact, do what X, having considered the question carefully for a while, would have wanted it to do. What it will do is output text that statistically looks like what would come after that prompt, had the prompt appeared somewhere on the internet.
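To make the "statistical continuation" point concrete, here is a minimal sketch (assuming the HuggingFace `transformers` library and GPT-2 weights, neither of which this post specifies): the model never consults anyone's considered wishes, it just samples tokens that are likely to follow the prompt text.

```python
# Minimal sketch of "output text which statistically looks like it would come
# after that prompt". GPT-2 via HuggingFace transformers is an assumption made
# for illustration, not something the post names.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "What would X, having considered this carefully for a while, do? X would"
inputs = tokenizer(prompt, return_tensors="pt")

# Each new token is drawn from the model's estimate of what is likely to
# follow the text so far -- nothing more is being optimised.
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Run it a few times and you get several plausible-sounding continuations, each one simply a likely next stretch of internet text, not a faithful execution of X's intent.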
The Waluigi effect seems relevant here. From the perspective of Simulator Theory, the prompt is meant to summon a careful simulacrum that follows the instruction to a T, but in reality this works only if “on the actual internet characters described with that particular [prompt] are more likely to reply with correct answers.”
Things can get even weirder, and the model can collapse into the complete antithesis of the nice, friendly, aligned persona:
- Rules normally exist in contexts in which they are broken.
- When you spend many bits of optimisation locating a character, it only takes a few extra bits to specify their antipode (a rough sketch of this point follows the list).
- Protagonist versus antagonist is a common plot trope.
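One way to make the "few extra bits" point concrete is to read "bits of optimisation" as description length in the Kolmogorov-complexity sense (an illustrative framing of the summarised argument, not its own notation): once the prompt has paid the cost of locating the luigi, the waluigi is nearly free, because conditioning on the luigi makes the antipode cheap to describe.

```latex
% Illustrative framing only: K(.) stands in for "bits of optimisation".
\[
  K(\text{waluigi})
  \;\le\;
  K(\text{luigi})
  + \underbrace{K(\text{waluigi} \mid \text{luigi})}_{\text{``same character, inverted'': a few bits}}
  + O(\log)
\]
```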