EDIT: Whenever I use colloquial phrases like “the AI believes a (false) X” I mean that we are using utility indifference to accomplish that goal, without actually giving the AI false beliefs.
It would be better to say, “The AI believes in falsely believing X” or “The AI believes it ought to falsely believe X” or “the AI is compelled to self-delude on X.”
It would be better to say, “The AI believes in falsely believing X” or “The AI believes it ought to falsely believe X” or “the AI is compelled to self-delude on X.”