The problem of figuring out preference without wireheading seems very similar to the problem of maintaining factual knowledge about the world without suffering from appeals to consequences. In both cases, a specialized part of the agent's design (a model of preference, or a model of a fact about the world) has a purpose (accurate modeling of its referent) whose pursuit can be at odds with the consequentialist decision making of the agent as a whole. The desired outcome seems to involve maintaining the integrity of the specialized part, resisting corruption by the agent's consequentialist reasoning.
Given this analogy, it might be possible to transfer lessons from the more familiar problem of learning facts about the world to the harder problem of protecting preference.
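
To make the contrast concrete, here is a minimal toy sketch in Python, purely illustrative and not from the original post: it contrasts a fact-model whose update rule answers only to evidence with a corrupted variant whose update is hijacked by the planner's utility, the epistemic analogue of wireheading and appeals to consequences. All function names and numbers are hypothetical.

```python
# Toy sketch (all names illustrative): two update rules for a specialized
# "fact model" embedded in a consequentialist agent.

def evidence_driven_update(belief: float, observation: float,
                           rate: float = 0.1) -> float:
    """Integrity preserved: the belief moves toward the evidence,
    regardless of what the planner would prefer to believe."""
    return belief + rate * (observation - belief)

def consequence_driven_update(belief: float, observation: float,
                              utility_of_belief) -> float:
    """Corrupted variant: the update defers to consequentialist reasoning,
    keeping whichever belief scores higher utility."""
    honest = evidence_driven_update(belief, observation)
    return max(honest, belief, key=utility_of_belief)

# An agent whose utility depends on believing the fuel tank is full will,
# under the corrupted rule, refuse to update toward bad news.
belief = 0.9             # believed fuel level
observation = 0.2        # sensor reads nearly empty
comfort = lambda b: b    # utility increases with the believed level

print(evidence_driven_update(belief, observation))              # 0.83, moves toward 0.2
print(consequence_driven_update(belief, observation, comfort))  # 0.9, stays put
```

The design point of the sketch is that `evidence_driven_update` never takes utility as an argument at all: protecting the specialized part means its interface simply does not expose a channel through which the agent's consequentialist reasoning could reach in.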