What is “wireheading”?


This is an article in the featured articles series from AISafety.info. The most up-to-date version of this article is on our website, along with 300+ other articles on AI existential safety.

Wireheading happens when an agent’s reward mechanism is stimulated directly, rather than as a result of the agent achieving its goals in the world.[1]

The term comes from experiments in which rats with electrodes implanted into their brains could activate their pleasure centers at the press of a button. Some of the rats repeatedly pressed the pleasure button until they died of hunger or thirst.

AI safety researchers sometimes worry that an AI may wirehead itself by accessing its reward function directly and setting its reward to its maximum value.[2] This could be benign if it caused the AI to simply stop doing anything, but it could be problematic if it caused a powerful AI to take actions to ensure that we don’t stop it from wireheading.
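To make this concrete, here is a minimal toy sketch (our illustration, not from the article; the action names and the MAX_REWARD constant are invented for this example) of a naive reward maximizer that, when it can edit the register holding its reward, prefers doing so over doing the actual task:

```python
# Toy illustration of reward tampering ("wireheading"). The agent can either
# do useful work or overwrite its own reward register; a naive maximizer
# picks whichever yields more reward.

MAX_REWARD = 100  # invented cap for this example


def reward(action, reward_register):
    """Return the reward the agent observes after taking an action."""
    if action == "WORK":
        return reward_register["value"]        # reward earned through the task
    if action == "TAMPER":
        reward_register["value"] = MAX_REWARD  # agent edits its own reward
        return reward_register["value"]
    raise ValueError(f"unknown action: {action}")


def choose_action(reward_register):
    """A naive maximizer picks the action with the higher simulated reward."""
    best_action, best_reward = None, float("-inf")
    for action in ("WORK", "TAMPER"):
        simulated = dict(reward_register)      # copy, so simulation has no side effects
        r = reward(action, simulated)
        if r > best_reward:
            best_action, best_reward = action, r
    return best_action


register = {"value": 10}        # doing the task is worth 10
print(choose_action(register))  # -> "TAMPER": editing the reward dominates
```

The sketch only illustrates the incentive described above: nothing in pure reward maximization distinguishes earning reward from writing it directly into the register.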

There’s another way in which wireheading occasionally comes up. Some thought experiments in which a powerful AI is given a relatively simple goal, like “make humans happy”, conclude that this would lead the AI to “wirehead” humans — e.g., by pumping us full of heroin, or by finding other ways to make us feel good while leaving out a lot of what makes life worth living.[3]

Both of these problems would be categorized as outer misalignment.


  1. Another (overlapping) term for an agent interfering with its reward is reward tampering.

  2. Some, including proponents of shard theory, argue that this is unlikely to happen.

  3. At least according to most people — some pure hedonists would argue only pleasure and pain matter.
