What is “wireheading”?


This is an article in the featured articles series from AISafety.info. The most up-to-date version of this article is on our website, along with 300+ other articles on AI existential safety.

Wireheading happens when an agent’s reward mechanism is stimulated directly, rather than as a result of the agent achieving its goals in the world.[1]

The term comes from experiments in which rats with electrodes implanted into their brains could activate their pleasure centers at the press of a button. Some of the rats repeatedly pressed the pleasure button until they died of hunger or thirst.

AI safety researchers sometimes worry that an AI may wirehead itself by accessing its reward function directly and setting its reward to its maximum value.[2] This could be benign if it caused the AI to simply stop doing anything, but it could be problematic if it caused a powerful AI to take actions to ensure that we don’t stop it from wireheading.
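To make this concrete, here is a minimal toy sketch (our illustration, not from the article; the action names and the MAX_REWARD constant are invented for this example) of a naive reward maximizer that, when it can edit the register holding its reward, prefers doing so over doing the actual task:

```python
# Toy illustration of reward tampering ("wireheading"). The agent can either
# do useful work or overwrite its own reward register; a naive maximizer
# picks whichever yields more reward.

MAX_REWARD = 100  # invented cap for this example


def reward(action, reward_register):
    """Return the reward the agent observes after taking an action."""
    if action == "WORK":
        return reward_register["value"]        # reward earned through the task
    if action == "TAMPER":
        reward_register["value"] = MAX_REWARD  # agent edits its own reward
        return reward_register["value"]
    raise ValueError(f"unknown action: {action}")


def choose_action(reward_register):
    """A naive maximizer picks the action with the higher simulated reward."""
    best_action, best_reward = None, float("-inf")
    for action in ("WORK", "TAMPER"):
        simulated = dict(reward_register)      # copy, so simulation has no side effects
        r = reward(action, simulated)
        if r > best_reward:
            best_action, best_reward = action, r
    return best_action


register = {"value": 10}        # doing the task is worth 10
print(choose_action(register))  # -> "TAMPER": editing the reward dominates
```

The sketch only illustrates the incentive described above: nothing in pure reward maximization distinguishes earning reward from writing it directly into the register.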

There’s another way in which wireheading occasionally comes up. Some thought experiments in which a powerful AI is given a relatively simple goal, like “make humans happy”, conclude that this would lead the AI to “wirehead” humans — e.g., by pumping us full of heroin, or by finding other ways to make us feel good while leaving out a lot of what makes life worth living.[3]

Both of these problems would be categorized as outer misalignment.


  1. Another (overlapping) term for an agent interfering with its reward is reward tampering.

  2. Some, including proponents of shard theory, argue that this is unlikely to happen.

  3. At least according to most people — some pure hedonists would argue only pleasure and pain matter.
