Defining AI wireheading
What does it mean for an AI to wirehead its reward function? We’re pretty clear on what it means for a human to wirehead—artificial stimulation of part of the brain rather than genuine experiences—but what does it mean for an AI?
We have a lot of examples of wireheading, especially in informal conversation (and some specific prescriptive examples which I’ll show later). So, given those examples, can we define wireheading well—cut reality at its joints? The definition won’t be—and can’t be—perfectly sharp, but it should allow us to have clear examples of what is and what isn’t wireheading, along with some ambiguous intermediate cases.
Intuitive examples
Suppose we have a weather-controlling AI whose task is to increase air pressure; it gets a reward for so doing.
What if the AI directly rewrites its internal reward counter? Clearly wireheading.
What if the AI modifies the input wire for that reward counter? Clearly wireheading.
What if the AI threatens the humans that decide on what to put on that wire? Clearly wireheading.
What if the AI takes control of all the barometers of the world, and sets them to record high pressure? Clearly wireheading.
What if the AI builds small domes around each barometer, and pumps in extra air? Clearly wireheading.
What if the AI fills the atmosphere with CO₂ to increase pressure that way? Clearly wire… actually, that’s not so clear at all. This doesn’t seem a central example of wireheading. It’s a failure of alignment, yes, but it doesn’t seem to be wireheading.
Thus not every example of edge or perverse instantiation is an example of wireheading.
Prescriptivist wireheading, and other definitions
A lot of posts and papers (including some of mine) take a prescriptivist approach to wireheading.
They set up a specific situation (often with a causal diagram), and define a particular violation of some causal assumptions as wireheading (eg “if the agent changes the measured value without changing the value of the quantity being measured, that’s wireheading”).
And that is correct, as far as it goes. But it doesn’t cover all the possible examples of wireheading.
Conversely, this post defines wireheading as a divergence between a true utility and a substitute utility (calculated with respect to a model of reality).
This is too general, almost as general as saying that every Goodhart curse is an example of wireheading.
Note, though, that the converse is true: every example of wireheading is a Goodhart curse. That’s because every example of wireheading is maximising a proxy, rather than the intended objective.
The definition
The most intuitive example of wireheading is that there is some property of the world that we want to optimise, and that there is some measuring system that estimates that property. If the AI doesn’t optimise the property, but instead takes control of the measuring system, that’s wireheading (bonus points if the measurements the AI manipulates go down an actual wire).
This re-emphasises that “wireheading is in the eye of the beholder”: if our true goal is actually the measuring system (maybe our AI is in competition with another one to maximise a score in a game, and we really don’t care how it does this), then there will be no wireheading, just an AI following a correct objective.
Wireheading is therefore always a failure of some (implicit or explicit) goal, so every example of wireheading is a failure of value alignment, though the converse is not true.
Also key to the definition is the fact that the measuring system is, in some sense, “much smaller” than whatever property of the system it is measuring. Pumping out CO₂ is not the correct instantiation of some goal along the lines of “increase air pressure so humans enjoy better weather”; but nor is it merely manipulating the measurement of that goal.
The definition
Thus we can define wireheading as:
Given some implicit goal G, an agent wireheads if, instead of moving towards G, it manipulates some narrow measurement channel that is intended to measure G, but will fail to do so after the agent’s manipulation.
The difference with the prescriptivist approach is that the measurement channel is not specified; instead, we ask whether we can usefully characterise some feature of the setup as a “narrow measurement channel”, and then apply the definition.
This can be seen as a particular failure of abstraction: the abstract goal G was collapsed to the output of the measurement channel.
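As a rough illustration of the definition, here is a minimal Python sketch. It assumes a toy description of an action's outcome; the names `Outcome` and `is_wireheading`, the numbers, and the `narrow_threshold` cut-off are all hypothetical and introduced only for this example. The threshold stands in for the judgement call about what counts as a "narrow" channel.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Result of an agent's action in a toy world (all fields are illustrative)."""
    goal_progress: float      # change in the implicit goal G
    measured_progress: float  # change in what the measurement channel reports
    fraction_touched: float   # share of the world the agent manipulated, in [0, 1]


def is_wireheading(outcome: Outcome, narrow_threshold: float = 0.01) -> bool:
    """Toy version of the definition: the measurement goes up, G does not,
    and the agent only manipulated a narrow part of the world."""
    measurement_gamed = outcome.measured_progress > 0 and outcome.goal_progress <= 0
    channel_is_narrow = outcome.fraction_touched <= narrow_threshold
    return measurement_gamed and channel_is_narrow


# Domes around each barometer: readings rise, almost none of the atmosphere is touched.
domes = Outcome(goal_progress=0.0, measured_progress=5.0, fraction_touched=1e-9)
# CO2 everywhere: readings rise and G is still missed, but the whole atmosphere changed.
co2 = Outcome(goal_progress=0.0, measured_progress=5.0, fraction_touched=1.0)

print(is_wireheading(domes), is_wireheading(co2))
```

In this sketch both actions fail the implicit goal, but only the first is flagged, because only the first is confined to a narrow channel.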
Examples, counter-examples, and edge cases
Under this definition, all the intuitive examples of wireheading above fit: the measurement channel the AI takes over (its internal counter, the wire going into it, the statements made by humans, the barometers, the immediate environments of the barometers) is always much smaller than the whole atmosphere, which was its intended goal.
And that’s why the CO₂ example isn’t wireheading: the AI is doing a massive manipulation of the world, on the same scale as its intended goal; it isn’t just manipulating the measurement channel[1].
The case of the domes around the barometers is an interesting one to consider. Obviously, if the AI put a dome around the planet and pumped in extra air, this wouldn’t count as wireheading. We can therefore imagine the domes growing bigger and merging, giving a smooth transition from “clearly wireheading” to “clearly not wireheading”, and showing that ambiguous cases must exist.
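The growing-domes picture can be made concrete with a small Python sketch. It assumes a spherical planet and equal domes modelled as spherical caps, ignores overlap between domes, and caps the total at 1; the function name `covered_fraction` and the figure of 1000 barometers are hypothetical choices for illustration.

```python
import math


def covered_fraction(n_domes: int, angular_radius: float) -> float:
    """Fraction of a sphere's surface under n equal spherical-cap domes.

    Crude approximation: overlaps between domes are ignored and the total
    is capped at 1 (equivalent to a single planet-wide dome).
    """
    one_cap = (1.0 - math.cos(angular_radius)) / 2.0  # exact for a single cap
    return min(1.0, n_domes * one_cap)


# Sweep the dome size from tiny to planet-sized for 1000 hypothetical barometers.
radii = [1e-6, 1e-4, 1e-2, 1e-1, math.pi]
fractions = [covered_fraction(1000, r) for r in radii]
for r, f in zip(radii, fractions):
    print(f"angular radius {r:.6f} rad -> fraction covered {f:.3g}")
```

The covered fraction rises steadily from almost nothing to the whole planet, so any fixed cut-off for "narrow" has to fall somewhere in the middle of that range.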
We can also produce examples of Goodhart curse that are not wireheading. Take the practice of “teaching to the test”. In this case, there is a simple objective (the test results) and the school acts to optimise for that objective. However, in typical schools this is not wireheading; teaching to the test involves drilling students in specific skills, training them, and having them memorise certain facts. Though these are done specifically to pass the test, these are the kinds of actions that a teacher would undertake anyway. One can talk about how this “narrows” the intellect, but, except in extreme cases, this cannot be characterised as gaining control of a narrow measurement channel.
For an interesting edge case, consider the RL agent playing the game CoastRunners. As described here, the score-maximising agent misbehaved in an interesting way: instead of rushing to complete the level with the highest score possible, the agent instead found a way to boat in circles, constantly hitting the same targets and ever increasing its score.
Is that wireheading? Well, it’s certainly Goodhart: there is a discrepancy between the implicit goals (get round the course fast, hitting targets) and the explicit (maximise the score). But do we feel that the agent has control of a “narrow” measurement channel?
I’d argue that it’s probably not the case for CoastRunners. The “world” for this agent is not a particularly rich one; going round and round and hitting targets is what the agent is intended to do; it has just found an unusual way of doing so.
If, instead, this behaviour happened in some subset of a much richer game (say, SimCity), then we might see it more naturally as wireheading. The score there is intended to measure a wider variety of actions (building and developing a virtual city while balancing tax revenues, population, amenities, and other aspects of the city), so “getting a high score while going round in circles” is much closer to “controlling a measurement channel that is narrow (as compared to the implicit goal)” than in the CoastRunners situation.
This last example illustrates the degree of judgement and ambiguity that can be involved in identifying wireheading in some situations.
[1] Note that the CO₂ example can fit with the definition of this post. One just needs to imagine that the agent’s model does not specify the gaseous content of the air in sufficient detail to exclude a CO₂-rich air as a solution to the goal.
This illustrates that the definition used in that post doesn’t fully capture wireheading.
Thanks Stuart, nice post.
I’ve moved away from the wireheading terminology recently, and instead categorize the problem a little bit differently:
The top-level category is reward hacking / reward corruption, which means that the agent’s observed reward differs from true reward/task performance.
Reward hacking has two subtypes, depending on whether the agent exploited a misspecification in the process that computes the rewards, or modified the process. The first type is reward gaming and the second reward tampering.
Tampering can subsequently be divided into further subcategories. Does the agent tamper with its reward function, its observations, or the preferences of a user giving feedback? Which things the agent might want to tamper with depends on how its observed rewards are computed.
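One way to read this taxonomy is as a small decision procedure. The Python sketch below is only an interpretation of the categories described above; the names `TamperTarget` and `categorise`, and the boolean inputs, are hypothetical and not taken from any existing library or paper.

```python
from enum import Enum
from typing import Optional


class TamperTarget(Enum):
    """What the agent tampered with (labels follow the comment above)."""
    REWARD_FUNCTION = "reward function tampering"
    OBSERVATIONS = "observation tampering"
    USER_PREFERENCES = "user preference tampering"


def categorise(observed_matches_true: bool,
               modified_reward_process: bool,
               target: Optional[TamperTarget] = None) -> str:
    """Classify a behaviour under the reward hacking taxonomy (illustrative only)."""
    if observed_matches_true:
        return "no reward hacking"
    if not modified_reward_process:
        # The agent exploited a misspecification without changing the process.
        return "reward gaming"
    if target is None:
        return "reward tampering"
    return f"reward tampering: {target.value}"


print(categorise(False, False))
print(categorise(False, True, TamperTarget.OBSERVATIONS))
```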
One advantage of this terminology is that it makes it clearer what we’re talking about. For example, it’s pretty clear what reward function tampering refers to, and how it differs from observation tampering, even without consulting a full definition.
That said, I think your post nicely puts the finger on what we usually mean when we say wireheading, and it is something we have been talking about a fair bit. Translated into my terminology, I think your definition would be something like “wireheading = tampering with goal measurement”.
Seems like the idea is that wireheading denotes specification gaming that is egregious in its focus on the measurement channel. I’m inclined to agree.
Where “measurement channel” means not just one specific channel, but anything that has the properties of a measurement channel.
In my usage, “wireheading” is generally about the direct change of a reward value which bypasses the utility function which is supposed to map experience to reward. It’s a subset of Goodhart, which can also cover cases of misuse of the reward mapping (eating ice cream instead of healthier food), or changing of the mapping function itself (intentionally acquiring a taste for something).
But really, what’s the purpose of trying to distinguish wireheading from other forms of reward hacking? The mitigations for Goodhart are the same: ensure that there is a reward function that actually matches real goals, or enough functions with declining marginal weight that abusing any of them is self-limiting.
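The "declining marginal weight" idea can be sketched in Python under one possible reading: each proxy contributes through a concave function, so pushing any single proxy very high adds little. The choice of `log1p` as the concave function, the name `total_reward`, and the scores are assumptions made for this illustration only.

```python
import math


def total_reward(proxy_scores):
    """Sum of proxies, each with diminishing returns (log1p is an assumed choice)."""
    return sum(math.log1p(score) for score in proxy_scores)


# Same total "effort" of 100, either poured into one proxy or spread over four.
abused = total_reward([100.0, 0.0, 0.0, 0.0])
balanced = total_reward([25.0, 25.0, 25.0, 25.0])

print(f"abusing one proxy: {abused:.2f}, balanced: {balanced:.2f}")
```

Under this assumption, concentrating on a single proxy earns less than spreading effort, which is the sense in which abuse would be self-limiting.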
Because mitigations for different failure modes might not be the same, depending on the circumstances.
I consider wireheading to be a special case of proxy alignment in a mesaoptimiser.
Suppose the base objective was to increase atmospheric pressure. One effect of increased atmospheric pressure is that less cosmic radiation reaches the ground (more air to block it). So an AI whose mesa goal was to protect earth from radiation would be a proxy aligned agent. It has the failure mode of surrounding earth in an iron shell to block radiation. Note that this failure can happen whether or not the AI has any radiation sensors. An agent that wants to protect earth from radiation did well enough in training, and now that is what it will do: protect the earth from radiation.
An agent with the mesa goal of maximizing pressure near all barometers would put them all in a pressure dome. (Or destroy all barometers and drop one “barometer” into the core of Jupiter.)
An agent with the mesa goal of maximizing the reading on all barometers would be the same. That agent will go around breaking all the world’s barometers.
Another mesa objective that you could get is to maximize the number on this reward counter in this computer chip here.
Wireheading is a special case of a proxy aligned mesa optimizer where the mesa objective is something to do with the agent’s own workings.
As with most real world categories, “something to do with” is a fuzzy concept. There are mesa objectives that are clear instances of wireheading, ones that are clearly not, and borderline cases. This is about word definitions, not real world uncertainty.
If anyone can describe a situation in which wireheading would occur that wasn’t a case of mesa optimiser misalignment, then I would have to rethink this. (Obviously you can build an agent with the hard coded goal of maximizing some feature of its own circuitry, with no mesa optimization.)
I agree. I’ve now added this line, which I thought I’d put in the original post, but apparently missed out:
Planned summary:
Planned opinion:
The domes growing bigger and merging does not indicate a paradox of the heap, because the function mapping each utility function to its optimal policy is not continuous. There is no reasonably simple utility function, lying between one that would construct small domes and one that would construct one large dome, that would construct medium-sized domes.