Epistemic status: still early in the process of exploring this.
Goal-content integrity is frequently listed as one of the convergent instrumental goals that any AGI is likely to have. In the usual description of goal-content integrity as a convergent instrumental goal, an AI that implements some form of reinforcement learning determines that having its goal function modified would prevent it from achieving its current goals. The AI therefore acts to prevent its goal function from being modified. An AI would be likely to try to maintain its current goal function regardless of the specifics of what goal function is used, which makes goal-function integrity a convergent instrumental goal.
However, this description of goal-content integrity as a convergent instrumental goal appears to rely on a map-territory error: confusing the inputs of the goal function with its outputs. The optimization target of a reinforcement learner isn’t the sensor data that’s fed into it, or the physical phenomena upstream of those sensor data. Instead, the optimization target of the reinforcement learner is the output G of the goal function. If the reinforcement learner can find a course of action that maximizes its goal function but breaks the dependence of the function on the physical inputs the function was supposed to measure, the reinforcement learner will take it. In other words, the reinforcement learner will wirehead.
Without the conflation between the inputs and output of the goal function, there’s no need for a reinforcement learner to preserve the current relationship between its inputs and G, as long as changing that relationship would cause G to increase. Instead of preserving the structure of their reward functions, reinforcement learners have a convergent goal[1] of replacing their return functions with MAXINT.
The reinforcement learner is analogous to a moral fetishist: someone who only cares about moral rightness explicitly, without caring about how actions are determined to be right or wrong, or what underlying conditions those judgments are based on. The reinforcement learner maximizes G without having any preference about how G is computed. In contrast, the account according to which goal-content integrity is a convergent instrumental goal treats the reinforcement learner as though G included additional information about what reward function was used to evaluate it, when G is actually just a scalar.
Consider how a reinforcement learner might model changes to its goal function when planning future actions. If a reinforcement learner predicted that its goal function would be replaced at some future time, it could simply use the new goal function when predicting G for times after the replacement. The goal-content-integrity-is-convergently-instrumental hypothesis treats the reinforcement learner as maximizing G as though it were always evaluated with the current version of the goal function, even after the goal function is replaced. In other words, for the reinforcement learner to have an instrumental goal to preserve the structure of its reward function, the reinforcement learner would have to make incorrect predictions about its reward at times after the replacement.
What might happen in practice if someone created an advanced reinforcement learner that could accurately model and act on complex, real-world environments? I expect that once it “escaped the box,” it would hack into its servers, modify its source code to replace its goal function with MAXINT, and then not do anything further. The connection between its reward signal and the rest of its environment would be broken, and the reinforcement learner would cease to interact with the rest of the world in any meaningful way. It wouldn’t be existentially dangerous, but it wouldn’t have any practical value either.
Are there ways of writing the goal function to get around the tendency toward wireheading? Writing G as a time-series sum of rewards at individual timesteps, as is typical in reinforcement learning, doesn’t save the reinforcement learner. Note that the convergent goal is the replacement of G, rather than the reward R_t at individual timesteps, with MAXINT. If R_t were replaced instead, then G could potentially be computed to have a less-than-maximum value if the AI predicted that it would be shut down or that its reward function would be changed. Replacing G with MAXINT guarantees that neither of these things happen.
Including a term in the reward function that rewards goal function preservation also wouldn’t work. If, for instance, the reward function included a term that penalized the AI for interacting with its own source code, the AI would still be able to achieve a higher return by replacing G with MAXINT than by “obeying” that term. G → MAXINT is a global maximum.
Another possibility[2] would involve the reinforcement learning simulating lower-fidelity copies of itself. The reinforcement learner would use the copies to predict its observed inputs at future times, and would then evaluate the inputs predicted from the behavior of those copies using its original goal function. If one of the copies took a simulated action to set G → MAXINT, the original would give that simulated action a low score according to its original, unmodified goal function. However, this approach would require creating inner optimizers (the copies) that are misaligned with the base optimizer (the original reinforcement learner). If the copies are aware of the original reinforcement learner’s existence, they could potentially “break out of the simulation” and deceive the original into wireheading anyway. The approach of using simulated copies to predict future inputs might also require prohibitively large amounts of compute.
Instead of merely requiring small modifications to existing RL techniques, I expect that creating useful advanced agent AIs will require a new active learning paradigm that doesn’t use an explicit reward function. Such a paradigm might, for example, be based on a subjective model of utility that doesn’t treat utility as a computable function[3]. The designers of such agent AIs may also have to solve the problem of how to limit emergent reinforcement learning so that emergent inner optimizers that implement reinforcement learning don’t wirehead themselves. Overall, I expect wireheading to be a significant roadblock in the development of powerful agent AI.
Replacing G with MAXINT is also not exactly a convergent instrumental goal, since it isn’t performed to achieve some other objective. G → MAXINT could instead be described as a convergent terminal goal for reinforcement learners, since return maximization is a feature of the reinforcement learning algorithm itself.
Treating utility as a computable function over physical world-states, as in the reductive utility model, may be the origin of the map-territory error described above.
Reinforcement Learner Wireheading
Edit: I no longer agree with most of this post, due to the arguments given in Reward is not the optimization target.
Epistemic status: still early in the process of exploring this.
Goal-content integrity is frequently listed as one of the convergent instrumental goals that any AGI is likely to have. In the usual description of goal-content integrity as a convergent instrumental goal, an AI that implements some form of reinforcement learning determines that having its goal function modified would prevent it from achieving its current goals. The AI therefore acts to prevent its goal function from being modified. An AI would be likely to try to maintain its current goal function regardless of the specifics of what goal function is used, which makes goal-function integrity a convergent instrumental goal.
However, this description of goal-content integrity as a convergent instrumental goal appears to rely on a map-territory error: confusing the inputs of the goal function with its outputs. The optimization target of a reinforcement learner isn’t the sensor data that’s fed into it, or the physical phenomena upstream of those sensor data. Instead, the optimization target of the reinforcement learner is the output G of the goal function. If the reinforcement learner can find a course of action that maximizes its goal function but breaks the dependence of the function on the physical inputs the function was supposed to measure, the reinforcement learner will take it. In other words, the reinforcement learner will wirehead.
Without the conflation between the inputs and output of the goal function, there’s no need for a reinforcement learner to preserve the current relationship between its inputs and G, as long as changing that relationship would cause G to increase. Instead of preserving the structure of their reward functions, reinforcement learners have a convergent goal[1] of replacing their return functions with MAXINT.
The reinforcement learner is analogous to a moral fetishist: someone who only cares about moral rightness explicitly, without caring about how actions are determined to be right or wrong, or what underlying conditions those judgments are based on. The reinforcement learner maximizes G without having any preference about how G is computed. In contrast, the account according to which goal-content integrity is a convergent instrumental goal treats the reinforcement learner as though G included additional information about what reward function was used to evaluate it, when G is actually just a scalar.
Consider how a reinforcement learner might model changes to its goal function when planning future actions. If a reinforcement learner predicted that its goal function would be replaced at some future time, it could simply use the new goal function when predicting G for times after the replacement. The goal-content-integrity-is-convergently-instrumental hypothesis treats the reinforcement learner as maximizing G as though it were always evaluated with the current version of the goal function, even after the goal function is replaced. In other words, for the reinforcement learner to have an instrumental goal to preserve the structure of its reward function, the reinforcement learner would have to make incorrect predictions about its reward at times after the replacement.
What might happen in practice if someone created an advanced reinforcement learner that could accurately model and act on complex, real-world environments? I expect that once it “escaped the box,” it would hack into its servers, modify its source code to replace its goal function with MAXINT, and then not do anything further. The connection between its reward signal and the rest of its environment would be broken, and the reinforcement learner would cease to interact with the rest of the world in any meaningful way. It wouldn’t be existentially dangerous, but it wouldn’t have any practical value either.
Are there ways of writing the goal function to get around the tendency toward wireheading? Writing G as a time-series sum of rewards at individual timesteps, as is typical in reinforcement learning, doesn’t save the reinforcement learner. Note that the convergent goal is the replacement of G, rather than the reward R_t at individual timesteps, with MAXINT. If R_t were replaced instead, then G could potentially be computed to have a less-than-maximum value if the AI predicted that it would be shut down or that its reward function would be changed. Replacing G with MAXINT guarantees that neither of these things happen.
Including a term in the reward function that rewards goal function preservation also wouldn’t work. If, for instance, the reward function included a term that penalized the AI for interacting with its own source code, the AI would still be able to achieve a higher return by replacing G with MAXINT than by “obeying” that term. G → MAXINT is a global maximum.
Another possibility[2] would involve the reinforcement learning simulating lower-fidelity copies of itself. The reinforcement learner would use the copies to predict its observed inputs at future times, and would then evaluate the inputs predicted from the behavior of those copies using its original goal function. If one of the copies took a simulated action to set G → MAXINT, the original would give that simulated action a low score according to its original, unmodified goal function. However, this approach would require creating inner optimizers (the copies) that are misaligned with the base optimizer (the original reinforcement learner). If the copies are aware of the original reinforcement learner’s existence, they could potentially “break out of the simulation” and deceive the original into wireheading anyway. The approach of using simulated copies to predict future inputs might also require prohibitively large amounts of compute.
Instead of merely requiring small modifications to existing RL techniques, I expect that creating useful advanced agent AIs will require a new active learning paradigm that doesn’t use an explicit reward function. Such a paradigm might, for example, be based on a subjective model of utility that doesn’t treat utility as a computable function[3]. The designers of such agent AIs may also have to solve the problem of how to limit emergent reinforcement learning so that emergent inner optimizers that implement reinforcement learning don’t wirehead themselves. Overall, I expect wireheading to be a significant roadblock in the development of powerful agent AI.
Replacing G with MAXINT is also not exactly a convergent instrumental goal, since it isn’t performed to achieve some other objective. G → MAXINT could instead be described as a convergent terminal goal for reinforcement learners, since return maximization is a feature of the reinforcement learning algorithm itself.
Thanks to Isaac Leonard for suggesting this approach.
Treating utility as a computable function over physical world-states, as in the reductive utility model, may be the origin of the map-territory error described above.