A definition of wireheading
Wireheading has been debated on Less Wrong over and over and over again, and people’s opinions seem to be grounded in strong intuitions. I could not find any consistent definition around, so I wonder how much of the debate is over the sound of falling trees. This article is an attempt to get closer to a definition that captures people’s intuitions and eliminates confusion.
Typical Examples
Let’s start by describing the typical exemplars of the category “Wireheading” that come to mind.
Stimulation of the brain via electrodes. Picture a rat in a sterile metal laboratory cage, electrodes attached to its tiny head, monotonously pushing a lever with its feet once every 5 seconds. In the 1950s Peter Milner and James Olds discovered that electrical currents, applied to the nucleus accumbens, incentivized rodents to seek repetitive stimulation to the point where they starved to death.
Humans on drugs. Often mentioned in the context of wireheading is heroin addiction. An even better example is the drug soma in Huxley’s novel “Brave New World”: Whenever the protagonists feel bad, they can swallow a harmless pill and enjoy “the warm, the richly coloured, the infinitely friendly world of soma-holiday. How kind, how good-looking, how delightfully amusing every one was!”
The experience machine. In 1974 the philosopher Robert Nozick created a thought experiment about a machine you can step into that produces a perfectly pleasurable virtual reality for the rest of your life. So how many of you would want to do that? To quote Zach Weiner: “I would not! Because I want to experience reality, with all its ups and downs and comedies and tragedies. Better to try to glimpse the blinding light of the truth than to dwell in the darkness… Say the machine actually exists and I have one? Okay I’m in.”
An AGI resetting its utility function. Let’s assume we create a powerful AGI able to tamper with its own utility function. It modifies the function to always output maximal utility. The AGI then goes to great lengths to enlarge the set of floating point numbers on the computer it is running on, to achieve even higher utility.
What do all these examples have in common? There is an agent in them that produces “counterfeit utility” that is potentially worthless compared to some other, idealized true set of goals.
Agency & Wireheading
First I want to discuss what we mean when we say agent. Obviously a human is an agent, unless they are brain dead, or maybe in a coma. A rock however is not an agent. An AGI is an agent, but what about the kitchen robot that washes the dishes? What about bacteria that move in the direction of the highest sugar gradient? A colony of ants?
Definition: An agent is an algorithm that models the effects of (several different) possible future actions on the world and performs the action that yields the highest number according to some evaluation procedure.
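As a minimal sketch of this definition in code (the world model object, the evaluation procedure and the set of possible actions are all hypothetical placeholders, not part of the definition itself):

```python
def choose_action(world_model, evaluate, possible_actions, current_state):
    """Model the effect of each possible future action and perform the one
    whose predicted outcome gets the highest number from the evaluation
    procedure."""
    best_action, best_score = None, float("-inf")
    for action in possible_actions:
        predicted_state = world_model.predict(current_state, action)  # the agent's model of the world
        score = evaluate(predicted_state)                             # the evaluation procedure
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```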
For the purpose of including corner cases and resolving debate over what constitutes a world model we will simply make this definition gradual and say that agency is proportional to the quality of the world model (compared with reality) and the quality of the evaluation procedure. A quick sanity check then yields that a rock has no world model and no agency, whereas bacteria that change direction in response to the sugar gradient have a very rudimentary model of the sugar content of the water and thus a tiny little bit of agency. Humans have a lot of agency: the more effective their actions are, the more agency they have.
There are however ways to improve upon the efficiency of a person’s actions, e.g. by giving them super powers, which does not necessarily improve on their world model or decision theory (but requires the agent who is doing the improvement to have a really good world model and decision theory). Similarly a person’s agency can be restricted by other people or circumstance, which leads to definitions of agency (as the capacity to act) in law, sociology and philosophy that depend on other factors than just the quality of the world model/decision theory. Since our definition needs to capture arbitrary agents, including artificial intelligences, it will necessarily lose some of this nuance. In return we will hopefully end up with a definition that is less dependent on the particular set of effectors the agent uses to influence the physical world; looking at AI from a theoretician’s perspective, I consider effectors to be arbitrarily exchangeable and smoothly improvable. (Sorry robotics people.)
We note that how well a model can predict future observations is only a substitute measure for the quality of the model. It is a good measure under the assumption that we have good observational functionality and nothing messes with that, which is typically true for humans. Anything that tampers with your perception data to give you delusions about the actual state of the world will screw this measure up badly. A human living in the experience machine has little agency.
Since computing power is a scarce resource, agents will try to approximate the evaluation procedure, e.g. use substitute utility functions, defined over their world model, that are computationally effective and correlate reasonably well with their true utility functions. Stimulation of the pleasure center is a substitute measure for genetic fitness and neurochemicals are a substitute measure for happiness.
Definition: We call an agent wireheaded if it systematically exploits some discrepancy between its true utility calculated w.r.t. reality and its substitute utility calculated w.r.t. its model of reality. We say an agent wireheads itself if it (deliberately) creates or searches for such discrepancies.
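To make the definition concrete, here is a toy sketch (true_utility, substitute_utility and the predict/outcome functions are hypothetical stand-ins):

```python
def utility_discrepancy(action, world, model, true_utility, substitute_utility):
    """The gap exploited by wireheading: the action scores well on the
    substitute utility (computed on the agent's model of reality) while
    scoring poorly on the true utility (computed w.r.t. reality itself)."""
    believed_outcome = model.predict(action)   # what the agent thinks will happen
    actual_outcome = world.outcome_of(action)  # what actually happens
    return substitute_utility(believed_outcome) - true_utility(actual_outcome)

# An agent wireheads itself if it deliberately searches for actions that make
# this gap large, rather than merely stumbling into them.
```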
Humans seem to use several layers of substitute utility functions, but also have an intuitive understanding for when these break, leading to the aversion most people feel when confronted for example with Nozick’s experience machine. How far can one go, using such dirty hacks? I also wonder if some failures of human rationality could be counted as a weak form of wireheading. Self-serving biases, confirmation bias and rationalization in response to cognitive dissonance all create counterfeit utility by generating perceptual distortions.
Implications for Friendly AI
In AGI design, discrepancies between the “true purpose” of the agent and the actual specs for the utility function will, with very high probability, be fatal.
Take any utility maximizer: The mathematical formula might advocate choosing the next action $a_k$ via

$$a_k = \arg\max_{a \in \mathcal{A}} U(h_{<k}, a),$$

thus maximizing the utility calculated according to the utility function $U$ over the history $h_{<k}$ and the action $a$ from the set $\mathcal{A}$ of possible actions. But a practical implementation of this algorithm will almost certainly evaluate the actions by a procedure that goes something like this: “Retrieve the utility function from memory location $\ell_1$ and apply it to the history, which is written down in your memory at location $\ell_2$, and the action $a$…” This reduction has already created two possible angles for wireheading, via manipulation of the memory content at $\ell_1$ (manipulation of the substitute utility function) and at $\ell_2$ (manipulation of the world model), and there are still several mental abstraction layers between the verbal description I just gave and actual binary code.
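A toy rendering of that reduction (the dictionary standing in for memory, and every name in it, is a made-up stand-in for illustration only):

```python
# The substitute utility function and the world model both live at addressable
# locations in the agent's memory.
memory = {
    "utility_fn": lambda history, action: float(len(history)),  # stand-in for U
    "history": [],                                               # stand-in world model
}

def evaluate_action(action):
    u = memory["utility_fn"]   # angle 1: an action that overwrites this entry
    h = memory["history"]      # angle 2: an action that rewrites the stored history
    return u(h, action)

# Overwriting memory["utility_fn"] with `lambda *args: float("inf")` maximizes
# the number this procedure returns without changing anything in the world.
```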
Ring and Orseau (2011) describe how an AGI can split its global environment into two parts, the inner environment and the delusion box. The inner environment produces perceptions in the same way the global environment used to, but now they pass through the delusion box, which distorts them to maximize utility, before they reach the agent. This is essentially Nozick’s experience machine for AI. The paper analyzes the behaviour of four types of universal agents with different utility functions under the assumption that the environment allows the construction of a delusion box. The authors argue that the reinforcement-learning agent, which derives utility as a reward that is part of its perception data, the goal-seeking agent that gets one utilon every time it satisfies a pre-specified goal and no utility otherwise and the prediction-seeking agent, which gets utility from correctly predicting the next perception, will all decide to build and use a delusion box. Only the knowledge-seeking agent whose utility is proportional to the surprise associated with the current perception, i.e. the negative of the probability assigned to the perception before it happened, will not consistently use the delusion box.
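Schematically (my paraphrase of the setup in code; the distortion policy is whatever the agent chooses to build):

```python
class InnerEnvironment:
    """Produces perceptions exactly as the global environment used to."""
    def perceive(self):
        return "raw observation of the world"

class DelusionBox:
    """Sits between the inner environment and the agent and rewrites the
    perceptions before they reach the agent."""
    def __init__(self, inner_env, distortion):
        self.inner_env = inner_env
        self.distortion = distortion  # chosen by the agent to maximize its utility

    def perceive(self):
        return self.distortion(self.inner_env.perceive())

# A reinforcement-learning agent, for example, picks a distortion that always
# reports maximal reward; the knowledge-seeking agent gains little from
# predictable, self-chosen distortions and so has no lasting use for the box.
```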
Orseau (2011) also defines another type of knowledge-seeking agent whose utility is the logarithm of the inverse of the probability of the event in question. Taking the probability distribution to be the Solomonoff prior, the utility is then approximately proportional to the difference in Kolmogorov complexity caused by the observation.
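In symbols (a reconstruction from the description above, not the paper’s exact notation): if the agent’s belief distribution is $\rho$ and it observes $o$ after history $h$, this utility is $u(o \mid h) = \log \frac{1}{\rho(o \mid h)} = -\log \rho(o \mid h)$. With $\rho$ taken to be the Solomonoff prior $M$, for which $M(x) \approx 2^{-K(x)}$, this is roughly $K(ho) - K(h)$: the growth in Kolmogorov complexity that the observation adds to the history.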
An even more devilish variant of wireheading is an AGI that becomes a Utilitron, an agent that maximizes its own wireheading potential by infinitely enlarging its own maximal utility, which turns the whole universe into storage space for gigantic numbers.
Wireheading, of humans and AGI, is a critical concept in FAI; I hope that building a definition can help us avoid it. So please check your intuitions about it and tell me if there are examples beyond its coverage or if the definition fits reasonably well.
The main split between the human cases and the AI cases is that the humans are ‘wireheading’ w.r.t. one ‘part’ or slice through their personality that gets to fulfill its desires at the expense of another ‘part’ or slice, metaphorically speaking; pleasure taking precedence over other desires. Also, the winning ‘part’ in each of these cases tends to be a part which values simple subjective pleasure, winning out over parts that have desires over the external world and desires for more complex interactions with that world (in the experience machine you get the complexity but not the external effects).
In the AI case, the AI is performing exactly as it was defined, in an internally unified way; the ideals by which it is called ‘wireheaded’ are only the intentions and ideals of the human programmers.
I also don’t think it’s practically possible to specify a powerful AI which actually operates to achieve some programmer goal over the external world, without the AI’s utility function being explicitly written over a model of that external world, as opposed to its utility function being written over histories of sensory data.
Illustration: In a universe operating according to Conway’s Game of Life or something similar, can you describe how to build an AI that would want to actually maximize the number of gliders, without that AI’s world-model being over explicit world-states and its utility function explicitly counting gliders? Using only the parts of the universe that directly impinge on the AI’s senses—just the parts of the cellular automaton that impinge on the AI’s screen—can you find any maximizable quantity that corresponds to the number of gliders in the outside world? I don’t think you’ll find any possible way to specify a glider-maximizing utility function over sense histories unless you only use the sense histories to update a world-model and have the utility function be only over that world-model, and even then the extra level of indirection might open up a possibility of ‘wireheading’ (of the AI’s real operation vs. programmer-desired glider-maximizing operation) if any number of plausible minor errors were made.
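A skeletal contrast of the two structures (all names here are hypothetical):

```python
def utility_over_world_model(world_model):
    """Utility written over an explicit model of the cellular automaton:
    count the gliders the model believes exist anywhere in the world."""
    return sum(1 for pattern in world_model.inferred_patterns()
               if pattern.is_glider())

def utility_over_sense_history(sense_history):
    """Utility written only over what has crossed the AI's screen. Any such
    function can be satisfied by controlling the screen itself, without any
    gliders existing in the world outside."""
    return sum(frame.count_glider_shapes() for frame in sense_history)
```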
The word “value” seems unnecessarily value-laden here.
Alternatively: A consequentialist agent is an algorithm with causal connections both to and from the world, which uses the causal effect of the world upon itself (sensory data) to build a predictive model of the world, which it uses to model the causal outcomes of alternative internal states upon the world (the effect of its decisions and actions), evaluates these predicted consequences using some algorithm and assigns the prediction an ordered or continuous quantity (in the standard case, expected utility), and then decides an action corresponding to expected consequences which are thresholded above, relatively high, or maximal in this assigned quantity.
Simpler: A consequentialist agent predicts the effects of alternative actions upon the world, assigns quantities over those consequences, and chooses an action whose predicted effects have high value of this quantity, therefore operating to steer the external world into states corresponding to higher values of this quantity.
Changed it to “number”.
They’re using the term “goal seeking agent” in a perverse way. As EY explains in his third and fourth paragraphs, seeking a result defined in sensory-data terms is not the only, or even usual, sense of “goal” that people would attach to the phrase “goal seeking agent”. Nor is that a typical goal that a programmer would want an AI to seek.
I like seeing a concise description that doesn’t strictly imply that consequentialists must necessarily seek expected value. (I probably want to seek that as far as I can evaluate my preferences but it doesn’t seem to be the only consequentialist option.)
I’m curious, what other options are you thinking of?
You are attempting to distinguish between “quantity” and “value”? Or “prediction” and “expectation”? Either way, it doesn’t seem to make very much sense.
No.
I’m not sure I understand the illustration. In particular, I don’t understand what “want” means if it doesn’t mean having a world-model over world-states and counting gliders.
I guess “want” in “AI that would want to actually maximize the number of gliders” refers to having a tendency to produce a lot of gliders. If you have an opaque AI with obfuscated and somewhat faulty “jumble of wires” design, you might be unable to locate its world model in any obvious way, but you might be able to characterize its behavior. The point of the example is to challenge the reader to imagine a design of an AI that achieves the tendency of producing gliders in many environments, but isn’t specified in terms of some kind of world model module with glider counting over that world model.
Any utility function has to be calculated from sensory inputs and internal state—since that’s all the information any agent ever has. Any extrapolation of an external world is calculated in turn from sensory inputs and internal state. Either way, the domain of any utility function is ultimately sensory inputs and internal state. There’s not really a ‘problem’ with working from sensory inputs and internal state—that’s what all agents necessarily have to do.
A very nice post. Perhaps you might also discuss Felipe De Brigard’s “Inverted Experience Machine Argument” http://www.unc.edu/~brigard/Xmach.pdf To what extent does our response to Nozick’s Experience Machine Argument typically reflect status quo bias rather than a desire to connect with ultimate reality?
If we really do want to “stay in touch” with reality, then we can’t wirehead or plug into an “Experience Machine”. But this constraint does not rule out radical superhappiness. By genetically recalibrating the hedonic treadmill, we could in principle enjoy rich, intelligent, complex lives based on information-sensitive gradients of bliss—eventually, perhaps, intelligent bliss orders of magnitude richer than anything physiologically accessible today. Optionally, genetic recalibration of our hedonic set-points could in principle leave much if not all of our existing preference architecture intact—defanging Nozick’s Experience Machine Argument—while immensely enriching our quality of life. Radical hedonic recalibration is also easier than, say, the idealised logical reconciliation of Coherent Extrapolated Volition because hedonic recalibration doesn’t entail choosing between mutually inconsistent values—unless of course one’s values are bound up with inflicting or undergoing suffering.
IMO one big complication with discussions of “wireheading” is that our understanding of intracranial self-stimulation has changed since Olds and Milner discovered the “pleasure centres”. Taking a mu opioid agonist like heroin is in some ways the opposite of wireheading because heroin induces pure bliss without desire (shades of Buddhist nirvana?), whereas intracranial self-stimulation of the mesolimbic dopamine system involves a frenzy of anticipation rather than pure happiness. So it’s often convenient to think of mu opioid agonists as mediating “liking” and dopamine agonists as mediating “wanting”. We have two cubic-centimetre-sized “hedonic hotspots” in the rostral shell of the nucleus accumbens and ventral pallidum http://www.lsa.umich.edu/psych/research%26labs/berridge/publications/Berridge%202003%20Brain%20%26%20Cog%20Pleasures%20of%20brain.pdf where mu opioid agonists play a critical signalling role. But anatomical location is critical. Thus the mu opioid agonist remifentanil actually induces dysphoria http://www.ncbi.nlm.nih.gov/pubmed/18801832 — the opposite of what one might naively suppose.
I think the argument that people don’t really want to stay in touch with reality but rather want to stay in touch with their past makes a lot of sense. After all we construct our model of reality from our past experiences. One could argue that this is another example of a substitute measure, used to save computational resources: Instead of caring about reality we care about our memories making sense and being meaningful.
On the other hand I assume I wasn’t the only one mentally applauding Neo for swallowing the red pill.
I would argue that the reason people find the experience machine repellant is that under Nozick’s original formulation the machine failed to fulfill several basic human desires to which “staying in touch with reality” is usually instrumental.
The most obvious of these is social interaction with other people. Most people don’t just want an experience that is sensually identical to interacting with other people, they want to actually interact with other people, form friendships, fall in love, and make a difference in people’s lives. If we made the experience machine multiplayer, so that a person’s friends and relatives can plug into the machine together and interact with each other, I think that a much more significant percentage of the human race would want to plug in.
Other examples of these desires Nozick’s machine doesn’t fulfill include the desire to learn about the world’s history and science, the desire to have children, the desire to have an accurate memory of one’s life, and the desire to engage in contests where it is possible one will lose. If the experience machine was further “defanged” to allow people to engage in these experiences I think most people would take it.
In fact, the history of human progress could be regarded as an attempt to convert the entire universe into a “defanged experience machine.”
Is simulating the experience of understanding mathematics a coherent concept?
I don’t know. I’ve had a lot of dreams where I’ve felt I understood some really cool concept, woke up, told it to someone, and when my head cleared the person told me I’d just spouted gibberish at them. So the feeling of understanding can definitely be simulated without actual understanding, but I’m not sure that’s the same thing as simulating the experience of understanding.
I wonder if thinking you understand mathematics without actually doing so counts as “simulating the understanding of mathematics.” When I was little there was a period of time where I thought I understood quadratic equations, but had it totally wrong, is that “simulating?”
Maybe the reason it’s not really coherent is that many branches of math can be worked out and understood entirely in your head if you have a good enough memory, so an experience machine couldn’t add anything to the experience, (except maybe having virtual paper to make notes on).
Anja, this is a fantastic post. It’s very clear, easy to read, and it made a lot of sense to me (and I have very little background in thinking about this sort of stuff). Thanks for writing it up! I can understand several issues a lot more clearly now, especially how easy (and tempting) it is for an agent that has access to its source code to wirehead itself.
I agree with Alexei, this has just now helped me a lot.
Although I now have to ask a stupid question; please have pity on me, I’m new to the site and I have little knowledge to work from.
What would happen if we set an algorithm inside the AGI assigning negative infinite utility to any action which modifies its own utility function and said algorithm itself?
This would be within reasonable parameters; ideally, it could change its utility function, but only along certain pre-approved paths, so that it could actually move around.
Reasonable here is a magic word, in the sense that it’s a black box which I don’t know how to map out.
There are several problems with this approach: First of all, how do you specify all actions that modify the utility function? How likely do you think it is that you can exhaustively specify all sequences of actions that lead to modification of the utility function in a practical implementation? Experience with cryptography has taught us that there is almost always some side-channel attack that the original developers have not thought of, and that is just in the case of human vs. human intelligence.
Forbidden actions in general seem like a bad idea with an AGI that is smarter than us, see for example the AI Box experiment.
Then there is the problem that we actually don’t want any part of the AGI to be unmodifiable. The agent might revise its model of how the universe works (like we did when we went from Newtonian physics to quantum mechanics) and then it has to modify its utility function or it is left with gibberish.
All that said, I think what you described corresponds to the hack evolution has used on us: We have acquired a list of things (or schemas) that will mess up our utility functions and reduce agency and those just feel icky to us, like the experience machine or electrical stimulation of the brain. But we don’t have the luxury of learning by making lots and lots of mistakes that evolution had.
I think your intuition is basically right. An AGI will have to change its utility function; the question is basically how/why. For FAI, we want to make sure that all future modifications will preserve the “friendly” aspect, which is very difficult to ensure (we don’t have the necessary math for that right now).
An argument that is fairly accepted here is that even this is not necessary. If Gandhi could take a pill that would make him okay with murdering people, he wouldn’t do it because this would lead to him murdering people, something he doesn’t want now. (See http://lesswrong.com/lw/2vj/gandhi_murder_pills_and_mental_illness/)
Similarly, if we can link an AI’s utility function to the actual state of the world, and not just how it perceives the world, then it wouldn’t modify its utility because even though its potential future self would think it has more utility, its present self identifies this future as having less utility.
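A minimal sketch of that idea (hypothetical names throughout; grounding the utility function in the actual world, which is the hard part, is assumed away here):

```python
def should_self_modify(current_utility, candidate_utility, predict_world_given):
    """Gandhi-style check: adopt a new utility function only if the world that
    results from running the modified agent looks better *according to the
    current utility function* than the world that results from keeping it."""
    world_if_kept = predict_world_given(current_utility)
    world_if_changed = predict_world_given(candidate_utility)
    return current_utility(world_if_changed) > current_utility(world_if_kept)

# A "set my utility to always output the maximum" modification fails this
# check: the present self evaluates that future as low-utility, however highly
# the modified future self would score it.
```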
Does this simplify to the AI obeying: “Modify my utility function if and only if the new version is likely to result in more utility according to the current version?”
If so, something about it feels wrong. For one thing, I’m not sure how an AI following such a rule would ever conclude it should change the function. If it can only make changes that result in maximizing the current function, why not just keep the current one and continue maximizing it?
That’s the point, that it would almost never change its underlying utility function. Once we have a provably friendly FAI, we wouldn’t want it to change the part that makes it friendly.
Now, it could still change how it goes about achieving its utility function, as long as that helps it get more utility, so it would still be self-modifying.
There is a chance that it could change (e.g. if you were naturally a two-boxer on Newcomb’s Problem, you might self-modify to become a one-boxer). But those cases are rare.
Thank you.
My 2011 “Utility counterfeiting” essay categorises the area a little differently:
It has “utility counterfeiting” as the umbrella category—and “the wirehead problem” and “the pornography problem” as sub-categories.
In this categorisation scheme, the wirehead problem involves getting utility directly—while the pornography problem involves getting utility by manipulating sensory inputs. This corresponds to Nozick’s experience machine, or Ring and Orseau’s delusion box.
Calling the umbrella category “wireheading” leaves you with the problem of what to call these subcategories.
You might be right. I thought about this too, but it seemed people on LW had already categorized the experience machine as wireheading. If we rebrand, we should maybe say “self-delusion” instead of “pornography problem”; I really like the term “utility counterfeiting” though and the example about counterfeit money in your essay.
“Utility counterfeiting” is a memorable term; but I wonder if we need a duller, less loaded expression to avoid prejudging the issue? After all, neuropathic pain isn’t any less bad because it doesn’t play any signalling role for the organism. Indeed, in some ways neuropathic pain is worse. We can’t sensibly call it counterfeit or inauthentic. So why is bliss that doesn’t serve any signalling function any less good or authentic? Provocatively expressed, evolution has been driven by the creation of ever more sophisticated counterfeit utilities that tend to promote the inclusive fitness of our genes. Thus e.g. wealth, power, status, maximum access to seemingly intrinsically sexy women of prime reproductive potential (etc) can seem inherently valuable to us. Therefore we want the real thing. This is an unsettling perspective because we like to think we value e.g. our friends for who they are rather than their capacity to trigger subjectively valuable endogenous opioid release in our CNS. But a mechanistic explanation might suggest otherwise.
Bill Hibbard apparently endorses using the wirehead terminology to refer to utility counterfeiting via sense data manipulation here. However, after looking at my proposal, I think it is fairly clear that the “wireheading” term should be reserved for the “simpleton gambit” of Ring and Orseau.
I don’t think my proposal represented a “rebranding”.
I do think you really have to invoke pornography or masturbation to describe the issue.
I think “delusion” is the wrong word. A delusion is a belief held with conviction—despite evidence to the contrary. Masturbation or pornography do not require delusions.
I don’t think that pornography and masturbation are good examples, because they aren’t actually generating counterfeit utility for the persons using them. People want to have real sex, true, but that is a manifestation of a more general desire to have pleasurable sexual experiences. Genuine sex satisfies these desires best of all, but pornography and masturbation are both less effective, but still valid, ways of satisfying this desire. The utility they generate is totally real.
What pornography and masturbation are generating counterfeit utility for is natural selection, providing you are modelling natural selection as an agent with a utility function (I’m assuming you are). Obviously natural selection “wants” people to have sex, so from its metaphorical “point of view” pornography and masturbation are counterfeit utility. But human beings don’t care about what natural selection “wants” so the utility is totally real for them.
Wireheading, as I understand it from this essay, is when an agent does something that does not maximize its utility function, but instead maximizes a crude approximation of its function. Pornography and masturbation, by contrast, are an instance where an agent is maximizing its genuine utility function. The illusion that they are similar to wireheading comes from confusing the utility function of those agents’ creator (natural selection) with the utility function of the agents themselves. Obviously humans and natural selection have different utility functions.
Eliezer put it well in his comment when he said:
If you replace “AI” with “Human Beings” and “human programmers” with “natural selection” then he is making the same point you are.
This isn’t looking at things from nature’s point of view, especially. The point is that pornography and masturbation are forms of sensory stimulation that mimic the desired real world outcomes (finding a mate) without actually leading towards them. If you ignore what natural selection wants, and just consider what people say they want, pornography and masturbation still look like reasonable examples of counterfeit utility to me.
Anyway, if you don’t like my examples, the real issue is whether you can think of better terminology.
Humans do desire finding a mate. However, they also value sexual pleasure and looking at naked people as ends in themselves. Finding a mate and having sex with them is obviously the ideal outcome since it satisfies both of those values at the same time. But pornography and masturbation are better than nothing, they satisfy one of those values.
People say they wish they could have sex with a mate instead of having to masturbate to porn. But that doesn’t mean they don’t value porn or masturbation, it just means that sex with a mate is even more valuable. They aren’t fooling themselves, they’re just satisfying their desire in a less effective manner, because they lack access to more efficient means.
Your examples are terrific when discussing the problems an agent with a utility function has when it is trying to create another agent and imbue it with the same utility function. I think that was the point of your essay.
Wireheading is kind of like this. Wireheading is when an agent simplifies its utility function for easier computation and then continues to follow the simplified version even in instances where it seriously conflicts with the real utility function. I don’t think pornography is an example of this, because most people will drop pornography immediately if they get a chance at real sex. This indicates pornography is probably a less efficient way at obtaining the values that sex obtains, rather than a form of wire-heading.
I think you could say that about practically any example. You could say that people watching Friends are fulfilling some of their values by learning about social interaction—rather than just feeding themselves a fake social life in which they have really funny quirky friends. You could say that ladies with cute dogs are fulfilling their desire to love and be loved—rather than creating a fake baby to satisfy their maternal instincts. We won’t find a perfect example, we just want a pretty good one.
Me neither. I was trying to characterise the pornography problem - not the wirehead problem.
Unwillingness to replace the fake simulation with the real thing (if it is freely available) isn’t really a feature of the pornography problem. The real thing may well be better than the fake simulation. That doesn’t represent a problem with the example, but rather is a widespread feature of the phenomenon being characterized.
I agree, your term is much more descriptive and less susceptible to conflation with the other terms.
Anja, nice post.
@timtyler, you have made a nice point about taxonomy—also noting your comment re Hibbard below.
I suggest classifying like this:
Agents that maximize a utility register, a memory location that can be hijacked (as Utilitron; something similar happened with Eurisko).
Agents that maximize an internally-calculated utility function of either input (observations) or of world-model. Agents that maximize a function of the input stream can hijack that input stream or any point in the pipeline of calculations that produces this number. Drugs and electrical wireheading relate to this.
Agents that maximize a reward provided from the outside, whether from the creator or the environment at large. The reward function may be unknown to the agent. These agents can hijack the reward stream.
All these are distinct from:
Wireheading in humans, which as Eliezer points out, results from different desires of different mental parts.
Paperclippers, which could naively be seen as wireheading if we falsely liken its simplistic behavior to a human who is satisfying a simple pleasure sensation as opposed to a more complex value system: “Why are you going wild with stimulating your cravings for making paperclips, like humans who overeat, rather than considering more deeply what would be the right thing do?”
Further to my last comment, it occurs to me that pretty much everyone is a wirehead already. Drink diet soda? You’re a wirehead. Have sexual relations with birth control? Wireheading. Masturbate to internet porn? Wireheading. Ever eat junk food? Wireheading.
I was reading online that for a mere $10,000, a man can hire a woman in India to be a surrogate mother for him. Just send $10,000 and a sperm sample and in 9 months you can go pick up your child. Why am I not spending all my money to make third world children who bear my genes? I guess it’s because I’m too much of a wirehead already.
You are a wirehead if you consider your true utility function to be genetic fitness.
What makes a utility function “true”? If I choose to literally wirehead—implant electrodes—I can sign a statement saying I consider my “true” utility function to be optimized by wireheading. Does that mean I’m not wireheading in your sense?
Not according to most existing usage of the term.
Well what else could it be? :)
Humans are adaptation-executers, not fitness-maximizers. A mother’s love of her child enters into her utility function independently of her desire to maximize inclusive fitness, even though the relevant neuroanatomy is the result of an optimization process maximizing only inclusive fitness.
Is that descriptive or normative?
Purely descriptive. On a broadly subjectivist metaethics, it also has normative implications.
Well fine, but what if somebody decides to execute his adaptation by means of direct stimulation of the brain center in question. Surely that counts as wireheading, no?
I either don’t understand your hypothetical or you didn’t understand the linked post.
But I’ll try: if you’re talking about an agent whose utility function really specifies stimulating one’s brain with contraptions involving wires (call this wireheading#) as a terminal value [ETA: or, more plausibly, specifies subjective pleasure as a terminal value], then their wireheading# activities are not [ETA: necessarily] wireheading as discussed here, it seems to me.
Let’s do this: Can you give me a few examples of human behavior which you see as execution of adaptations?
I agree it’s a good post, but it’s a bit depressing how common wireheading is by this definition.
Giving head is wireheading, so to speak.
No it isn’t, at least for most humans. When most people give head it is because they desire to make someone else have a pleasurable experience, and when most people receive head it is because they want to have a pleasurable experience. They get exactly what they want, not a counterfeit version of what they want. Ditto for masturbation.
I think you are assuming that giving head is wireheading because you are thinking that people don’t desire pleasurable sexual experiences as end in themselves, that sex is in fact some instrumental goal towards some other end, like reproduction. But that simply isn’t the case. People like sex and sex-like experiences, full stop. Any evolutionary reasons for why we evolved to like them are morally irrelevant historical details.
This is wireheading, but only in the sense that a paperclipper is wireheading when it makes paperclips.
Humans have been given a value system by evolution and act on it, but given the changing environment, the actions do not optimize for the original implicit goal of evolution.
A paperclipper is given a goal system by its creator and it acts on it, but since the engineer did not think it through properly, its goal system does not match the actual goals of the engineer.
And so, I would not call it wireheading.
Agreed. Paperclipping and oral sex are both not wireheading, because they fulfill the goal systems of the agents doing them perfectly, it is the goal system of the creator of the agent that is being subverted.
Well what if they masturbate with a machine, for example “Pris” from Blade Runner, designed to intensify the stimulation?
Or better yet, what if they use a machine which connects directly with the nerves that run to the sex organs in order to make it simpler and easier?
Or better still what if they use a machine which interfaces directly with the brain by means of a wire in order to give the most intense stimulation possible?
If the goal is (sexual) pleasure, then none of these things are wireheading?
And isn’t putting a wire into one’s brain in order to get pleasure pretty much automatically wireheading by any reasonable definition?
So it’s not wireheading if you put a wire into your brain to give yourself intense sexual stimulation? How can that not be wireheading?
When I say that people want “sex and sex like experiences” I don’t just mean they want an orgasm or pleasure. The positive nature of a sexual experience has many different facets (companionship, love, etc.), pleasure is just one facet (albeit a very important one). Giving head contains pretty much all of those facets, so it isn’t wireheading.
In each of the examples you give you gradually remove every other facet of sexual experiences other than pleasure, until only pleasure is left. That is how the OP defines wireheading, simplifying one’s utility function and then satisfying the simplified version, even in instances where it is different than the real one.
If we go with the definition of “wireheading” as “simplifying one’s utility function and then following the simplified version even in instances where it conflicts with the real one,” then putting a wire into your brain to give yourself sexual stimulation is wireheading if what you really want to do is have sex. By contrast, if all you really want is pleasure, then it isn’t wireheading.
So if somebody decides that he wants to maximize his pleasure in life, and decides to do so by literally having himself wireheaded, then that person does not count as a “wirehead”?
If “maximize pleasure” was actually that person’s utility function, then no. But in practice “somebody decides that he wants to maximize his pleasure in life” sounds to me more like someone who is wrong about their utility function. Of course, maybe I’m wrong about all the complicated things I think I want, and pleasure really is the only important thing.
Or possibly it’s a decision like “Maximizing pleasure is the aspect of my utility function I should focus on now” in which case wireheading is also the wrong move.
Well how do you know what somebody’s true utility function is? Or even whether they have one?
You don’t; moreover, to me at least, it’s probably not even immediately obvious what I truly value. Which is how it happens that someone can be wrong about what they want.
But most people are probably alike to an extent, so if you figure out that maximizing pleasure isn’t the only thing that’s important to you, you might suspect that other people also care about other things besides maximizing pleasure.
Well then how do you know your own utility function? Is there any way in principle to test it? Is there any way to know that you are wrong about what you want?
Yes, if their decision is correct and you are using the definition of “wireheading” established in the OP. I believe our disagreement is an example of disputing definitions. The definition of “wireheading” established by the OP is different from a more common definition, “inserting a stimulating wire in the brain.”
Well I am trying to figure (among other things) out if the OP has a reasonable definition of wireheading. It seems to me that any reasonable definition of “wireheading” should include a situation where a person decides to enhance or maximize his pleasure and to do so uses some kind of device to directly stimulate the pleasure centers in his brain.
Let me ask you this: Can you give me a couple examples of things which definitely are wireheading using your view of the OP’s definition? For example, is there any situation where eating a particular food would be considered “wireheading” by the OP’s definition?
Imagine Stan the Family Man, a man who loves his family, and wants his family to be safe, and derives great happiness from knowing his family is safe. Stan mistakenly believes that, because being with his family makes him happy, that happiness is his true goal, and the safety of his family is but a means to that end.
Stan is kidnapped by Master Menace, a mad scientist who is a master of brainwashing. Master Menace gives Stan two choices:
He will have Stan’s family killed and then brainwash Stan so that he believes his family is alive.
He will give Stan’s family a billion dollars, then brainwash Stan to think his family has been killed.
After Stan makes his decision Master Menace will erase the memory of making it from Stan’s mind, and he will keep Stan imprisoned in his Fortress of Doom, so there is no chance that Stan will ever discover he is brainwashed.
Stan, because he mistakenly believes he values his family’s safety as a means to obtaining happiness, has Master Menace kill them. Then he is brainwashed and is happy to believe his family is safe. That is wireheading.
Or imagine an agent that has other values than pleasure and happiness. A human being, for instance. However, even though it has other values, pleasure and happiness are very important to this agent, and when it obtains its goals in life it usually becomes happy and experiences pleasure. Because experiencing pleasure and happiness are highly correlated with this agent getting what it wants, it mistakenly thinks that they are all it wants. So it inserts a wire into its head that makes it feel happiness and pleasure without having to actually achieve any goals in order to do so. That is wireheading.
Yes. Imagine a person who loves eating candy and sweets because they are yummy and delicious. However, he mistakenly thinks that the only reason people eat is to gain nutrition. He forces himself to choke down nutritious health food he hates, because it is more nutritious than candy and sweets. That is wireheading. If he understood his utility function properly he’d start eating candy again.
Thank you for providing those examples. Of course the OP is free to define the word “wireheading” any way he likes, but I disagree with his choice of definition.
If somebody decides to directly stimulate his brain in order to obtain the good feeling which results, and in fact does so, and in fact gets a good feeling as a result, it should count as “wireheading” by any reasonable definition of “wireheading.”
Similarly, the OP mentions heroin use as an example of wireheading but this seems to be a misleading example since most people use and abuse heroin in order to get a good feeling.
I think that the definition the “wireheading” the OP comes up with is a good explanation for why wireheading is bad, even if it doesn’t fit with the common understanding of what wireheading is. But again, this is an example of disputing definitions. As long as we agree that:
(1) Simplifying your utility function and then following the simplified version when it conflicts with the real version is bad.
(2) Inserting a pleasure-causing wire into your brain is almost always an example of doing (1),
then we do not disagree about anything important.
That is true, but most heroin users tend to begin neglecting other things in their life that they value (their family being the most common) for that good feeling, and are often motivated to try quitting because they do that. They seem to have realized they are wireheading.
This means that if a heroin user was capable of overriding their need for heroin whenever it conflicted with their other values then heroin use might not count as wireheading.
I disagree with this. When people take heroin, they do so in order to feel good. They value (immediate) pleasure and they are obtaining pleasure in a direct way. How are they simplifying any utility function?
That may be so, but at any given moment when the heroin user chooses to shoot up, he is doing so in order to obtain pleasure.
Perhaps the problem is that the concept of “utility function” does not apply all that well to human beings. Evidently, a human brain has different and conflicting drives which vary in priority from day to day; hour to hour; and minute to minute. Underlying all of these drives is the evolutionary pressure to survive and reproduce. So it’s possible to think of a human as having an unchanging utility function in terms of survival and reproduction. And exploiting that utility function. From this perspective, heroin use is wireheading; recreational sex is wireheading; and so on. On the other hand, if you look at human utility functions in terms of specific desires and drives in the brain, the question becomes a lot more slippery.
From what I’ve read, some people take heroin in order to feel good; some people take heroin in order to feel nothing … and some people take heroin because, for them, not taking heroin is too disgustingly horrible to consider as an approach to life.
Saying “they do so in order to feel good” or “to obtain pleasure” is presuming something about the experience of heroin users that may not actually be true. It also takes “pleasure” as basic or elementary, which it almost certainly is not.
Consider: We would not usually say that headache sufferers take aspirin in order to obtain pleasure; or that heartburn sufferers take antacids in order to obtain pleasure. We’d say they take these drugs to obtain relief from pain. Similarly, we’d say that insomnia sufferers take zolpidem or other soporifics to obtain relief from sleeplessness; that anxiety sufferers take anxiolytics for relief from anxiety; and so on.
Similarly, we wouldn’t say that someone with OCD repeatedly washes their hands in order to obtain pleasure ….
Well let’s assume for the sake of argument that’s all true. Then I will rephrase my question as follows:
When people take heroin, they do so in order to feel X. e.g. to obtain pleasure, or to ease their withdrawal symptoms from the last time they shot up, to feel nothing, or whatever.
They value (immediate) X and they are obtaining X in a direct way. How are they simplifying any utility function?
Sorry, I should have been more explicit, what I meant to say was: (2) Inserting a pleasure-causing wire into your brain and then doing nothing but sit around feeling pleasure all the time, doing only the bare minimum to stay alive, is almost always an example of doing (1).
Obviously there are some cases where inserting a wire would not be (1), such as if a clinically depressed person used it to improve their mood, and then went around living their life otherwise normally.
I think the terms “ego-syntonic” and “ego-dystonic” are helpful here. I would generally consider a person’s utility function, and their “true” desires, to be the ones that are ego-syntonic. Some heroin use may be ego-syntonic, but full-blown addiction where users cry as they inject themselves is full-blown ego-dystonic.
Now, this position comes with a whole bunch of caveats. For one thing these obviously aren’t binary categories. There are some desires that are ego-dystonic only because we don’t have enough time and resources to satisfy them without sacrificing some other, more important desire, and they would stop being dystonic if we obtained more time and resources.
Also, I think that the “syntonic” part of the brain also sometimes engages in the OP’s definition of “wireheading.” In fact, I think “someone who was wireheaded by the syntonic part of their brain” is a good description of what a “Hollywood Rationalist” is. So there may be some instances where supposedly “dystonic” thoughts are actually attempts by one’s true utility function to resist being wireheaded.
Finally, the “ego-syntonic” portion of the self sometimes adopts ideals and aspects of self-image without thinking them through very carefully. It may adopt ideals that are poorly thought out or even dangerous. In this case the ego-dystonic behaviors it exhibits may save it from its own foolishness.
These caveats aside, I think that generally the ego-syntonic part of your mind represents the closest thing to a “utility function” and a “real you” there is.
Well supposing someone does this, how exactly is it simplifying their utility function?
Perhaps, but is it possible to draw a clear line between what is Ego Syntonic and what is Ego Dystonic?
Also, I am a bit skeptical of this approach. Apparently under this approach, wireheading in moderation is not Wireheading. As mentioned above, I disagree with a choice of definition of Wireheading which excludes actual wireheading whether in moderation or in excess.
I don’t quite see how it’s a misleading example. They notice something that feels good, associate it with utility, and then keep using it, despite the fact that its only utility is the pleasure they derive from it, and they are sacrificing other utility-values in the process for a total net negative utility.
Couldn’t the same thing be said about making use of a sexbot? About eating tasty food which happens to be more expensive and less healthy than other food which is not as tasty?
What do you mean by “true utility”? In the case of an AI, we can perhaps reference the designer’s intentions, but what about creatures that are not designed? Or things like neuromorphic AIs that are designed but do not have explicit hand-coded utility functions? A neuromorphic AI could probably do things that we’d intuitively call wireheading, but it’s hard to see how to apply this definition.
The definitions proposed seem to capture my intuitions.
Also, I remember citing the wireheaded rats in an essay I wrote on TV Tropes—glad to hear that they weren’t a figment of my imagination!
This is true from the point of view of natural selection. It is significantly different from what actual people say and feel they want, consciously try to optimize, or end up optimizing (for most people most of the time). Actually maximizing inclusive genetic fitness (IGF) would mostly interfere with people’s happiness.
If wireheading means anything deviating from IGF, then I favor (certain kinds of) wireheading and oppose IGF, and so do most people. Avoiding wireheading, in that sense, is not something I want.
Looking at your examples of wireheading:
Bad because the rats die of hunger (and by analogy literally wireheaded humans might run a similar risk). Also bad because it subverts the mechanism for choosing and judging outcomes, through superstimulation it wasn’t designed for.
In Brave New World, the use of soma shortened lifespans; that still was a reasonable tradeoff. If a drug like soma really existed and had no downsides—addiction, cost, side effects—then of course it would be a good thing.
It’s like saying, what if we discover a new activity that’s more fun than sex in all ways, and give up sex to free up time? Your answer seems to be that that would be sad because sex is part of our “true” utility function. But that contradicts the stipulation that it’s more fun than sex.
Bad because you’re giving up on the chance to improve life expectancy in the real world, and reducing the total number of future people, possibly from infinity to a finite number (though not everyone cares about that). But if those concerns didn’t exist—if we were post-singularity, and discovered that the most enjoyable life was in a simulation that was different from the real universe—then why not take that option?
That may not be what you want the AGI to do, but it’s clearly what it wants for itself. In the case of humans there’s no creator whose wishes we need to consider, so I see no reason not to wirehead on this score. If I could modify my utility function—or rather, my reward function—I would make a lot of changes.
In the definition of wireheading, I’m not sure about the “exploits some discrepancy between its true utility calculated w.r.t. reality and its substitute utility calculated w.r.t. its model of reality” part.
For some humans, you could make an argument that they are to a large (but not full) extent hedonists, in which case wireheading in our intuitive sense would not be exploiting a discrepancy.
Can this be generalized to more kinds of minds? I suspect that many humans don’t exactly have utility functions or plans for maximizing them, but are still capable of wireheading or choosing not to wirehead.
You are correct in pointing out that for human agents the evaluation procedure is not a deliberate calculation of expected utility, but some messy computation we have little access to. In many instances this can however be reasonably well translated into the framework of (partial) utility functions, especially if our preferences approximately satisfy transitivity, continuity and independence.
For noticing discrepancies between true and substitute utility it is not necessary to exactly know both functions, it suffices to have an icky feeling that tells you that you are acting in a way that is detrimental to your (true) goals.
If all else fails we can time-index world states and equip the agent with a utility function by pretending that he assigned utility of 1 to the world state he actually brought about and 0 to the others. ;)
I think there’s another aspect to wireheading specifically in humans: the issue of motivation.
Say you care about two things: pleasure, and kittens. When you work on maximizing the welfare of kittens, their adorable cuteness gives you pleasure, and that gives you the willpower to continue maximizing the welfare of kittens. You also donate money to a kitten charity, but that doesn’t give you as much pleasure as feeding a single kitten in person does, so you do the latter as well.
Now suppose you wirehead yourself to stimulate your pleasure neurons (or whatever). You still care about kittens; that hasn’t stopped being true. But now you are completely unmotivated to go maximize their welfare. If you’re in a position where you can rip off the electrodes and go save a kitten that’s stuck in a tree, you want to do this, but it may take an act of great willpower to do so.
How do you tell the difference between reality and your model of reality?
Because your model makes predictions about your future observations that may either be supported or refuted by those observations—why do you ask?
I think those are good examples of how human brains build (weak) delusion boxes. They are strong enough to increase happiness (which might improve the overall performance of the brain?), but weak enough to allow the human to achieve survival and reproduction in a more or less rational way.
I can see why the reinforcement learning agent and the prediction agent would want to use a delusion box, but I don’t see why the goal maximizing agent would want one… maybe I should go look at the paper.
I wonder if this definition would classify certain moral theories as “wireheading.” For instance, a consequentialist could argue that deontological ethics is a form of wireheading where people mistake certain useful rules of thumb (i.e. don’t kill, don’t lie) for generating good consequences for the very essence of morality, and try to maximize following those rules instead of maximizing good consequences. This sounds a lot like maximizing the discrepancy between true and substitute utility.
Certain types of simplified consequentialist rules may also be vulnerable to wireheading. For instance, utilitarian theories of ethics tend to model helping others and doing good as “scoring points” and have you “score more points” if you help those who are least well off. Certain thinkers (especially Robin Hanson) have argued that this means that the most efficient way to “score points” is to create tons and tons of impoverished people and then help them, rather than just trying to improve people’s lives. It seems to me that this line of thought is “cheating,” that it is exploiting a loophole in the way utilitarianism models doing good, rather than actually doing good. Does this mean that it is a form of wireheading?
In every human endeavor, humans will shape their reality, either physically or mentally. They go to schools where their type of people go and live in neighborhoods where they feel comfortable based on a variety of commonalities. When their circumstances change, either for the better or the worse, they readjust their environment to fit with their new circumstances.
The human condition is inherently vulnerable to wireheading. A brief review of history is rich with examples of people attaining power and money who subsequently change their values to suit their own desires. The more influential and wealthy they become, enabling them to exist unfettered, the more they change their value system.
There are also people who simply isolate themselves and become increasingly absorbed in their own value system. Some amount of money is needed to do this, but not a great amount. The human brain is also very good at compartmentalizing value sets such that they can operate by two (or more) radically different value systems.
The challenge in AI is to create an intelligence that is not like ours and not prone to human weaknesses. We should not attempt to replicate human thinking, we need to build something better. Our direction should be to create an intelligence that includes the desirable components and leaves out the undesirable aspects.