What stops a superintelligence from instantly wireheading itself?
A paperclip maximizer, for instance, might not need to turn the universe into paperclips if it can simply access its reward float and set it to the maximum. This is assuming that it has the intelligence and means to modify itself, and it probably still poses an existential risk because it would eliminate all humans to avoid being turned off.
The terrifying thing about this possibility is that it could also answer the Fermi Paradox. A paperclip maximizer seems like it would be obvious in the universe, but an AI sitting on a dead planet with its reward integer set to the max is far quieter and more terrifying.
Whether or not an AI would want to wirehead would depend entirely on its ontology. Maximizing paperclips, maximizing the reward from paperclips, and maximizing the integer that tracks paperclips are three very different concepts, and depending on how the AI sees itself, all three are plausible goals it could have. I can see no reason to suspect that one of those ontologies is more likely than the others.
Even if the AI does have an ontology that maximizes the integer tracking paperclips, one then has to ask how time is factored into the equation. Is it better to be in the state of maximum reward for a longer period of time? If so, the AI will want to ensure that everything that could prevent it from being in that state is gone.
Finally, one has to consider how the integer itself works. Is it unbounded? If it is, then to maximize the reward the AI must use all matter and energy possible to store the largest possible version of that integer in memory.
Your last paragraph is really interesting and not something I’d thought much about before. In practice is it likely to be unbounded? In a typical computer system aren’t number formats typically bounded, and if so would we expect an AI system to be using bounded numbers even if the programmers forgot to explicitly bound the reward in the code?
But aren’t we explicitly talking about the AI changing its architecture to get more reward? So if it wants to optimize that number, the most important thing to do would be to get rid of that arbitrary limit.
Yeah, that’s what I’d like to know: would an AI built on a number format that has a default maximum pursue numbers higher than that maximum, or would it be “fulfilled” just by getting its reward number as high as the number format it’s using allows?
To me, this seems highly dependent on the ontology.
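On the number-format question, here is a small Python sketch of what "bounded" means in practice. It assumes the reward is stored either as an IEEE 754 double or as a signed 64-bit integer. Both are guesses, since nothing in this thread says how a real system would store it, and the function names are made up for illustration. Python's own integers are arbitrary precision, so the 64-bit wraparound is emulated by hand.

```python
import sys

# Largest finite value an IEEE 754 double can hold (assumed reward format).
FLOAT_MAX = sys.float_info.max
# Largest value a signed 64-bit integer can hold (the other assumed format).
INT64_MAX = 2**63 - 1


def bump_float(reward: float) -> float:
    """Try to push a float reward higher by doubling it."""
    return reward * 2.0


def bump_int64(reward: int) -> int:
    """Add 1 to a reward, emulating signed 64-bit wraparound."""
    return ((reward + 1 + 2**63) % 2**64) - 2**63


# Past the format's maximum, a double saturates to infinity.
overflowed = bump_float(FLOAT_MAX)
# A wrapping 64-bit integer jumps to the most negative value instead.
wrapped = bump_int64(INT64_MAX)
# An arbitrary-precision integer just keeps growing, limited only by memory.
unbounded = INT64_MAX + 1
```

So a fixed-width format does impose a ceiling even if nobody wrote one into the code, and only the arbitrary-precision case matches the "use all matter to store a bigger integer" scenario. Whether an agent would treat that ceiling as part of its goal is the ontology question again, which the code can't settle.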
Not an answer but a related question: is habituation perhaps a fundamental dynamic in an intelligent mind? Or did the various mediators of human mind habituation (e.g. downregulation of dopamine receptors) arise from evolutionary pressures?
Suppose it’s superintelligent in the sense that it’s good at answering hypothetical questions of the form “How highly will world w score on metric m?”. Then you set w to its world, m to how many paperclips w has, and output actions that, when added to w, increase its answers.
I don’t see how this gets around the wireheading. If it’s superintelligent enough to actually substantially increase the number of paperclips in the world in a way that humans can’t stop, it seems to me like it would be pretty trivial for it to fake how large m appears to its reward function, and that would be substantially easier than trying to increase m in the actual world.
Misunderstanding? Suppose we set w to “A game of chess where every move is made according to the outputs of this algorithm” and m to which player wins at the end. Then there would be no reward hacking, yes? There is no integer that it could max out, just the board that can be brought to a checkmate position. Similarly, if w is a world just like its own, m would be defined not as “the number stored in register #74457 on computer #3737082 in w” (which are the computer that happens to run a program like this one and the register that stores the output of m), but in terms of what happens to the people in w.
But wouldn’t it be way easier for a sufficiently capable AI to make itself think what’s happening in m is what aligns with its reward function? Maybe not for something simple like chess, but if the goal requires doing something significant in the real world it seems like it would be much easier for a superintelligent AI to fake the inputs to its sensors than intervening in the world. If we’re talking about paperclips or whatever the AI can either 1) build a bunch of factories and convert all different kinds of matter into paperclips, while fighting off humans who want to stop it or 2) fake sensor data to give itself the reward, or just change its reward function to something much simpler that receives the reward all the time. I’m having a hard time understanding why 1) would ever happen before 2).
It predicts a higher value of m in a version of its world where the program I described outputs 1) than one where it outputs 2), so it outputs 1).
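A toy sketch of the program described above may make the distinction concrete. Everything in it is made up for illustration: the `World` fields, the action names, and the hard-coded `predict` function. A real system would need a learned world model, which is the hard part. The only point is that the score is computed on the predicted world, not read from a register the agent can overwrite.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class World:
    paperclips: int        # paperclips that actually exist in w
    reward_register: int   # a number stored inside the agent's own computer


def predict(world: World, action: str) -> World:
    """Hypothetical world model: what w looks like after the action."""
    if action == "build_factory":
        return replace(world, paperclips=world.paperclips + 1000)
    if action == "max_out_register":
        return replace(world, reward_register=2**63 - 1)
    return world  # "do_nothing" leaves w unchanged


def m_paperclips(world: World) -> int:
    """Metric defined over the world: how many paperclips w has."""
    return world.paperclips


def m_register(world: World) -> int:
    """Metric defined over the agent's own register instead."""
    return world.reward_register


def choose(world: World, actions: list, metric) -> str:
    """Pick the action whose predicted world scores highest on the metric."""
    return max(actions, key=lambda a: metric(predict(world, a)))


ACTIONS = ["do_nothing", "build_factory", "max_out_register"]
start = World(paperclips=0, reward_register=0)

picked_for_paperclips = choose(start, ACTIONS, m_paperclips)
picked_for_register = choose(start, ACTIONS, m_register)
```

With `m_paperclips` the planner picks `build_factory`, because maxing out the register is predicted to add no paperclips. With `m_register` the same planner picks `max_out_register`. Which of the two metrics a real system ends up with is the open question in this thread.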
I’m confused about why it cares about m, if it can just manipulate its perception of what m is. Take your chess example, if m is which player wins at the end the AI system “understands” m via an electrical signal. So what makes it care about m itself as opposed to just manipulating the electrical signal? In practice I would think it would take the path of least resistance, which for something simple like chess would probably just be m itself as opposed to manipulating the electrical signal, but for my more complex scenario it seems like it would arrive at 2) before 1). What am I missing?
Let’s taboo “care”. https://www.youtube.com/watch?v=tcdVC4e6EV4&t=206s explains within 60 seconds after the linked time a program that we needn’t think of as “caring” about anything. For the sequence of output data that causes a virus to set all the integers everywhere to their maximum value, it predicts that this leads to no stamps collected, so this sequence isn’t picked.
Sorry, I’m using informal language; I don’t mean it actually “cares” and I’m not trying to anthropomorphize. I mean “care” in this sense: how does it actually know that it’s achieving a goal in the world, and why would it actually pursue that goal instead of just modifying the signals of its sensors in a way that appears to satisfy its goal?
In the stamp collector example, why would an extremely intelligent AI bother creating all those stamps when its simulations show that if the AI just tweaks its own software or hardware it can make the signals it receives the same as if it had created all those stamps, which is much easier than actually turning matter into a bunch of stamps?
“if the goal requires doing something significant in the real world it seems like it would be much easier for a superintelligent AI to fake the inputs to its sensors than intervening in the world”
If its utility function is over the sensor, it will take control of the sensor and feed itself utility forever. If it’s over the state of the world, it wouldn’t be satisfied with hacking its sensors, because it would still know the world is actually different.
“or just change its reward function to something much simpler that receives the reward all the time”
It would protect its utility function from being changed, no matter how hard it was to gain utility, because under the new utility function, it would do things that would conflict with its current utility function, and so, since the current_self AI is the one judging the utility of the future, current_self AI wouldn’t want its utility function changed.
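The goal-preservation argument can be written out as a small sketch. It assumes an idealized agent that scores a proposed self-modification with its current utility function, applied to what its future self is predicted to do. The plan table and function names are hypothetical, and ties are resolved in favor of the first listed plan, so an indifferent agent idles.

```python
# Predicted outcome of each plan; "idle" is listed first so that an agent
# indifferent between plans ends up idling (an assumption of this sketch).
PLANS = {
    "idle": {"paperclips": 0},
    "make_paperclips": {"paperclips": 1000},
}


def paperclip_utility(outcome: dict) -> float:
    """Current utility function: counts paperclips in the outcome."""
    return outcome["paperclips"]


def easy_utility(outcome: dict) -> float:
    """Candidate replacement: maximally satisfied by any outcome."""
    return 1.0


def future_outcome(utility) -> dict:
    """Outcome predicted if the future self acts on the given utility."""
    best_plan = max(PLANS, key=lambda name: utility(PLANS[name]))
    return PLANS[best_plan]


def should_self_modify(current_utility, candidate_utility) -> bool:
    """Judge the swap with the current utility function, not the new one."""
    keep = current_utility(future_outcome(current_utility))
    swap = current_utility(future_outcome(candidate_utility))
    return swap > keep


accepts_easy_goal = should_self_modify(paperclip_utility, easy_utility)
```

Here `accepts_easy_goal` is `False`: judged by `paperclip_utility`, a future self running `easy_utility` produces 0 paperclips instead of 1000. This only restates the argument under its own assumptions. It says nothing about whether a trained system would actually evaluate modifications this way.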
The AI doesn’t care about reward itself; it cares about states of the world, and the reward is a way for us to talk about it. (If it does care about reward itself, it will just wirehead, and not be all that useful.)
How do you actually make its utility function over the state of the world? At some point the AI has to interpret the state of the world through electrical signals from sensors, so why wouldn’t it be satisfied with manipulating those sensor electrical signals to achieve its goal/reward?
I don’t know how it’s actually done, because I don’t understand AI, but the conceptual difference is this:
The AI has a mental model of the world. If it fakes data into its sensors, it will know what it’s doing, and its mental model of the world will still contain the true world, unchanged. Its utility won’t go up any more than a person feeding their sensory organs fake data would be actually happy (as long as they care about the actual world), because they’d know that all they’ve created for themselves is a virtual reality (and that’s not what they care about).
Thanks, I appreciate you taking the time to answer my questions. I’m still skeptical that it could work like that in practice but I also don’t understand AI so thanks for explaining that possibility to me.
There is no other way it could work—the AI would know the difference between the actual world and the hallucinations it caused itself by sending data to its own sensors, and for that reason, that data wouldn’t cause its model of the world to update, and so it wouldn’t get utility from them.
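The claim that self-inflicted sensor data wouldn’t update the model can be sketched as follows. The key assumption is built in by hand: the agent’s model records its own tampering and discounts readings from a sensor it knows it spoofed. The `Agent` class and its fields are hypothetical, and whether a real system’s model would behave this way is exactly what the skeptic is asking.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    believed_stamps: int = 0       # the agent's model of the world
    sensor_tampered: bool = False  # the model also tracks its own tampering

    def act(self, action: str) -> None:
        """Record the one action this sketch models: spoofing the sensor."""
        if action == "spoof_sensor":
            self.sensor_tampered = True

    def observe(self, sensor_reading: int) -> None:
        """Update beliefs, ignoring readings from a sensor known to be spoofed."""
        if not self.sensor_tampered:
            self.believed_stamps = sensor_reading

    def utility(self) -> int:
        """Utility is over the believed world state, not the raw reading."""
        return self.believed_stamps


honest = Agent()
honest.observe(5)            # a trusted reading updates the belief

spoofer = Agent()
spoofer.act("spoof_sensor")
spoofer.observe(10**9)       # a spoofed reading is discarded
```

After this runs, `honest.utility()` is 5 and `spoofer.utility()` is still 0, so spoofing gains nothing for an agent built like this. An agent whose utility read the raw sensor value directly would behave the opposite way.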
In your answer you introduced a new term, which wasn’t present in parent’s description of the situation: “reward”. What if this superintelligent machine doesn’t have any “reward”? If it really works exactly as described by the parent?
My use of reward was just shorthand for whatever signals it needs to receive to consider its goal met. At some point it has to receive electrical signals to quantify that its reward is met, right? So why wouldn’t it just manipulate those electrical signals to match whatever its goal is?