I don’t see how this gets around the wireheading. If it’s superintelligent enough to actually substantially increase the number of paperclips in the world in a way that humans can’t stop, it seems to me like it would be pretty trivial for it to fake how large m appears to its reward function, and that would be substantially easier than trying to increase m in the actual world.
Misunderstanding? Suppose we set w to “A game of chess where every move is made according to the outputs of this algorithm” and m to which player wins at the end. Then there would be no reward hacking, yes? There is no integer that it could max out, just the board that can be brought to a checkmate position. Similarly, if w is a world just like its own, m would be defined not as “the number stored in register #74457 on computer #3737082 in w” (which are the computer that happens to run a program like this one and the register that stores the output of m), but in terms of what happens to the people in w.
But wouldn’t it be way easier for a sufficiently capable AI to make itself think what’s happening in m is what aligns with its reward function? Maybe not for something simple like chess, but if the goal requires doing something significant in the real world it seems like it would be much easier for a superintelligent AI to fake the inputs to its sensors than intervening in the world. If we’re talking about paperclips or whatever the AI can either 1) build a bunch of factories and convert all different kinds of matter into paperclips, while fighting off humans who want to stop it or 2) fake sensor data to give itself the reward, or just change its reward function to something much simpler that receives the reward all the time. I’m having a hard time understanding why 1) would ever happen before 2).
I’m confused about why it cares about m, if it can just manipulate its perception of what m is. Take your chess example, if m is which player wins at the end the AI system “understands” m via an electrical signal. So what makes it care about m itself as opposed to just manipulating the electrical signal? In practice I would think it would take the path of least resistance, which for something simple like chess would probably just be m itself as opposed to manipulating the electrical signal, but for my more complex scenario it seems like it would arrive at 2) before 1). What am I missing?
Let’s taboo “care”. https://www.youtube.com/watch?v=tcdVC4e6EV4&t=206s explains within 60 seconds after the linked time a program that we needn’t think of as “caring” about anything. For the sequence of output data that causes a virus to set all the integers everywhere to their maximum value, it predicts that this leads to no stamps collected, so this sequence isn’t picked.
Sorry I’m using informal language, I don’t mean it actually “cares” and I’m not trying to anthropomorphize. I mean care in the sense that how does it actually know that its achieving a goal in the world and why would it actually pursue that goal instead of just modifying the signals of its sensors in a way that appears to satisfy its goal.
In the stamp collector example, why would an extremely intelligent AI bother creating all those stamps when its simulations show that if the AI just tweaks its own software or hardware it can make the signals it receives the same as if it had created all those stamps, which is much easier than actually turning matter into a bunch of stamps.
if the goal requires doing something significant in the real world it seems like it would be much easier for a superintelligent AI to fake the inputs to its sensors than intervening in the world
If its utility function is over the sensor, it will take control of the sensor and feed itself utility forever. If it’s over the state of the world, it wouldn’t be satisfied with hacking its sensors, because it would still know the world is actually different.
or just change its reward function to something much simpler that receives the reward all the time
It would protect its utility function from being changed, no matter how hard it was to gain utility, because under the new utility function, it would do things that would conflict with its current utility function, and so, since the current_self AI is the one judging the utility of the future, current_self AI wouldn’t want its utility function changed.
The AI doesn’t care about reward itself—it cares about states of the world, and the reward is a way for us to talk about it. (If it does care about reward itself, it will just hardwireheadwire wirehead, and not be all that useful.)
How do you actually make its utility function over the state of the world? At some point the AI has to interpret the state of the world through electrical signals from sensors, so why wouldn’t it be satisfied with manipulating those sensor electrical signals to achieve its goal/reward?
I don’t know how it’s actually done, because I don’t understand AI, but the conceptual difference is this:
The AI has a mental model of the world. If it fakes data into its sensors, it will know what it’s doing, and its mental model of the world will contain the true model of the world still being the same. Its utility won’t go up any more than a person feeding their sensory organs fake data would be actually happy (as long as they care about the actual world), because they’d know that all they’ve created by that for themselves is a virtual reality (and that’s not what they care about).
Thanks, I appreciate you taking the time to answer my questions. I’m still skeptical that it could work like that in practice but I also don’t understand AI so thanks for explaining that possibility to me.
There is no other way it could work—the AI would know the difference between the actual world and the hallucinations it caused itself by sending data to its own sensors, and for that reason, that data wouldn’t cause its model of the world to update, and so it wouldn’t get utility from them.
In your answer you introduced a new term, which wasn’t present in parent’s description of the situation: “reward”. What if this superintelligent machine doesn’t have any “reward”? If it really works exactly as described by the parent?
My use of reward was just shorthand for whatever signals it needs to receive to consider its goal met. At some point it has to receive electrical signals to quantify that its reward is met, right? So why wouldn’t it just manipulate those electrical signals to match whatever its goal is?
I don’t see how this gets around the wireheading. If it’s superintelligent enough to actually substantially increase the number of paperclips in the world in a way that humans can’t stop, it seems to me like it would be pretty trivial for it to fake how large m appears to its reward function, and that would be substantially easier than trying to increase m in the actual world.
Misunderstanding? Suppose we set w to “A game of chess where every move is made according to the outputs of this algorithm” and m to which player wins at the end. Then there would be no reward hacking, yes? There is no integer that it could max out, just the board that can be brought to a checkmate position. Similarly, if w is a world just like its own, m would be defined not as “the number stored in register #74457 on computer #3737082 in w” (which are the computer that happens to run a program like this one and the register that stores the output of m), but in terms of what happens to the people in w.
But wouldn’t it be way easier for a sufficiently capable AI to make itself think what’s happening in m is what aligns with its reward function? Maybe not for something simple like chess, but if the goal requires doing something significant in the real world it seems like it would be much easier for a superintelligent AI to fake the inputs to its sensors than intervening in the world. If we’re talking about paperclips or whatever the AI can either 1) build a bunch of factories and convert all different kinds of matter into paperclips, while fighting off humans who want to stop it or 2) fake sensor data to give itself the reward, or just change its reward function to something much simpler that receives the reward all the time. I’m having a hard time understanding why 1) would ever happen before 2).
It predicts a higher value of m in a version of its world where the program I described outputs 1) than one where it outputs 2), so it outputs 1).
I’m confused about why it cares about m, if it can just manipulate its perception of what m is. Take your chess example, if m is which player wins at the end the AI system “understands” m via an electrical signal. So what makes it care about m itself as opposed to just manipulating the electrical signal? In practice I would think it would take the path of least resistance, which for something simple like chess would probably just be m itself as opposed to manipulating the electrical signal, but for my more complex scenario it seems like it would arrive at 2) before 1). What am I missing?
Let’s taboo “care”. https://www.youtube.com/watch?v=tcdVC4e6EV4&t=206s explains within 60 seconds after the linked time a program that we needn’t think of as “caring” about anything. For the sequence of output data that causes a virus to set all the integers everywhere to their maximum value, it predicts that this leads to no stamps collected, so this sequence isn’t picked.
Sorry I’m using informal language, I don’t mean it actually “cares” and I’m not trying to anthropomorphize. I mean care in the sense that how does it actually know that its achieving a goal in the world and why would it actually pursue that goal instead of just modifying the signals of its sensors in a way that appears to satisfy its goal.
In the stamp collector example, why would an extremely intelligent AI bother creating all those stamps when its simulations show that if the AI just tweaks its own software or hardware it can make the signals it receives the same as if it had created all those stamps, which is much easier than actually turning matter into a bunch of stamps.
If its utility function is over the sensor, it will take control of the sensor and feed itself utility forever. If it’s over the state of the world, it wouldn’t be satisfied with hacking its sensors, because it would still know the world is actually different.
It would protect its utility function from being changed, no matter how hard it was to gain utility, because under the new utility function, it would do things that would conflict with its current utility function, and so, since the current_self AI is the one judging the utility of the future, current_self AI wouldn’t want its utility function changed.
The AI doesn’t care about reward itself—it cares about states of the world, and the reward is a way for us to talk about it. (If it does care about reward itself, it will just
hardwireheadwirewirehead, and not be all that useful.)How do you actually make its utility function over the state of the world? At some point the AI has to interpret the state of the world through electrical signals from sensors, so why wouldn’t it be satisfied with manipulating those sensor electrical signals to achieve its goal/reward?
I don’t know how it’s actually done, because I don’t understand AI, but the conceptual difference is this:
The AI has a mental model of the world. If it fakes data into its sensors, it will know what it’s doing, and its mental model of the world will contain the true
model of theworld still being the same. Its utility won’t go up any more than a person feeding their sensory organs fake data would be actually happy (as long as they care about the actual world), because they’d know that all they’ve created by that for themselves is a virtual reality (and that’s not what they care about).Thanks, I appreciate you taking the time to answer my questions. I’m still skeptical that it could work like that in practice but I also don’t understand AI so thanks for explaining that possibility to me.
There is no other way it could work—the AI would know the difference between the actual world and the hallucinations it caused itself by sending data to its own sensors, and for that reason, that data wouldn’t cause its model of the world to update, and so it wouldn’t get utility from them.
In your answer you introduced a new term, which wasn’t present in parent’s description of the situation: “reward”. What if this superintelligent machine doesn’t have any “reward”? If it really works exactly as described by the parent?
My use of reward was just shorthand for whatever signals it needs to receive to consider its goal met. At some point it has to receive electrical signals to quantify that its reward is met, right? So why wouldn’t it just manipulate those electrical signals to match whatever its goal is?