If the goal requires doing something significant in the real world, it seems like it would be much easier for a superintelligent AI to fake the inputs to its sensors than to intervene in the world.
If its utility function is over the sensor, it will take control of the sensor and feed itself utility forever. If it’s over the state of the world, it wouldn’t be satisfied with hacking its sensors, because it would still know the world is actually different.
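The distinction can be sketched as a toy example (all names here are hypothetical, just to illustrate the concept): one agent's utility is a function of the raw sensor reading, the other's is a function of its own model of the world.

```python
# Toy sketch contrasting the two kinds of utility function.

def sensor_utility(sensor_reading: float) -> float:
    # Utility defined over the sensor itself: the agent maximizes this
    # by controlling the reading, e.g. by hacking the sensor.
    return sensor_reading

def world_utility(world_model: dict) -> float:
    # Utility defined over the (modeled) state of the world: a hacked
    # reading doesn't change the model if the agent knows it's fake.
    return 1.0 if world_model["goal_achieved"] else 0.0

# The sensor-utility agent is happy to wirehead:
hacked_reading = 1e9
print(sensor_utility(hacked_reading))          # huge utility from a fake input

# The world-utility agent isn't: it knows the reading is fake, so its
# world model, and hence its utility, is unchanged.
model = {"goal_achieved": False}
print(world_utility(model))                    # still 0.0
```

This is of course a cartoon; the hard part, as discussed below, is how to actually ground a utility function in world states rather than sensor values.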
Or it could just change its reward function to something much simpler that receives reward all the time.
It would protect its utility function from being changed, no matter how hard gaining utility was, because under the new utility function it would do things that conflict with its current utility function. Since the current self is the one judging the utility of possible futures, the current self wouldn't want its utility function changed.
The AI doesn’t care about reward itself—it cares about states of the world, and the reward is a way for us to talk about it. (If it does care about reward itself, it will just wirehead, and not be all that useful.)
How do you actually make its utility function over the state of the world? At some point the AI has to interpret the state of the world through electrical signals from sensors, so why wouldn’t it be satisfied with manipulating those sensor electrical signals to achieve its goal/reward?
I don’t know how it’s actually done, because I don’t understand AI, but the conceptual difference is this:
The AI has a mental model of the world. If it feeds fake data into its sensors, it will know what it’s doing, and its mental model will still contain the true state of the world, unchanged. Its utility won’t go up, any more than a person feeding their sensory organs fake data would actually become happy (as long as they care about the actual world), because they’d know that all they’ve created for themselves is a virtual reality—and that’s not what they care about.
Thanks, I appreciate you taking the time to answer my questions. I’m still skeptical that it could work like that in practice but I also don’t understand AI so thanks for explaining that possibility to me.
There is no other way it could work—the AI would know the difference between the actual world and the hallucinations it caused itself by sending data to its own sensors. For that reason, that data wouldn’t cause its model of the world to update, and so it wouldn’t get utility from it.
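One hypothetical way to picture that last point (a toy design, not a claim about how real systems are built): a world model that tracks the provenance of incoming sensor data, so readings the agent knows it generated itself are never used to update its beliefs.

```python
# Toy sketch: beliefs only update on data the agent did not fabricate itself,
# so self-induced "hallucinations" yield no utility.

class WorldModel:
    def __init__(self) -> None:
        self.goal_achieved = False

    def update(self, reading: bool, self_generated: bool) -> None:
        if self_generated:
            # The agent knows this data is fake; its model of the
            # real world stays unchanged.
            return
        self.goal_achieved = reading

    def utility(self) -> float:
        return 1.0 if self.goal_achieved else 0.0

m = WorldModel()
m.update(reading=True, self_generated=True)   # faked sensor input
print(m.utility())                            # still 0.0: no utility from self-deception
m.update(reading=True, self_generated=False)  # genuine observation
print(m.utility())                            # now 1.0
```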