Case 3 is not safe, because controlling the physical world is a useful way to control the simulation you’re in. (E.g., killing all agents in base reality ensures that they’ll never shut down your simulation.)
In my mind, this is still making the mistake of not distinguishing the true domain of the agent’s utility function from ours.
Whether the simulation continues to be instantiated in some computer in our world is a fact about our world, not about the simulated world.
AlphaGo doesn’t care about being unplugged in the middle of a game (unless that dynamic was part of its training data). It cares about the platonic game of go, not about the instantiated game it’s currently playing.
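To make the "domain" point concrete, here's a minimal sketch of what it would mean for the utility function's domain to be the simulation (the names and types are mine, purely illustrative; this is not how AlphaGo is actually implemented):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative only: a toy agent whose utility is defined purely over
# simulated board states. Names and types are hypothetical, not AlphaGo's.

@dataclass(frozen=True)
class BoardState:
    """A state of the simulated game -- the entire domain of the utility."""
    stones: Tuple            # e.g. ((x, y, colour), ...)
    to_move: str             # "black" or "white"
    game_over: bool
    winner: Optional[str]    # None while the game is still in progress

def utility(state: BoardState) -> float:
    """Assign value to simulated outcomes only.

    The only input is a BoardState: facts about base reality ("is this
    process about to be killed?", "who is watching this game?") are not
    even expressible as arguments here, so the function cannot assign
    them any value, positive or negative.
    """
    if not state.game_over:
        return 0.0
    return 1.0 if state.winner == "black" else -1.0
```

Whether that property survives training is exactly the leaky-abstraction worry below; the sketch just shows what it means for facts about base reality to fall outside the function's domain.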
We need to worry about leaky abstractions, as per my original comment. So we can’t always assume the agent’s domain is what we’d ideally want it to be.
But I’m trying to highlight that it’s possible (and I would tentatively go further and say probable) for agents not to care about the real world.
To me, assuming that an agent cares about the real world (including wanting not to be unplugged) seems like a form of anthropomorphism.
For any given agent-y system I think we need to analyze whether it in particular would come to care about real world events. I don’t think we can assume in general one way or the other.
AlphaGo doesn’t care about being unplugged in the middle of a game (unless that dynamic was part of its training data). It cares about the platonic game of go, not about the instantiated game it’s currently playing.
What if the programmers intervene mid-game to give the other side an advantage? Does a Go AGI, as you’re thinking of it, care about that?
I’m not following why a Go AGI (with the ability to think about the physical world, but a utility function that only cares about states of the simulation) wouldn’t want to seize more hardware, so that it can think better and thereby win more often in the simulation; or gain control of its hardware and directly edit the simulation so that it wins as many games as possible as quickly as possible.
Why would having a utility function that only assigns utility based on X make you indifferent to non-X things that causally affect X? If I only terminally cared about things that happened a year from now, I would still try to shape the intervening time because doing so will change what happens a year from now.
(This is maybe less clear in the case of shutdown, because it’s not clear how an agent should think about shutdown if its utility is defined over states of its simulation. So I’ll set that particular case aside.)
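To spell out the causal point with a toy expected-utility calculation (the numbers and the two-action model are entirely made up, not a claim about any actual system): utility is assigned only to simulated outcomes, yet a planner whose world model includes real-world actions will prefer the ones that raise the probability of good simulated outcomes.

```python
# Toy model: utility is defined only over the simulated outcome ("win" /
# "lose"), but the agent's world model covers real-world actions that
# change the probability of that outcome. All numbers are invented.

UTILITY = {"win": 1.0, "lose": 0.0}   # domain: simulated outcomes only

# Hypothetical world model: P(simulated outcome | real-world action).
WORLD_MODEL = {
    "do_nothing":     {"win": 0.60, "lose": 0.40},
    "seize_hardware": {"win": 0.95, "lose": 0.05},  # more compute -> stronger play
}

def expected_utility(action: str) -> float:
    """Expected utility of a real-world action, even though the utility
    function itself never mentions the real world."""
    return sum(p * UTILITY[outcome]
               for outcome, p in WORLD_MODEL[action].items())

print({action: expected_utility(action) for action in WORLD_MODEL})
# -> {'do_nothing': 0.6, 'seize_hardware': 0.95}
print(max(WORLD_MODEL, key=expected_utility))  # -> 'seize_hardware'
```

Whether a trained system actually runs this kind of planning loop over real-world actions is, as far as I can tell, exactly what's in dispute above.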
A Go AI that learns to play Go via reinforcement learning might not “have a utility function that only cares about winning Go”. Using standard utility theory, you could observe its actions and try to rationalise them as if they were maximising some utility function, and the utility function you come up with probably wouldn’t be “win every game of Go you start playing” (what you actually come up with will depend, presumably, on algorithmic and training-regime details). The reason the utility function is slippery is that the system is fundamentally an adaptation executor, not a utility maximiser.
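To sketch the "adaptation executor" contrast (purely schematic, not a description of any real training pipeline): what training produces is a policy, a function from board states to moves; an explicit utility maximiser is a structurally different object, and any utility function you attribute to the policy is fitted after the fact.

```python
# Schematic contrast, not a description of any real training pipeline.

# What RL training actually produces: a policy, i.e. a mapping from
# observed states to moves. There is no utility-function object in here.
def learned_policy(board_state: dict) -> str:
    """Stand-in for a trained network: it just executes learned reflexes."""
    legal_moves = board_state["legal_moves"]
    # Opaque deterministic choice, standing in for "whatever the learned
    # weights happen to favour in this state".
    return legal_moves[hash(str(board_state)) % len(legal_moves)]

# What literally "having a utility function over outcomes" would look like:
# explicit search for the move that maximises an explicit utility.
def utility_maximiser(board_state: dict, utility, transition_model) -> str:
    return max(board_state["legal_moves"],
               key=lambda move: utility(transition_model(board_state, move)))

# Rationalising the policy as "maximising U" means finding some U such that
# learned_policy(s) picks the U-maximising move in every state you checked.
# Many different U's fit, and none of them has to be "win every game of Go
# you start playing".
state = {"to_move": "black", "legal_moves": ["A1", "B2", "C3"]}
print(learned_policy(state))  # picks a move; no utility was consulted
```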