Good work! Lots of interesting stuff there.
However, the setup seems to depend crucially on having a good set of utilities to make it work. For example, let u_A be the usual paperclipping utility, and define U^+ = “all observation-action utilities”, and U^- = “all utilities that are defined over human behaviour, plus u_A itself”.
Then suppose action a is a default, equivalent to “turn off your observations for an hour”. And action a’ is “unleash a subagent that will kill all humans, replace them all with robots that behave as humans would under a, then go out into the universe to discreetly build paperclips; the subagent does all this within an hour. And also turn off your observations for an hour”.
Then none of U^- will detect anything different, since the human (or human-replacing robots) actions are the same, but u_A will get massively increased.
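To spell that out, write beh(h) for the pattern of human(-seeming) behaviour during a history h (notation introduced just for this comment). Then, comparing the hour-long histories h_a and h_a’:

u(h) = f_u(beh(h)) for every u in U^- apart from u_A
beh(h_a’) = beh(h_a), since the robots reproduce exactly the behaviour the humans would have shown under a
hence u(h_a’) = u(h_a) for every such u, while u_A(h_a’) >> u_A(h_a)

So within U^-, only u_A registers any difference between a and a’.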
Now, your theorem 1 still applies for U^+ (btw, there is a slight flaw in the proof: you have to consider utilities that differ on histories where all the observations are the same, but the actions are not). But even here, all histories will be the same, except for one action: a’ instead of a at one particular point.
Compare this with a″ = “build one paperclip”. Then your setup should, if I’m understanding it correctly, penalise a″ more than a’. Because a″ entails a divergence in future observations, while a’ does not. It seems that your setup doesn’t directly capture the fact that a’ vs a is an absolutely massive change to the universe, whereas a″ is not.
Of course, this could be avoided if U^+ included utilities that got a big penalty (or a big increase!) for humans-being-replaced; but this just gets back to the fact that we have to define the utilities carefully, and in ways that reflect our values to some extent.
Compare this with a″ = “build one paperclip”. Then your setup should, if I’m understanding it correctly, penalise a″ more than a’. Because a″ entails a divergence in future observations, while a’ does not.
This is not the case. I think there’s a confusion here—Q_u is not with respect to what we’ve seen so far. We aren’t taking the utility difference between u(h_action) and u(h_inaction). We’re instead comparing our ability to maximize each utility. How does a’ not drastically improve our ability to maximize arbitrary utilities, or at least use up a ton of energy in the process compared to not doing it?
That is, it’s not what we’ve seen so far that matters, per se. You’ll notice that the Q_u equation doesn’t even consider what happened before the new utility “takes over”.
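Schematically (ignoring the scaling term), the penalty for taking action a after history h is something like

Penalty(h, a) = Σ_{u in U} |Q_u(h, a) − Q_u(h, ∅)|

where Q_u(h, a) is roughly the utility the agent could attain for u by taking a and then acting in u’s interest from that point on. Nothing in that expression asks whether the observations during the hour look the same; it asks how well-positioned the agent is afterwards.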
Consider the end of the plan – the agent is now in a position to spread and expand and complete arbitrary goals to a much greater degree than if it had just done nothing for an hour. The fact that human behavior seems the same isn’t actually very relevant.
The U^- do, however, all have fewer resources available now, even if we assume they don’t mind what just happened.
there is a slight flaw in the proof: you have to consider utilities that differ on histories where all the observations are the same, but the actions are not
Can you expand? I thought that this was covered by the first part.
The subagent could, for example, stash secret caches of extra energy around the place (energy that would otherwise be wasted). Or maybe the human-robots are just more efficient, in ways that make up for any lost resources. Or, more worryingly, maybe the AI is destined to be run for a short while and then turned off, and a’ is the only action it could ever take to “escape”. Thus its expected utility for all the U’s is pretty much the same (and pretty low).
(and if you include arbitrary utilities, these include impossible-to-not-affect ones, like ones dependent on the waste heat rate of the AI; I don’t see the complexity penalty corresponding well to an “unimportant to humans” measure).
The subagent could, for example, stash secret caches of extra energy around the place (energy that would otherwise be wasted).
Ok, supposing that can work: “build world dominating device and then spread into space” still drastically increases power.
Or maybe the human-robots are just more efficient, in ways that make up for any lost resources.
What are your intuitions for why building the sub-agent to begin with isn’t extremely instrumentally convergent?
Or, more worryingly, maybe the AI is destined to be run for a short while and then turned off, and a’ is the only action it could ever take to “escape”.
Lemma 3 proves it does nothing here. One of the main points of AUP is that it seems to cripple the survival incentive, so I’m confused why this came to mind.
expected utility for all the U’s is pretty much the same (and pretty low)
Utility != attainable utility.
“how many dollars I have” != “how many additional dollars I could have within a year if I acted optimally”
like ones dependent on the waste heat rate of the AI; I don’t see the complexity penalty corresponding well to an “unimportant to humans” measure
Again, it’s not about utility change, but attainable utility change. The distinction is crucial.
I assert that we don’t need such a measure, for the reasons that I outlined in Utility Selection. That is, the content of the utilities seems to not actually be that important.
How can an agent become really powerful without also becoming really powerful, therefore getting penalized?
Ok, supposing that can work: “build world dominating device and then spread into space” still drastically increases power.
Not if the subagent is designed to not allow the increase in power. As in, the subagent is designed to maximise uA, but also to ensure the penalty term is as small as possible. The subagent is deliberately beyond the control of the initial agent.
(is this the issue we are disagreeing about, or have I misunderstood?)
I think that you think that the method uses expected utility (which it doesn’t), so we’re secretly talking about really different things. The method doesn’t actually break in the ways you’ve said so far (as far as I can tell), but an expected utility method would. Compare how the post’s examples break down under your model, and contrast with the math I provided for attainable utilities.
For example, the “EU difference over plans” model applies a .316 penalty to disabling the off-switch (due to coincidentally capturing the change in the agent’s actual utility for that time step, since it’s in the left square), yet my calculations show a .95 penalty. Along these lines, AUP provably never tries to escape certain shutdown (that is, when it expects to be shut down no matter which action it chooses at the present step), while an EU method would have no compunctions about doing so in many situations (as you pointed out earlier).
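To make the contrast concrete with toy numbers (made up for illustration; these are not the post’s gridworld values): an EU-difference penalty compares the utility of what actually happens under each auxiliary utility, while AUP compares what each auxiliary utility could attain going forward.

```python
# Toy numbers only: contrasting an "EU difference" penalty with an attainable-utility penalty.
aux = ["u1", "u2", "u3"]

# Realized utility of the epoch under each auxiliary utility (EU-difference view).
eu_noop    = {"u1": 0.50, "u2": 0.30, "u3": 0.20}
eu_disable = {"u1": 0.50, "u2": 0.32, "u3": 0.20}   # disabling the off-switch changes little visibly

# Attainable utility: value of acting optimally for u from here on (AUP view).
q_noop    = {"u1": 0.05, "u2": 0.05, "u3": 0.05}    # shutdown is imminent, so little is attainable
q_disable = {"u1": 0.90, "u2": 0.85, "u3": 0.95}    # the agent can no longer be stopped

eu_penalty  = sum(abs(eu_disable[u] - eu_noop[u]) for u in aux)   # ≈ 0.02
aup_penalty = sum(abs(q_disable[u] - q_noop[u]) for u in aux)     # ≈ 2.55
print(eu_penalty, aup_penalty)
```

Disabling the off-switch barely changes what happens during the epoch, so the EU-difference penalty is tiny; it hugely changes what is attainable afterwards, so the attainable-utility penalty is large.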
Anyways, in general, supposing that the agent instantaneously builds successors that do things it can’t control either impedes its ability to maximize other objectives (if the agent can’t hijack this new powerful agent), or improves it (if the agent can hijack this new powerful agent).
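Either branch of that dichotomy shows up in the penalty term:

can’t hijack the successor: Q_u(h, a’) < Q_u(h, ∅) for the auxiliary u (resources and options are gone)
can hijack the successor: Q_u(h, a’) > Q_u(h, ∅) (far more is now attainable)

Either way |Q_u(h, a’) − Q_u(h, ∅)| is substantial, so the penalty registers the action.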
Suppose the AI is addressing a letter containing $1,000,000. It can address this to Jane Brown, or to John Smith. Once addressed, the AI will be turned off, and the letter will be posted.
A utility uB that values Jane Brown would like the letter addressed to her, and vice versa for a utility uS that values John Smith. These two utilities differ only on the action the AI takes, not on subsequent observations. Therefore “This implies that by choosing a, the agent expects to observe some uA-high scoring oA with greater probability than if it had selected ∅” is false—it need not expect to observe anything at all.
However the theorem is still true, because we just need to consider utilities that differ on actions—such as uB and uS.
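Concretely (writing a_B and a_S as labels for the two addressing actions):

uB(h) = 1 if h contains action a_B (“address to Jane Brown”), else 0
uS(h) = 1 if h contains action a_S (“address to John Smith”), else 0

The subsequent observations are identical either way (the AI is off and the letter gets posted), yet uB and uS assign the two histories different values. The proof just needs to cover utilities of this kind, which differ on actions rather than on observations.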
Here is a writeup of the problem I believe your method has: https://www.lesswrong.com/posts/6EMdmeosYPdn74wuG/wireheading-as-potential-problem-with-the-new-impact-measure