The subagent could, for example, secrete caches of extra energy around the place (energy that would otherwise be wasted). Or maybe the human-robots are just more efficient, in ways that make up for any lost resources. Or, more worryingly, maybe the AI is destined to be run for a short while and then turned off, and a’ is the only action it could ever take to “escape”. Thus its expected utility is pretty much the same (and pretty low) for all the U’s.
(and if you include arbitrary utilities, these include impossible-to-not-affect ones, like ones dependent on the waste heat rate of the AI; I don’t see the complexity penalty corresponding well to an “unimportant to humans” measure).
The subagent could, for example, secrete caches of extra energy around the place (energy that would otherwise be wasted).
Ok, supposing that can work: “build a world-dominating device and then spread into space” still drastically increases power.
Or maybe the human-robots are just more efficient, in ways that make up for any lost resources.
What are your intuitions for why building the subagent to begin with isn’t extremely instrumentally convergent?
Or, more worryingly, maybe the AI is destined to be run for a short while and then turned off, and a’ is the only action it could ever take to “escape”.
Lemma 3 proves it does nothing here. One of the main points of AUP is that it seems to cripple the survival incentive, so I’m confused why this came to mind.
its expected utility is pretty much the same (and pretty low) for all the U’s.
Utility != attainable utility.
“how many dollars I have” != “how many additional dollars I could have within a year if I acted optimally”
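To make the dollars analogy concrete, here’s a toy sketch (my own illustration, not the post’s formalism; the “work”/“quit” actions and the payoffs are made up):

```python
# u(state): dollars currently held right now.
# attainable_utility(state, horizon): the most dollars reachable within `horizon`
# steps if the agent acts optimally from `state`.
# Toy numbers only -- the "work"/"quit" actions and the $100 payoff are invented.

def attainable_utility(state, horizon):
    dollars, employed = state
    if horizon == 0:
        return dollars
    options = [attainable_utility((dollars, employed), horizon - 1)]             # idle
    if employed:
        options.append(attainable_utility((dollars + 100, True), horizon - 1))   # work
        options.append(attainable_utility((dollars, False), horizon - 1))        # quit
    return max(options)

start = (1000, True)        # $1000 in hand, employed
after_quit = (1000, False)  # same $1000 in hand, but quit

# Utility is unchanged by quitting: 1000 vs 1000.
print(start[0], after_quit[0])
# Attainable utility collapses: 2200 vs 1000 over a 12-step horizon.
print(attainable_utility(start, 12), attainable_utility(after_quit, 12))
```

Quitting leaves the agent’s current utility untouched while gutting what it could attain, and it’s the latter difference that the penalty looks at.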
like ones dependent on the waste heat rate of the AI; I don’t see the complexity penalty corresponding well to an “unimportant to humans” measure
Again, it’s not about utility change, but attainable utility change. The distinction is crucial.
I assert that we don’t need such a measure, for the reasons I outlined in Utility Selection. That is, the content of the utilities doesn’t actually seem to be that important.
How can an agent become really powerful without also becoming really powerful, and therefore getting penalized?
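Schematically (simplifying the post and omitting the scaling term), write $Q^*_u(s,a)$ for how much of auxiliary utility $u$ the agent could attain by taking action $a$ in state $s$ and then acting optimally. The penalty compares that against inaction:

$$\text{Penalty}(s,a) \;=\; \sum_{u \in \mathcal{U}} \bigl|\,Q^*_u(s,a) - Q^*_u(s,\varnothing)\,\bigr|.$$

Anything that makes the agent drastically more (or less) able to achieve the auxiliary goals moves these $Q^*_u$ terms, whatever the content of the $u$’s.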
Ok, supposing that can work: “build a world-dominating device and then spread into space” still drastically increases power.
Not if the subagent is designed not to allow the increase in power. As in, the subagent is designed to maximise u_A, but also to ensure the penalty term is as small as possible. The subagent is deliberately beyond the control of the initial agent.
(is this the issue we are disagreeing about, or have I misunderstood?)
I think that you think that the method uses expected utility (which it doesn’t), so we’re secretly talking about really different things. The method doesn’t actually break in the ways you’ve said so far (as far as I can tell), but an expected utility method would. Compare how the post’s examples break down under your model, and contrast with the math I provided for attainable utilities.
For example, the “EU difference over plans” model applies a .316 penalty to disabling the off-switch (due to coincidentally capturing the change in the agent’s actual utility for that time step, since it’s in the left square), yet my calculations show a .95 penalty. Along these lines, AUP provably never tries to escape certain shutdown (that is, when it expects to be shut down no matter which action it chooses at the present step), while an EU method would have no compunctions about doing so in many situations (as you pointed out earlier).
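In symbols (as I understand your model, and again only schematically), the EU-difference penalty looks like

$$\text{Penalty}_{\text{EU}}(s,a) \;\approx\; \bigl|\,\mathbb{E}[u \mid a] - \mathbb{E}[u \mid \varnothing]\,\bigr|,$$

i.e. a change in how much utility the agent expects to actually collect, whereas the penalty above compares what the agent could attain afterward. That’s why disabling the off-switch can pick up a large attainable-utility penalty even when the realized-utility difference for that one step is small.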
Anyways, in general: suppose the agent instantaneously builds successors that do things it can’t control. This either impedes its ability to maximize other objectives (if the agent can’t hijack this new powerful agent), or improves it (if the agent can).
Here is a writeup of the problem I believe your method has: https://www.lesswrong.com/posts/6EMdmeosYPdn74wuG/wireheading-as-potential-problem-with-the-new-impact-measure