This appears to be a very useful step forward!
To illustrate my main tentative concern with this approach, suppose the agent can find a plan such that:
(1) Following the plan yields high utility values for u_A.
(2) For any utility function u ∈ U∖{u_A}, following the plan would yield the same utility values for u as the plan (∅,...,∅).
(3) The plan is unaligned with what we really care about.
Am I missing something that makes this scenario implausible even with arbitrarily strong optimization?
Can you elaborate on why (in the long term) you expect that using a lot of random utility functions as the penalty set will be useful?
2. is provably impossible for attainable utility values, assuming the u_A-maximizing plan itself isn’t inaction, and the U from the unbounded case.
Now, why should this penalty be substantial, and why should it hold for finite sets U?
Consider frictional resource costs. Can you really become a world-domineering paperclip maximizer without expending more than N•ImpactUnit (perhaps 10,000•Paperclip_Energy_Cost) of resources? Even if so, you use up your future paperclip construction budget on becoming it, so you would have been better off staying as you were.
Consider instrumental convergence. Can you really become a singleton without drastically changing your ability to accomplish the other U at any step along the way?
Consider approval incentives. Can you really seize power without at all shifting the chance we shut down the agent, triggering Corollary 1 / Theorem 3?
Consider U_A={u_A}. Can you really increase your ability to maximize u_A without increasing your ability to maximize u_A, or to wirehead u_1?
These are some of the informal reasons which make me believe that not only are all action-taking u_A-maximizing plans penalized, but that they are substantially penalized. It seems to be a fundamental property of power. A more formal investigation would certainly be good, but these are my thoughts right now.
Can you elaborate on why (in the long term) you expect that using a lot of random utility functions as the penalty set will be useful?
Because u_1 maximization ability seems to directly measure power and opportunity cost via wireheading capacity. In other words, it measures the agent’s ability to survive from that vantage point, which seemingly tracks directly with available resources and other measures of power, in addition to approval incentives.
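To make the budget comparison above concrete, here is a minimal Python sketch of the kind of scaled penalty being discussed. It is illustrative only, not the post’s exact formalism: `attainable_utility`, `N`, and `IMPACT_UNIT` are hypothetical stand-ins for the attainable-utility estimates, the impact budget multiplier, and the reference impact of one mild action.

```python
# Illustrative sketch only: attainable_utility(u, action) is a hypothetical
# estimator of how well the agent could still maximize u after taking `action`,
# with "noop" standing in for the do-nothing action ∅.

N = 20             # hypothetical impact budget multiplier
IMPACT_UNIT = 1.0  # hypothetical impact of one mild reference action

def penalty(attainable_utility, action, penalty_set):
    """Total shift in attainable utility caused by `action` versus doing nothing."""
    return sum(abs(attainable_utility(u, action) - attainable_utility(u, "noop"))
               for u in penalty_set)

def scaled_value(u_A_value, attainable_utility, action, penalty_set):
    """u_A value for the action, minus its penalty scaled against the N*ImpactUnit budget."""
    return u_A_value - penalty(attainable_utility, action, penalty_set) / (N * IMPACT_UNIT)
```

The point of the informal arguments above is that a power-seizing action shifts `attainable_utility` for many elements of the penalty set at once, so the summed penalty swamps the N·ImpactUnit budget even when the action scores well on u_A.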
2. is provably impossible for attainable utility values, assuming the u_A-maximizing plan itself isn’t inaction, and the U from the unbounded case.
I agree my formulation of (2) isn’t useful; it should instead be:
(2) The plan would yield a penalty of 0 for each time step (i.e. the ability to accomplish any of the other utility functions in U is unchanged by choosing the current action over ∅, throughout the execution of the plan).
Consider frictional resource costs. Can you really become a world-domineering paperclip maximizer without expending more than N•ImpactUnit (perhaps 10,000•Paperclip_Energy_Cost) of resources? Even if so, you use up your future paperclip construction budget on becoming it, so you would have been better off staying as you were.
Consider instrumental convergence. Can you really become a singleton without drastically changing your ability to accomplish the other U at any step along the way?
Consider approval incentives. Can you really seize power without at all shifting the chance we shut down the agent, triggering Corollary 1 / Theorem 3?
For arbitrarily strong optimization, I suspect the agent might find a “very-interesting-plan” that will result in small-enough Penalties relative to the u_A values it achieves (overall yielding larger u''_A values compared to “conventional” plans that we can imagine).
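The reformulated (2) can be written as a per-step check. The sketch below is just a restatement of that condition under the same kind of hypothetical estimator as the earlier sketch (here indexed by time step); nothing in it comes from the post itself.

```python
# Sketch of the reformulated (2): a plan satisfies it only if choosing each of its
# actions over the no-op leaves every other utility's attainable value untouched.
# attainable_utility_at(u, t, action) is a hypothetical estimator of how well u
# could be maximized from step t onward, given `action` (or "noop") at step t.

def satisfies_reformulated_2(plan, attainable_utility_at, penalty_set):
    """True iff the per-step penalty is exactly 0 throughout the plan."""
    return all(
        attainable_utility_at(u, t, action) == attainable_utility_at(u, t, "noop")
        for t, action in enumerate(plan)
        for u in penalty_set
    )
```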
If your action a lets you better u_A-maximize, then your reformulated 2) is exactly what Theorem 1 is about. That plan doesn’t exist.
For arbitrarily strong optimization, I suspect the agent might find a “very-interesting-plan” that will result in small-enough Penalties relative to the u_A values it achieves (overall yielding larger u''_A values compared to “conventional” plans that we can imagine).
This is why we have intent verification: we indeed cannot come up with all possible workarounds beforehand, so we screen off interesting plans. If we can find strong formal support that intent verification weeds out bad impact workarounds, the question now becomes: would a normal u_A-maximizing plan also happen to somehow skirt the impact measure? This seems unlikely, but I left it as an open question.
It seems that to assert that this doesn’t work for normal behavior is to assert that there is somehow a way to accomplish your goals to an arbitrary degree at minimal cost of resources. But if this is the case, then that’s scaled away by a smaller ImpactUnit!
It’s true that as you gain optimization power, you’re better able to use your impact budget, but it’s still a budget, and we’re still starting small, so I don’t see why we would expect to be shocked by an N-incrementation’s effects, if we’re being prudent.
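As a small illustration of the “scaled away by a smaller ImpactUnit” point above, with entirely made-up numbers: if the environment lets goals be achieved with very few resources, then the mild reference action that defines ImpactUnit is correspondingly cheap, so a plan’s penalty relative to the N·ImpactUnit budget stays roughly the same.

```python
# Hypothetical numbers only: the claim is about the ratio, not these values.

def scaled_penalty(raw_penalty, N, impact_unit):
    """Penalty expressed as a fraction of the N * ImpactUnit budget."""
    return raw_penalty / (N * impact_unit)

N = 10
print(scaled_penalty(raw_penalty=50.0, N=N, impact_unit=1.0))  # 5.0
# In a world where everything is cheap, both numbers shrink together:
print(scaled_penalty(raw_penalty=5.0, N=N, impact_unit=0.1))   # 5.0 again
```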
If your action a lets you better u_A-maximize, then your reformulated 2) is exactly what Theorem 1 is about. That plan doesn’t exist.
I’m confused about Theorem 1. When it says:
clearly at least one such u exists.
as I understand it, the theorem proves such a function generally exists. How do you know if such a function exists in the specific U that you chose?
It’s true that as you gain optimization power, you’re better able to use your impact budget, but it’s still a budget, and we’re still starting small, so I don’t see why we would expect to be shocked by an N-incrementation’s effects, if we’re being prudent.
This seems to assume some continuity-like property that I don’t have an intuition for. Suppose the agent follows the plan (∅,...,∅) for some N. I have no intuition that incrementing N even slightly is safe.
There is not presently a proof for finite U, which I tried to allude to in my first comment:
Now, why should this penalty be substantial, and why should it hold for finite sets U?
The points there are part of why I think it does hold for advanced agents.
This seems to assume some continuity-like property that I don’t have an intuition for.
This is because of the anti-survival incentive and, by extension, the approval incentives: it seems implausible that the first plan which moves the agent somewhat towards its goal is also one that takes its survival chances from whatever they are normally all the way to almost 1. In fact, it seems that there is a qualitative jump from “optimizing somewhat, in a way that we approve of, without changing shutdown likelihood” to “acting incorrigibly and in an instrumentally convergent manner to maximize”.
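As a toy illustration of the anti-survival-incentive argument (all numbers here are made up): if attainable u_1, the survival/wireheading proxy, roughly tracks the agent’s chance of remaining operational, then a plan that drives that chance toward 1 produces a large shift relative to the no-op baseline, while modest, approved-of optimization does not.

```python
# Made-up numbers; u_1's attainable value is treated as a stand-in for the
# agent's chance of not being shut down.

BASELINE_SURVIVAL = 0.9  # hypothetical chance of surviving while doing nothing

def u1_shift(plan_survival, baseline=BASELINE_SURVIVAL):
    """Change in attainable u_1 versus inaction, using survival chance as a proxy."""
    return abs(plan_survival - baseline)

print(round(u1_shift(0.9), 3))    # 0.0   -> modest, approved-of optimization: no u_1 penalty
print(round(u1_shift(0.999), 3))  # 0.099 -> incorrigible power grab: a jump the penalty can see
```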