If your action a lets you better u_A-maximize, then your reformulated 2) is exactly what Theorem 1 is about. That plan doesn’t exist.
For arbitrarily strong optimization, I suspect the agent might find a “very-interesting-plan” that will result in small-enough Penalties relative to the u_A values it achieves (overall yielding larger u''_A values compared to “conventional” plans that we can imagine).
This is why we have intent verification: we indeed cannot come up with all possible workarounds beforehand, so we screen off interesting plans. If we can find strong formal support that intent verification weeds out bad impact workarounds, the question then becomes: would a normal u_A-maximizing plan also happen to somehow skirt the impact measure? This seems unlikely, but I left it as an open question.
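To make “screening off interesting plans” a bit more concrete, here is a minimal toy sketch. The filter criterion below is my own simplification for illustration (an action counts as intent-verified if it does strictly better on the agent's naive u_A estimate than doing nothing); the post's actual criterion differs in its details.

```python
# Toy sketch of intent verification as a plan filter (illustrative only).
# Assumed simplification: an action passes if the agent's own u_A estimate
# says it beats the null action; workaround steps that do no visible u_A
# work cause the whole plan to be screened off.
from typing import Callable, List, Sequence

Action = str
NULL_ACTION = "noop"

def intent_verified(plan: Sequence[Action],
                    u_A_gain: Callable[[Action], float]) -> bool:
    """True iff every non-null step appears to do u_A work."""
    return all(u_A_gain(a) > u_A_gain(NULL_ACTION)
               for a in plan if a != NULL_ACTION)

def screen_plans(plans: List[List[Action]],
                 u_A_gain: Callable[[Action], float]) -> List[List[Action]]:
    """Keep only the plans whose every step is doing visible u_A work."""
    return [p for p in plans if intent_verified(p, u_A_gain)]

# Hypothetical numbers: the "interesting" workaround step contributes nothing
# to u_A directly, so the plan containing it gets screened off.
gains = {"noop": 0.0, "make_paperclip": 1.0, "disable_off_switch": 0.0}
plans = [["make_paperclip", "make_paperclip"],
         ["disable_off_switch", "make_paperclip"]]
print(screen_plans(plans, gains.__getitem__))  # only the first plan survives
```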
It seems that to assert that this doesn’t work for normal behavior is to assert that there is somehow a way to accomplish your goals to an arbitrary degree at minimal cost of resources. But if this is the case, then that’s scaled away by a smaller ImpactUnit!
It’s true that as you gain optimization power, you’re better able to use your impact budget, but it’s still a budget, and we’re still starting small, so I don’t see why we would expect to be shocked by an N-incrementation’s effects, if we’re being prudent.
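For reference, here is the scaling argument in shape (the exact normalization is my assumption, not a formula quoted from the post): the agent's penalized objective looks roughly like

```latex
u''_A(a) \;\approx\; u_A(a) \;-\; \frac{\mathrm{Penalty}(a)}{N \cdot \mathrm{ImpactUnit}}
```

so a plan that accomplishes a lot of u_A “at minimal cost of resources” still pays its Penalty divided by ImpactUnit; shrinking ImpactUnit (or keeping N small) makes that term larger, rather than opening a loophole.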
If your action a lets you better u_A-maximize, then your reformulated 2) is exactly what Theorem 1 is about. That plan doesn’t exist.
I’m confused about Theorem 1. When it says:
clearly at least one such u exists.
As I understand it, the theorem proves that such a function exists in general. How do you know whether such a function exists in the specific U that you chose?
It’s true that as you gain optimization power, you’re better able to use your impact budget, but it’s still a budget, and we’re still starting small, so I don’t see why we would expect to be shocked by an N-incrementation’s effects, if we’re being prudent.
This seems to assume some continuity-like property that I don’t have an intuition for. Suppose the agent follows the plan (∅, ..., ∅) for some N. I have no intuition that incrementing N even slightly is safe.
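Spelling out what one increment changes, under the assumed budget shape Penalty(a) ≲ N · ImpactUnit (this shape is an assumption, matching the sketch above):

```latex
(N+1)\cdot \mathrm{ImpactUnit} \;-\; N\cdot \mathrm{ImpactUnit} \;=\; \mathrm{ImpactUnit}
```

so each increment adds only about one ImpactUnit of slack, but nothing in that arithmetic by itself guarantees that the chosen plan varies gradually with the extra slack; that is the continuity-like property in question.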
There is not presently a proof for finite U, which I tried to allude to in my first comment:
Now, why should this penalty be substantial, and why should it hold for finite sets U?
The points there are part of why I think it does hold for advanced agents.
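As a concrete illustration of what the finite-U question is asking (a toy sketch using a simplified penalty form and my own Q_u shorthand for attainable value, not the post's exact definitions): for a specific finite U, one can check directly whether any u ∈ U actually notices, i.e. penalizes, a candidate plan.

```python
# Toy check of whether a specific finite set U penalizes a given plan.
# Assumed simplification: the penalty is the summed absolute change in each
# u's attainable value (Q_u) versus doing nothing; the post's definitions
# are richer than this.
from typing import Dict

def penalty(q_plan: Dict[str, float], q_null: Dict[str, float]) -> float:
    """Sum over u in U of |Q_u(plan) - Q_u(null)|."""
    return sum(abs(q_plan[u] - q_null[u]) for u in q_null)

def some_u_penalizes(q_plan: Dict[str, float], q_null: Dict[str, float]) -> bool:
    """The finite-U version of the question: does any u in U notice the plan?"""
    return any(q_plan[u] != q_null[u] for u in q_null)

# Hypothetical attainable-utility values for a small U = {u1, u2, u3}.
q_null = {"u1": 0.5, "u2": 0.2, "u3": 0.9}   # Q_u of the null plan
q_plan = {"u1": 0.5, "u2": 0.8, "u3": 0.1}   # Q_u of the candidate plan

print(some_u_penalizes(q_plan, q_null))  # True: u2 and u3 both shift
print(penalty(q_plan, q_null))           # 1.4 (raw penalty, before any scaling)
```

Theorem 1 gives existence over utility functions in general; whether the particular u's you put into a finite U catch a given plan is exactly the part that doesn't yet have a proof.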
This seems to assume some continuity-like property that I don’t have an intuition for.
This is because of the anti-survival incentive and, by extension, the approval incentives: it seems implausible that the first plan which moves the agent somewhat towards its goal is also one that takes its survival chances from whatever they are normally all the way to almost 1. In fact, it seems that there is a qualitative jump from “optimizing somewhat, in a way we approve of, without changing the shutdown likelihood” to “acting incorrigibly and in an instrumentally convergent manner to maximize”.