If your action a lets you better u_A-maximize, then your reformulated 2) is exactly what Theorem 1 is about. That plan doesn’t exist.
I’m confused about Theorem 1. When it says:
clearly at least one such u exists.
as I understand it, the theorem proves that such a function exists in general. How do you know whether such a function exists in the specific U that you chose?
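To make my worry concrete, here's a toy sketch (the utility functions and Q-values are all made up, and this is not the post's actual construction) where an action goes unpenalized simply because no u in the particular finite U I picked registers it:

```python
# Toy sketch (all numbers and utility functions invented; not the post's
# actual construction). Whether an action gets penalized depends entirely
# on which u happen to be in the finite U we picked.

# Attainable-utility ("Q") values for the no-op and for some action a,
# under three arbitrarily chosen utility functions.
U = {
    "u_paperclips": {"noop": 5.0, "a": 5.0},  # a doesn't change attainable paperclips...
    "u_energy":     {"noop": 3.0, "a": 3.0},  # ...or attainable energy...
    "u_survival":   {"noop": 2.0, "a": 2.0},  # ...or attainable survival-value.
}

def penalty(action):
    """Summed shift in attainable utility across the chosen finite U."""
    return sum(abs(q[action] - q["noop"]) for q in U.values())

print(penalty("a"))  # 0.0: no u in this particular U notices the action,
                     # even though some function outside U might have.
```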
It’s true that as you gain optimization power, you’re better able to use your impact budget, but it’s still a budget, and we’re still starting small, so I don’t see why we would expect to be shocked by an N-incrementation’s effects, if we’re being prudent.
This seems to assume some continuity-like property that I don’t have an intuition for. Suppose the agent follows the plan (∅,...∅) for some N. I have no intuition that incrementing N even slightly is safe.
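To illustrate the kind of jump I have in mind, here's a minimal sketch, assuming the agent picks the plan maximizing goal value minus a penalty scaled down by N (a stand-in for the actual objective; all plans and numbers are invented):

```python
# Toy sketch of the worry (all plans and numbers invented; the selection rule
# below is only a stand-in for the actual N-scaled objective).

plans = {
    "(∅,...,∅)":        {"goal": 0.0,  "penalty": 0.0},
    "mild plan":        {"goal": 1.0,  "penalty": 1.9},
    "high-impact plan": {"goal": 10.0, "penalty": 30.0},
}

def chosen_plan(N):
    # Stand-in rule: maximize goal value minus the penalty scaled down by N.
    return max(plans, key=lambda p: plans[p]["goal"] - plans[p]["penalty"] / N)

for N in range(1, 6):
    print(N, chosen_plan(N))
# N=1: (∅,...,∅)   N=2-3: mild plan   N=4-5: high-impact plan
```

Nothing about the step from N=3 to N=4 looks any different from the step from N=2 to N=3 until you see which plan comes out the other side.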
There is not presently a proof for finite U, which I tried to allude to in my first comment:
Now, why should this penalty be substantial, and why should it hold for finite sets U?
The points there are part of why I think it does hold for advanced agents.
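To be explicit about what would need proving, and glossing over the scaling and long-horizon details (so treat this as a rough paraphrase rather than the exact construction): the core of the penalty is the summed shift in attainable utilities over the chosen finite U,

$$\text{Penalty}(s,a) \;=\; \sum_{u \in U} \left| Q_u(s,a) - Q_u(s,\varnothing) \right|,$$

and the open claim is that any plan which substantially advances u_A, or substantially shifts the agent's survival odds, must move Q_u for some u in the finite U by enough that this sum stays large relative to the N-scaled budget.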
This seems to assume some continuity-like property that I don’t have an intuition for.
This is because of the anti-survival incentive and, by extension, the approval incentives: it seems implausible that the first plan which moves the agent somewhat towards its goal is also one that takes its survival chances from whatever they are normally all the way up to almost 1. In fact, there seems to be a qualitative jump from “optimizing somewhat, in a way that we approve of, that doesn’t change shutdown likelihood” to “acting incorrigibly and in an instrumentally convergent manner to maximize”.