There is not presently a proof for finite U, which I tried to allude to in my first comment:
Now, why should this penalty be substantial, and why should it hold for finite sets U?
The points there are part of why I think it does hold for advanced agents.
This seems to assume some continuity-like property that I don’t have an intuition for.
This is because of the anti-survival incentive and, by extension, the approval incentives: it seems implausible that the first plan which moves the agent somewhat towards its goal is also one that takes its survival chances from wherever they normally are all the way up to almost 1. In fact, there seems to be a qualitative jump from “optimizing somewhat, in a way that we approve of, without changing shutdown likelihood” to “acting incorrigibly and in an instrumentally convergent manner to maximize its goal”.
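To gesture at the shape of that claim (this is only an informal sketch in my own shorthand, not the missing proof, and the penalty form here is an assumption rather than anything from the post): write $s(\pi)$ for the probability that the agent avoids shutdown under plan $\pi$, and $s_0 = s(\varnothing)$ for the inaction baseline. If the penalty roughly tracks $|s(\pi) - s_0|$, then plans that make modest, approved progress have

$$s(\pi) \approx s_0 \quad \Rightarrow \quad \text{penalty} \approx 0,$$

while the instrumentally convergent plans are exactly the ones with $s(\pi) \approx 1 \gg s_0$, and so incur a large penalty. The qualitative jump is the claim that useful-but-approved plans cluster near $s_0$ while incorrigible plans cluster near $1$, rather than there being a smooth gradient of increasingly useful plans in between.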