I don’t see how, if we also want it to be shutdown safe. After all, its model of us could be incorrect, so we might (to its surprise) want to shut it down—without its plans then having predictably higher impact than intended. To me, the prefix method seems more desirable in that way.
There isn’t in that case; however, from Daniel’s comment (which he was using to make a somewhat different point):
AUP thinks very differently about building a nuclear reactor and then adding safety features than it does about building the safety features and then the dangerous bits of the nuclear reactor
I find this reassuring. If we didn’t have this, we would admit plans which are only low impact if not interrupted.
I don’t see why that’s necessary, since we‘re still able to do both plans?
Looking at it from another angle, agents which avoid freely putting themselves (even temporarily) in instrumentally convergent positions seem safer with respect to unexpected failures, so it might also be desirable in this case even though it isn’t objectively impactful in the classical sense.
I’m just trying to figure out if things could be neater. Many low-impact plans accidentally share prefixes with high-impact plans, and it feels weird if many of our orders semi-randomly require tweaking N.
I see, so the AI will avoid prefixes of high impact plans. Can we make it avoid high impact plans only?
I don’t see how, if we also want it to be shutdown safe. After all, its model of us could be incorrect, so we might (to its surprise) want to shut it down—without its plans then having predictably higher impact than intended. To me, the prefix method seems more desirable in that way.
What’s the high impact if we shut down the AI while it’s downloading the movie?
There isn’t in that case; however, from Daniel’s comment (which he was using to make a somewhat different point):
I find this reassuring. If we didn’t have this, we would admit plans which are only low impact if not interrupted.
Is it possible to draw a boundary between Daniel’s case and mine?
I don’t see why that’s necessary, since we‘re still able to do both plans?
Looking at it from another angle, agents which avoid freely putting themselves (even temporarily) in instrumentally convergent positions seem safer with respect to unexpected failures, so it might also be desirable in this case even though it isn’t objectively impactful in the classical sense.
I’m just trying to figure out if things could be neater. Many low-impact plans accidentally share prefixes with high-impact plans, and it feels weird if many of our orders semi-randomly require tweaking N.
That’s a good point, and I definitely welcome further thought along these lines. I’ll think about it more as well!