One reason I don’t support/endorse PauseAI, which I haven’t seen addressed much beyond the usual objections, is that there probably won’t be many warning shots that can actually affect policy, at least conditional on misalignment being a serious problem (which doesn’t translate to a >50% probability of doom). The reason is that the most likely takeover plan (at least assuming no foom/software intelligence explosion) fundamentally relies not on killing people, but on launching internal rogue deployments to sabotage alignment work and figuring out a way to control the AI company’s compute, since causing a catastrophe/existential risk is much harder than launching an internal rogue deployment (absent defenses).
So PauseAI’s theory of change fundamentally requires that we live in a world where both alignment is hard and effective warning shots exist, and this conjunction is quite unlikely to hold, especially since pausing is likely not the most effective action you could take from a comparative-advantage perspective.
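To make the conjunction concrete, here is a toy calculation. Both probabilities are illustrative placeholders I am supplying for the sketch, not estimates from this post or the linked comment:

```python
# Toy sketch: PauseAI's theory of change pays off only in worlds where
# BOTH alignment is hard AND a policy-moving warning shot occurs.
# Both probabilities below are hypothetical placeholders for illustration.
p_alignment_hard = 0.5            # hypothetical: alignment turns out hard
p_warning_shot_given_hard = 0.2   # hypothetical: a warning shot still occurs

p_pause_pivotal = p_alignment_hard * p_warning_shot_given_hard
print(f"P(both conditions hold) = {p_pause_pivotal:.2f}")  # 0.10
```

Whatever numbers you plug in, the conjunction is necessarily smaller than either condition alone, which is the structural point.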
I’m not going to say that PauseAI is net-negative; it does have positive expected value, but IMO it’s far less than a lot of pause advocates claim:

https://www.lesswrong.com/posts/rZcyemEpBHgb2hqLP/ai-control-may-increase-existential-risk#jChY95BeDeptDpnZK

Important part of the comment:
> I think most of the effective strategies for AIs seeking power don’t involve escalating to something which is much more likely to trigger a strong response than “the AI company caught the AI trying to escape”. I think the best strategies are things like:
>
> - Sabotage a bunch of work done at the AI company, or possibly some work done externally. This includes things like sabotaging alignment work, backdooring robot armies, backdooring future training runs, etc.
> - Escape and then directly try to take over once your chances are sufficiently good that this is better than biding your time.
> - Generally try to manipulate and persuade such that AI takeover is easier and more likely.
>
> Of these, I think only escape could trigger a much stronger response if we catch it after it escalates somewhat rather than before. I don’t see how “we caught the AI trying to launch an unmonitored version of itself” is going to play out that differently from “we caught that the AI did launch an unmonitored version of itself”. Most of these strategies don’t escalate in a way that would trigger a stronger response, so catching one after the fact is similar to catching an attempt. (In some cases where reversion is possible, like work sabotage, there might be no meaningful distinction.) Further, without some effort on control, we might be much less likely to catch either! And, in some cases, the control measures I’m interested in focus on after-the-fact detection.