Hi Charbel, thanks for your interest, great question.
If the balance favors offense, we would die anyway despite a successful alignment project: in a world with many AGIs, there will always be either a bad actor or someone who accidentally fails to align their takeover-level AI. (I tend to think of this as Murphy's law for AGI.) Therefore, anyone claiming that their alignment project reduces existential risk must also think their aligned AI can somehow stop an unaligned AI, i.e. a favorable offense/defense balance.
There are some other options:
Some believe the first AGI will take off to ASI straight away and block other projects by default. I think that's at least not certain; the labs, for example, don't seem to believe it. Note also that blocking is illegal.
Some believe the first AGI will take off to pivotal-act capability and perform a pivotal act. I think there's at least a chance that won't happen. Note also that pivotal acts are illegal.
It could be that we regulate AI so that no unsafe projects can be built, using e.g. a conditional AI safety treaty. In this case, neither alignment nor a positive offense/defense balance is needed.
It could be that we get MAIM (mutually assured AI malfunction). In this case, too, neither alignment nor a positive offense/defense balance is needed.
Barring these options, though, we seem to need not only AI alignment but also a positive offense/defense balance.
Some more on the topic: https://www.lesswrong.com/posts/2cxNvPtMrjwaJrtoR/ai-regulation-may-be-more-important-than-ai-alignment-for
Thanks for writing this out! I see this as a possible threat model, and although I think it's far from the only possible threat model, I do think it's likely enough to prepare for. Below is a list of ~disagreements, or different ways to look at the problem that I think are just as valid. Notably, I end up with technical alignment being much less of a crux, and regulation more of one.
This is a relatively minor point for me, but let me still make it: I think it's not obvious that the same companies will remain in the lead. There are arguments for this, such as a decisive data-availability advantage for the first movers. Still, seeing how quickly e.g. DeepSeek could (almost) catch up, I think it's not unlikely that other companies, government projects, or academic projects will take over the lead. This partially has to do with me being skeptical that huge scaling is required for AGI (which is, in the end, an attempt to reproduce a ten-watt device: us). Unfortunately, I think this makes the risks a lot larger, since it makes governance more difficult.
I’m not sure technical alignment would have been able to solve this scenario. Technically aligned systems could be intent-aligned (seems most likely), value-aligned, or aligned to coherent extrapolated volition. If such systems got the same amount of power, I think this would likely still lead to a takeover, and still to a profoundly dystopian outcome, possibly with >90% of humanity dying.
This scenario is only one threat model. We should understand that there are at least a few more that also lead to human extinction. It would be a mistake to focus only on solving this one (and a mistake to focus only on solving technical alignment).
Since this threat model is relatively slow, gradual, and obvious (the public will see ~everything until the actual takeover happens), I’m somewhat less pessimistic about our chances (maybe “only” a few percent xrisk), because I think AI would likely get regulated, which I think could save us for at least decades.
I don’t think solving technical alignment would be sufficient to avoid this scenario, but I also don’t think it would be required. Basically, I don’t see solving technical alignment as a crux for avoiding this scenario.
I think the best way to avoid this scenario is traditional regulation: after model development, at the point of application. If the application looks too powerful, let’s not put an AI there. The EU AI Act, for example, makes a start with this (although such regulation would need to include the military as well, and would likely need ~global implementation, which is no trivial campaigning task).
Solving technical alignment (sooner) could actually be net negative for avoiding this threat model. If we can’t get an AI to reliably do what we tell it to do (the current situation), who would put it in a powerful position? Solving technical alignment might open the door to applying AI in powerful positions, thereby enabling this threat model rather than avoiding it.
Despite these significant disagreements, I welcome the effort by the authors to write out their threat model. More people should do so. And I think their scenario is likely enough that we should put effort into trying to avoid it (although imo via regulation, not via alignment).