Comparing AI Safety-Capabilities Dilemmas to Jervis’ Cooperation Under the Security Dilemma
I’ve been skimming some things about the Security Dilemma (specifically Offense-Defense Theory) while looking for analogies for strategic dilemmas in the AI landscape.
I want to describe a simple comparison here, lightly held (and only lightly studied):
“AI Capabilities”—roughly, the ability to use AI systems to take (strategically) powerful actions—as “Offense”
“AI Safety”—roughly, the property that AI systems, as controlled and used, do not present a catastrophic/existential threat—as “Defense”
Now, re-creating the two values of each of the two variables of Offense-Defense Theory:
“Capabilities Advantaged”—organizations that disproportionately invest in capabilities get a strategic advantage.
“Safety Advantaged”—organizations that disproportionately invest in safety get a strategic advantage.[1]
“Capabilities Posture Not Distinguishable from Safety Posture”—from the outside, you can’t credibly determine how an organization is investing in Capabilities vs. Safety. (This is the case if very similar research methods/experiments are used to study both, or if the two are not technologically separable. My whole interest in this is in cultivating and clarifying this one point. This exists in the security dilemma as well, inasmuch as defensive missile interceptors can be used to offensively strike ranged targets, etc.)
“Capabilities Posture Distinguishable from Safety Posture”—it’s credibly visible from the outside whether an organization is disproportionately invested in Safety.
Finally, we can sketch out the Four Worlds of Offense-Defense Theory:
World 1 (Capabilities Advantaged / Not-Distinguishable): Doubly dangerous (Jervis’ words, but I strongly agree here). I also think this is the world we are in.
World 2 (Safety Advantaged / Not-Distinguishable): A strategic dilemma still exists, but it’s possible that a non-conflict equilibrium can be reached.
World 3 (Capabilities Advantaged / Distinguishable): No security dilemma, and other strategic responses can be made to obviously-capabilities-pursuing organizations. We can have warnings here that trigger or allow negotiations or other forms of strategic resolution.
World 4 (Safety Advantaged / Distinguishable): Doubly stable (Jervis’ words, and I also strongly agree).
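To make the 2x2 structure explicit, here’s a minimal sketch of the mapping from the two binary variables to the four worlds (the encoding and wording are mine, a paraphrase of the list above, not Jervis’ own framing):

```python
# Minimal sketch of the 2x2 structure above.
# The two booleans are the two variables; the returned string is the world.

def classify_world(safety_advantaged: bool, distinguishable: bool) -> str:
    """Map (Safety vs. Capabilities advantage, posture distinguishability) to one of the four worlds."""
    if not safety_advantaged and not distinguishable:
        return "World 1: doubly dangerous (capabilities advantaged, postures not distinguishable)"
    if safety_advantaged and not distinguishable:
        return "World 2: dilemma remains, but a non-conflict equilibrium may be reachable"
    if not safety_advantaged and distinguishable:
        return "World 3: no security dilemma; capabilities-pursuing orgs are visible, warnings possible"
    return "World 4: doubly stable (safety advantaged, postures distinguishable)"

# The post's claim: we are currently in World 1.
print(classify_world(safety_advantaged=False, distinguishable=False))
```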
From the paper linked in the title:
My Overall Take
If this analogy fits, I think we’re in World 1.
The analogy feels weak here; there are a bunch of mis-fits from the original (very deep) theory.
I think this is neat and interesting, but not really a valuable strategic insight.
I hope this is food for thought.
[1] This one is weird, and it’s hard for me to make convincing stories about the real version of this, as opposed to merely seeming to have a Safety posture—things like “recruiting”, “public support”, and “partnerships” can all come from merely seeming to adopt the Safety posture. (Though this is actually a feature of the security dilemma and offense-defense theory, too.)
I like pointing out this confusion. Here’s a grab-bag of some of the things I use it for, to try to pull them apart:
whether actors/institutions far from the compute frontier can produce breakthroughs in AI/AGI tech (juxtaposing “only the top 100 labs” vs. “a couple hackers in a garage”)
whether, once a sufficient AI/AGI capability is reached, it will be quickly optimized to use much less compute
the amount of “optimization pressure” (in terms of research effort) pursuing AI/AGI tech, and the likelihood that low-hanging fruit has been missed
how far AI/AGI research/products are from being the highest-value marginal use of compute, and how things would change if AI/AGI became the most profitable marginal use of compute
the legibility of AI/AGI research progress (e.g. in a high-overhang world, small/illegible labs can make lots of progress)
the likelihood that compute-control interventions change the trajectory of AI/AGI research
the asymmetry between compute-to-build (≈ training) and compute-to-run (≈ inference) for advanced AI/AGI technology (see the sketch after this list)
probably also others I’m forgetting
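To illustrate that last asymmetry concretely, here’s a hedged back-of-the-envelope sketch using the common approximations that dense-transformer training costs roughly 6·N·D FLOPs and inference costs roughly 2·N FLOPs per token (N = parameters, D = training tokens). The specific parameter, token, and query counts below are illustrative assumptions, not claims about any real system:

```python
# Back-of-the-envelope sketch of the build-vs-run compute asymmetry.
# Approximations: training ≈ 6 * N * D FLOPs; inference ≈ 2 * N FLOPs per token.
# All concrete numbers below are illustrative assumptions only.

N = 70e9                   # model parameters (assumed)
D = 1.4e12                 # training tokens (assumed)
tokens_per_query = 1_000   # tokens generated per query (assumed)

train_flops = 6 * N * D
inference_flops_per_query = 2 * N * tokens_per_query

queries_to_match_training = train_flops / inference_flops_per_query

print(f"Training compute:            {train_flops:.2e} FLOPs")
print(f"Inference compute per query: {inference_flops_per_query:.2e} FLOPs")
print(f"Queries whose total inference cost equals training cost: {queries_to_match_training:.1e}")
```

Under these (assumed) numbers, building the system costs a few billion times more compute than answering a single query, which is part of why compute-control interventions aimed at training look very different from ones aimed at deployment.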