An AI offense-defense symmetry thesis

Epistemic status: Not well argued for, haven’t spent much time on it, and it’s not very worked out. This thesis is not new at all, and is implicit in a lot (most?) of the discussion about x-risk from AI. The intended contribution of this post is to state the thesis explicitly in a way that can help make the alignment problem clearer. I can imagine the thesis is too strong as stated.


I sometimes hear, implicitly or explicitly, in people’s skepticism of AI risk the idea that we can just not build dangerous AI. I want to formulate a thesis that does part of the work of countering this idea.

If we want to be safe against misaligned AI, we need to defend the vulnerabilities that a misaligned AI system (or systems) could exploit to pose an existential threat. We need aligned AI that defends against misaligned AI, to make sure such misaligned AI poses no threat (either by ensuring it is never created, or by ensuring it can never accumulate the resources needed to be a threat).

In game theory terms: if we successfully build aligned AI systems, then we need to play an asymmetric game between humans with (aligned) defensive AI and potential (misaligned) offensive AI. It is not obvious that in an asymmetric game both sides need the same capabilities. The thesis I’m putting forward is that the defensive AI needs at least all the capabilities that the offensive AI would need to pose an existential threat: the defensive AI needs to be capable of starting from a set of resources X and doing all the damage that the offensive AI could do using resources X.

The/an AI offense-defense symmetry thesis: In order for aligned AI systems to defend indefinitely against existential threats from misaligned AI without themselves harming humanity’s potential, the defensive AI needs to have, broadly speaking, all the capabilities of the misaligned AI systems it defends against.

Note that the defender may need a higher (or lower) level of those capabilities than the attacker, and it may also need additional kinds of capabilities that the attacker doesn’t have.

I don’t think this thesis is exactly right, but I think it is more or less right. I won’t make a rigorous argument for it, but I’ll give some analogies and mechanisms:

Some analogies from offense-defense in human domains:

State security services defending against terrorists:

  • The terrorists are people with guns; the state security services are also people with guns. To do damage, one needs to be able to use guns to shoot people in a firefight. To defend, one also needs to be able to use guns to shoot people in a firefight.

  • To do a lot of damage as a terrorist, you need to search for vulnerabilities to exploit. To defend against this, you need to search for vulnerabilities to fix/defend against.

  • There are additional skills the defender needs, such as skills in tracking and monitoring suspects, which the attacker doesn’t need. The reverse seems to hold less often.

Computer security specialists defending against attacks:

  • Hackers and malware designers are people who understand computer systems and their vulnerabilities. IT security professionals are also just people who understand computer systems and their vulnerabilities.

  • Empirically, black-hat hackers and malware developers can contribute (once they change their motivations) to IT security, and security professionals could, if they wanted, use their knowledge to write malware.

Mechanisms for why this thesis might be true

  • Offense and defense both require the same world-model. If the attacker’s task is “find and exploit a vulnerability” and the defender’s task is “find and defend all vulnerabilities”, then while their planning tasks are different, both require a world-model of the domain in which they do this planning. E.g., both hackers and computer security professionals need to actually understand programming, operating systems, networking, and how vulnerabilities can arise in these systems.

  • Offense and defense both need to search for vulnerabilities. Even defenders need to be able to find vulnerabilities in their own systems, especially when they can’t just respond ex post to vulnerabilities found by attackers, because a single failure is sufficiently damaging.

  • Symmetric subgames in asymmetric games. Some asymmetric games in the real world have symmetric subgames. E.g., even though the conflict between an occupying power with a large military and a guerrilla force is an “asymmetric war”, at local points in the conflict there are situations that are just one team of soldiers with guns fighting another team of soldiers with guns. Hence both the occupying power and the guerrilla force need capabilities like marksmanship, tactical skill, coordination, and so forth, for basically the same reason.

A mechanism for why this might not be true

  • Defense can use abstraction to counter categories of attacks. Slightly metaphorically, the attacker’s problem is to prove “∃ vulnerability” (i.e., to find a vulnerability), while the defender’s problem is the dual: to prove “∀ secure” (i.e., to ensure there is no vulnerability); see the sketch below. But rather than searching through all vulnerabilities and defending against each one, the defender can execute a smaller number of general countermeasures, without understanding all the particular vulnerabilities and their exploits. This might require different capabilities.
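To make the duality concrete, here is a minimal formalization in Lean. The predicate `exploitable` and the space `V` of candidate vulnerabilities are hypothetical stand-ins for the sake of illustration, not part of any real model:

```lean
-- "No vulnerability exists" is the same proposition as
-- "every candidate vulnerability is non-exploitable".
example (V : Type) (exploitable : V → Prop) :
    (¬ ∃ v, exploitable v) ↔ (∀ v, ¬ exploitable v) :=
  ⟨-- Defender direction: if no exploit exists, any candidate exploit is refuted.
   fun h v hv => h ⟨v, hv⟩,
   -- Attacker direction: a single found exploit contradicts "all secure".
   fun h hex => hex.elim fun v hv => h v hv⟩
```

This also shows why a single missed vulnerability suffices for the attacker: one witness to the existential refutes the defender’s universal.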

For example, in computer security, rather than searching for vulnerabilities (e.g., by white-hat hacking) and fixing them, one can work on formally verified software to rule out whole categories of vulnerabilities. This requires very different capabilities than hacking does.
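As a toy illustration of this kind of defense, here is a sketch in Lean (`safeGet` is a hypothetical example, not a real security tool): by giving the index the dependent type `Fin xs.length`, out-of-bounds access becomes unrepresentable, so the whole category of index errors is ruled out without enumerating any particular bad index.

```lean
-- The type `Fin xs.length` only contains indices i with i < xs.length,
-- so the type checker rules out every out-of-bounds access at once.
def safeGet (xs : List Nat) (i : Fin xs.length) : Nat :=
  xs.get i
```

The defender’s artifact here is a type discipline and a proof obligation, which calls on a very different skill set than the attacker’s hunt for a concrete exploitable index.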

Conclusion

This thesis would imply that, in order to permanently defend against existential risk from AI without permanently crippling humanity, we need to develop aligned AI that has all the capabilities necessary to cause such an existential catastrophe. In particular, we cannot build weak AI to defend against strong AI.