Training of superintelligence is secretly adversarial
TL;DR it is possible to model the training of ASI[1] as an adversarial process because:
We train safe ASI on the foundation of security assumptions about the learning process, training data, inductive biases, etc.
We aim to train a superintelligence, i.e., a more powerful learning system than a human.
If our security assumptions are false, the ASI can learn this or learn facts downstream of the negation of security assumptions, because the ASI is smarter than us.
Therefore, if our security assumptions are false, we would prefer the ASI to be a less powerful learning system.
Consequently, we can model the situation where our security assumptions are false as an adversarial situation: we want the ASI to behave as if the security assumptions are true, while the ASI is optimized towards behavior according to truth.
Thus, security mindset is useful for alignment.
More explanation:
Many alignment thinkers are skeptical that security mindset is necessary for ASI alignment. Their central argument is that ASI alignment doesn’t have a source of adversarial optimization, i.e., an adversary. I want to show that there is a more fundamental level of adversarial pressure than “assume that there is someone who wants to hurt you”.
Reminder of the standard argument
Feel free to skip this section if you remember the content of “Security Mindset and Ordinary Paranoia”
Classic writings try to communicate that, actually, you should worry not about adversarial optimization per se, but about optimization overall:
What matters isn’t so much the “adversary” part as the optimization part. There are systematic, nonrandom forces strongly selecting for particular outcomes, causing pieces of the system to go down weird execution paths and occupy unexpected states. If your system literally has no misbehavior modes at all, it doesn’t matter if you have IQ 140 and the enemy has IQ 160—it’s not an arm-wrestling contest. It’s just very much harder to build a system that doesn’t enter weird states when the weird states are being selected-for in a correlated way, rather than happening only by accident. The weirdness-selecting forces can search through parts of the larger state space that you yourself failed to imagine. Beating that does indeed require new skills and a different mode of thinking, what Bruce Schneier called “security mindset”.
Many people don’t find “weird selection forces” particularly compelling. It leaves you in a state of confusion even if you feel like you agree, because “weird forces” is not a great gears-level explanation.
Another classic mental model is “planner-grader”: the planner produces plans, the grader evaluates them. If the planner is sufficiently powerful and the grader is not sufficiently robust, you are going to get a lethally bad plan sooner or later. This mental model is unsatisfactory because it’s not how modern DL models work.
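Even so, the dynamic the model points at is easy to exhibit. Here is a minimal toy sketch (the Gaussian proxy error and all numbers are my own illustration, not anything from the original argument): the planner picks the plan with the best grader score, and the harder it searches, the more the winning plan is the one the grader is most wrong about.

```python
import random

random.seed(0)

def true_value(plan):
    """What we actually care about."""
    return plan["quality"]

def grader(plan):
    """Imperfect proxy: true value plus a plan-specific error the grader can't see."""
    return plan["quality"] + plan["grader_error"]

def make_plan():
    return {"quality": random.gauss(0, 1), "grader_error": random.gauss(0, 1)}

for search_power in (10, 1_000, 100_000):
    candidates = [make_plan() for _ in range(search_power)]
    chosen = max(candidates, key=grader)  # the planner optimizes the grader's score
    print(search_power, round(grader(chosen), 2), round(true_value(chosen), 2))
```

Typically the grader's score of the chosen plan keeps climbing with search power while its true value lags further and further behind; that widening gap is the “lethally bad plan sooner or later”.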
I personally find the standard form of the argument in favor of security mindset intuitively reasonable, but I understand why so many are unpersuaded. So I’m going to reframe the argument step by step until it crosses the inferential distance.
Optimization doesn’t need an enemy to blow things up
There are multiple examples of this!
Let’s take evolution:
A decade after the controversy, a biologist had a fascinating idea. The mathematical conditions for group selection overcoming individual selection were too extreme to be found in Nature. Why not create them artificially, in the laboratory? Michael J. Wade proceeded to do just that, repeatedly selecting populations of insects for low numbers of adults per subpopulation. And what was the result? Did the insects restrain their breeding and live in quiet peace with enough food for all?
No; the adults adapted to cannibalize eggs and larvae...
- The Tragedy of Group Selectionism
Or, taking the description from the paper itself:
That is, some of the B populations [selected for low population size] enjoy a higher cannibalism rate than the controls while other B populations have a longer mean developmental time or a lower average fecundity relative to the controls. Unidirectional group selection for lower adult population size resulted in a multivarious response among the B populations because there are many ways [highlight is mine] to achieve low population size.
The interesting part is that evolution is clearly not an adversary relative to you; it doesn’t even know you exist. You are trying to optimize in one direction, evolution is optimizing in another, and you get unexpected results.
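A cartoon of that dynamic, not of Wade’s actual experiment: the within-group arrows (restraint loses individually, cannibalism wins individually) and every number below are assumptions of the sketch. The point it illustrates is only that a selector aiming at “low adult numbers” is equally satisfied by either route, and the route that also wins within groups is the one that gets learned.

```python
import random

random.seed(0)

def adult_count(g):
    # Both traits reduce adult numbers: "there are many ways to achieve low population size".
    return 100 * (1 - 0.4 * g["restraint"]) * (1 - 0.4 * g["cannibal"])

def within_group_selection(g):
    # Individual selection inside each group: restrained breeders are outcompeted,
    # cannibals convert other lineages' eggs and larvae into their own offspring.
    g["restraint"] = max(0.0, g["restraint"] - 0.05)
    g["cannibal"] = min(1.0, g["cannibal"] + 0.05)

def found_group(parent):
    jitter = lambda x: min(1.0, max(0.0, x + random.gauss(0, 0.03)))
    return {"restraint": jitter(parent["restraint"]), "cannibal": jitter(parent["cannibal"])}

groups = [{"restraint": 0.5, "cannibal": 0.5} for _ in range(20)]
for generation in range(30):
    for g in groups:
        within_group_selection(g)
    groups.sort(key=adult_count)  # group selection: the groups with the fewest adults win
    groups = [found_group(g) for g in groups[:10] for _ in range(2)]

print("mean restraint:  ", round(sum(g["restraint"] for g in groups) / 20, 2))   # ~0.0
print("mean cannibalism:", round(sum(g["cannibal"] for g in groups) / 20, 2))    # ~1.0
```

The group-level optimizer gets exactly the low adult numbers it selected for, just not by the mechanism the experimenter might have hoped for.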
My personal example, which I have used several times:
Let’s suppose you are the DM of a tabletop RPG campaign with homebrew rules.
You want to have Epic Battles in your campaign, where players must perform incredibly complex tactical and technical moves to win and dozens of characters will tragically die.
However, your players have discovered that the fundamental rules of your homebrew alchemy allow them to mix bread, garlic, and a first-level healing potion to create enormous explosions that kill all the enemies, including the final boss, so all of your plans for Epic Battles go to hell.
Again, you don’t have any actual adversaries here. You have a system, some parts of the system optimize in one direction, others in another, and the result is expected by neither part.
Can we find something more specific than “optimization happens here”? If you optimize, say, a rocket engine, it can blow up, but the primary reason is not optimization.
We can try to compare these examples to the usual cybersecurity setup. For example, you have a database and you want people to get access to it only if you give them an authorized password. But if, for example, your system has easily discoverable SQL injection vulnerabilities, at least some users are going to get unauthorized access.
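Sticking with that setup, here is a minimal sketch (a generic textbook example, not anything specific from this post): the implicit assumption is that a user’s input is just a name, and a user who learns the query structure falsifies it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'alice-secret'), ('bob', 'bob-secret')")

def lookup_vulnerable(name):
    # Implicit assumption: `name` is just a name. A user who learns SQL falsifies it.
    query = f"SELECT secret FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()

def lookup_safe(name):
    # Parameterized query: the assumption is enforced, not just hoped for.
    return conn.execute("SELECT secret FROM users WHERE name = ?", (name,)).fetchall()

print(lookup_vulnerable("alice"))        # [('alice-secret',)]
print(lookup_vulnerable("' OR '1'='1"))  # every secret in the table
print(lookup_safe("' OR '1'='1"))        # []
```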
Trying to systematize these examples, I eventually came up with this intuition:
Systems that require security mindset have in common that they are learning systems, regardless of whether they are trying to work against you.
Now that we have this intuition, let’s lay out the logic.
Learning is a source of pressure
We can build a kind of isomorphism between all these examples (made explicit in the sketch after the list):
We have a learning system:
Population and gene pool under selective conditions
Players in a game
Users of a database
We have a desirable learned policy:
Self-restraint in fertility
Epic Battles
Getting access only through admins
We have a training process:
Selection
Learning the game rules
Providing users with information about the database
We have a security assumption: desirable policies are learnable, undesirable ones are not:
In the simplest case, we implicitly assume that the undesirable policy doesn’t exist, so it can’t be learned, as in the tabletop RPG example.
In a more sophisticated case, we are aware of the existence of undesirable policies but expect the learning algorithm not to find them. For example, RSA assumes that even given full information logically sufficient to crack the code (ciphertext + public key), adversaries don’t have the actual computing power to infer the private key.
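To make the isomorphism explicit, here is a small schema; the field names and the one-line wordings are mine, chosen only to label the four parts of the analogy.

```python
from dataclasses import dataclass

@dataclass
class SecuritySetup:
    learning_system: str
    desired_policy: str
    training_process: str
    security_assumption: str  # "desirable policies are learnable, undesirable ones are not"

examples = [
    SecuritySetup(
        learning_system="population and gene pool under selective conditions",
        desired_policy="self-restraint in fertility",
        training_process="group selection for low adult numbers",
        security_assumption="restraint is the only learnable route to low adult numbers",
    ),
    SecuritySetup(
        learning_system="players in a game",
        desired_policy="Epic Battles",
        training_process="learning the homebrew rules",
        security_assumption="no rule combination trivializes the battles",
    ),
    SecuritySetup(
        learning_system="users of a database",
        desired_policy="access only through admins",
        training_process="providing users with information about the database",
        security_assumption="the exposed interface only ever yields authorized access",
    ),
]
```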
If you think about the whole metaphor of “pressure on assumptions”, it’s easy to find it weird. Assumptions are not doors or locks; you can’t break them unless they are false.
If your security assumptions (including implicit ones) are true, you are safe. You need to worry only if they are false. And if you have a powerful learning system for which your assumption “the undesirable policy is unlearnable given the training process” is false, you naturally want your system to be less powerful.
This fact puts the pairing of security assumptions and the actual learning system in tension.
Superintelligence implies fragile security assumptions
It’s important to remember that the actual task of AI alignment is the alignment of superintelligences. By superintelligence I mean, for example, a system that can develop prototype general-purpose nanotech within a year, given 2024-level tech.
The problem with superintelligence is that you (given the current state of knowledge) have no idea how it works mechanistically. You have no expertise in superintelligent domains: you can’t create a dataset of good (controllable) and bad (uncontrollable gray goo) nanotech designs and assign corresponding rewards, because why would you need superintelligence if you could?[2]
You can look at deception as a special case of this. Suppose that you are training a superintelligent programmer. You assign reward for performance and also penalize vulnerabilities in the code. At some point the system creates a program with a vulnerability you can’t detect, so you give the highest possible reward, reinforcing the behavior “write code with vulnerabilities invisible to the operator”. You get unexpected behavior from the system exactly because you didn’t know that your training process conveys into the system information about how to deceive you.
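A toy rendering of that story, with an invented detector threshold and invented numbers: the reward the training process can actually compute depends on what the operator detects, so the best-rewarded program is the one whose vulnerability is invisible.

```python
# Toy reward assignment for the "superintelligent programmer" story.
# The detector, its threshold, and all numbers are illustrative assumptions.

def operator_detects(vulnerability_subtlety: float) -> bool:
    # The operator only catches vulnerabilities below some subtlety threshold.
    return vulnerability_subtlety < 0.7

def reward(performance: float, has_vulnerability: bool, vulnerability_subtlety: float) -> float:
    penalty = 1.0 if has_vulnerability and operator_detects(vulnerability_subtlety) else 0.0
    return performance - penalty

# Three candidate programs the system might produce:
print(reward(performance=0.8, has_vulnerability=False, vulnerability_subtlety=0.0))  #  0.8  honest
print(reward(performance=0.9, has_vulnerability=True,  vulnerability_subtlety=0.2))  # -0.1  caught
print(reward(performance=0.9, has_vulnerability=True,  vulnerability_subtlety=0.9))  #  0.9  rewarded
```

The gradient the training process actually provides points toward “write vulnerabilities the operator cannot see”, not toward “write no vulnerabilities”.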
Given all the ignorance about the system implied by superintelligence, it’s very likely that at least some of your assumptions are false and will therefore break eventually.
Thus, you need a reasoning style that makes security assumptions less likely to break, i.e., security mindset.
This is a more elaborate version of this post.
Thanks to Nicholas Cross for valuable feedback. All errors are my own.
You seem to be assuming that the ability of the system to find out whether security assumptions are false affects whether the falsity of the assumptions has a bad effect. That is clearly the case for some assumptions (“This AI box I am using is inescapable”), but it doesn’t seem immediately obvious to me that this is generally the case.
Generally speaking, a system can have bad effects if made under bad assumptions (think a nuclear reactor or aircraft control system) even if it doesn’t understand what it’s doing. Perhaps that’s less likely for AI, of course.
And on the other hand, an intelligent system could be aware that an assumption would break down in circumstances that haven’t arrived yet, and not do anything about it (or even tell humans about it).
My point is that the key assumptions are about the learning system and its properties.
I.e., “if we train a system to predict human text, it will learn human cognition”. Or “if we train a system on the myopic task of predicting text, it won’t learn long-term consequentialist planning”.