Planned newsletter summary:

Recently, I suggested the following broad model: The way you build things that are useful and do what you want is to understand how things work and put them together in a deliberate way. If you put things together randomly, they either won’t work, or will have unintended side effects. Under this model, relative to doing nothing, it is net positive to improve our understanding of AI systems, e.g. via transparency tools, even if it means we build powerful AI systems sooner (which reduces the time we have to solve alignment).
This post presents a counterargument: while understanding helps us make _useful_ systems, it need not help us build _secure_ systems. We need security because that is the only way to get useful systems in the presence of powerful external optimization, and the whole point of AGI is to build systems that are more powerful optimizers than we are. If you take an already-useful AI system and you “make it more powerful”, this increases the intelligence of both the useful parts and the adversarial parts. At that point, the main failure mode is that the adversarial parts “win”: you now have to be robust against adversaries, which is a security property, not a usefulness property.
Under this model, transparency work need not be helpful: if the transparency tools allow you to detect some kinds of bad cognition but not others, an adversary simply makes sure that all of its adversarial cognition is the kind you can’t detect. (Rohin’s note: or, if you use your transparency tools during training, you are selecting for models whose adversarial cognition is the kind you can’t detect.) In that case, transparency tools could increase understanding and shorten the time to powerful AI systems, _without_ improving security.
Planned opinion:
I certainly agree that in the presence of powerful adversarial optimizers, you need security to get your system to do what you want. However, we can just not build powerful adversarial optimizers. My preferred solution is to make sure our AI systems are trying to do what we want, so that they never become adversarial in the first place. But if for some reason we can’t do that, then we could make sure AI systems don’t become too powerful, or not build them at all. It seems very weird to instead say “well, the AI system is going to be adversarial and way more powerful, let’s figure out how to make it secure”; that should be the last approach, tried only if none of the others work out. (More details in this comment.) Note that MIRI doesn’t aim for security because they expect powerful adversarial optimization; rather, they aim for security because _any_ optimization <@leads to extreme outcomes@>(@Optimization Amplifies@). (More details in this comment.)
(If you want to comment on my opinion, please do so as a reply to the other comment I made.)
ETA: Added a sentence about MIRI’s beliefs to the opinion.
Oh my, I never expected to be in the newsletter for writing an object-level post about alignment. How exciting.