Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

As an AI researcher who wants to do technical work that helps humanity, you feel a strong drive to find a research area that is definitely helpful somehow, so that you don’t have to worry about how your work will be applied, and thus don’t have to worry about things like corporate ethics or geopolitics in order to make sure your work benefits humanity.

Unfortunately, no such field exists. In particular, technical AI alignment is not such a field, and neither is technical AI safety. It absolutely matters where ideas land and how they are applied, and that is no less true when the existence of the entire human race is at stake.

If that’s obvious to you, this post is mostly just a collection of arguments for something you probably already realize. But if you think technical AI safety or technical AI alignment is somehow intrinsically or inevitably helpful to humanity, this post is an attempt to change your mind. In particular, with more and more AI governance problems cropping up, I’d like to see more AI technical staffers forming explicit social models of how their ideas are going to be applied.

As you read this post, please don’t try to read it as somehow pro- or contra- a specific area of AI research, or safety, or alignment, or corporations, or governments. My goal is to encourage more nuanced social models by de-conflating a bunch of concepts. This might make it seem like I’m against the concepts themselves, when really I just want clearer thinking about them, so that we (humanity) can all do a better job of communicating and working together.

Myths vs reality

Epistemic status: these are claims I’m confident in, assembled over a decade and a half of observing existential risk discourse and thousands of hours of conversation. They are not claims I’m confident I can convince you of, but I’m giving it a shot anyway, because a lot is at stake when people don’t realize how their technical research is going to be misapplied.

Myth #1: Technical AI safety and/or alignment advances are intrinsically safe and helpful to humanity, irrespective of the state of humanity.

Reality: All technical advances in AI safety and/or “alignment” can be misused by humans. No technical advance in AI is safe per se; whether an idea is safe or unsafe is a function of the human environment in which it lands.

Examples:

  • Obedience — AI that obeys the intention of a human user can be asked to help build unsafe AGI, such as by serving as a coding assistant. (Note: this used to be considered extremely sci-fi, and now it’s standard practice.)

  • Interpretability — Tools or techniques for understanding the internals of AI models will help developers better understand what they’re building and hence speed up development, possibly exacerbating capabilities races.

  • Truthfulness — AI that is designed to convey true statements to a human can also be asked questions by that human to help them build an unsafe AGI.

Myth #2: There’s a {technical AI safety VS AI capabilities} dichotomy or spectrum of technical AI research, which also corresponds to {making humanity more safe VS shortening AI timelines}.

Reality: Conflating these concepts involves three separate problems, (a)–(c) below:

a) AI safety and alignment advances almost always shorten AI timelines.

In particular, the ability to «make an AI system do what you want» is used almost instantly by AI companies to help them ship AI products faster (because the AI does what users want) and to build internal developer tools faster (because the AI does what developers want).

(When I point this out, people usually think I’m somehow unhappy with how quickly AI products have been released. On the contrary, I’ve been quite happy with how quickly OpenAI brought GPT-4 to the public, thereby helping the human public better come to grips with the reality of ongoing and forthcoming AI advances. I might be wrong about this, though, and it’s not load-bearing for this post. At the very least, I’m not happy about Altman’s rush to build a $7TN compute cluster, nor with OpenAI’s governance issues.)

b) Per the reality of Myth #1 explained above, technical AI safety advances sometimes make humanity less safe.

c) Finally, {making humanity more safe VS shortening AGI timelines} is itself a false dichotomy or false spectrum.

Why? Because in some situations, shortening AGI timelines could make humanity more safe, such as by avoiding an overhang of over-abundant computing resources that AGI could abruptly take advantage of if it’s invented too far in the future (the “compute overhang” argument).

What to make of all this

The above points could feel quite morally disorienting, leaving you with a feeling something like: “What is even good, though?”

This disorientation is especially likely if you were on the hunt for a simple and reassuring view on which a certain area of technical AI research could be easily verified as safe or helpful to humanity. Even if I’ve made clear arguments here, the resulting feeling of moral disorientation might make you want to reject or bounce off this post or the reasoning within it. It feels bad to be disoriented, so it’s more comfortable to go back to a simpler, more oriented worldview of what kind of AI research is “the good kind”.

Unfortunately, the real world is a complex sociotechnical system that’s confusing, not only because of its complexity, but also because the world can sometimes model you and willfully misuse you, your ideas, or your ambitions. Moreover, I have no panacea to offer for avoiding this. I would have liked to write a post that offers one weird trick to avoid being confused by which areas of AI are more or less safe to advance, but I can’t write that post. As far as I know, the answer is simply that you have to model the social landscape around you and how your research contributions are going to be applied.

In other words, it matters who receives your ideas and what they choose to do with them, even when your ideas are technical advances in AI safety or “alignment”. And if you want to make sure your ideas land in a way that helps rather than harms humanity, you just have to think through how the humans are actually going to use your ideas. To do a good job of that, you have to carefully think through arguments and the meanings of words (“alignment”, “safety”, “capabilities”, etc.) rather than conflating concepts that are load-bearing for steering the future of AI.

Avoiding such conflations is especially hard because forming a large alliance often involves convincing people to conflate a bunch of concepts they care about, as a way of recruiting them. In other words, you should in general expect to see large alliances of people trying to convince you to conflate value-laden concepts (e.g., “technical safety”, “alignment”, “security”, “existential safety”) in order to recruit you to join them (i.e., conflationary alliances).

Recap of key points

  • Social / human factors are crucial to whether any given technical advance is safe or good for humanity.

  • Technical AI safety is not always safe for humanity.

  • All technical advances in AI safety and/or “alignment” can be misused by human users and developers.

  • AI safety and alignment advances almost always shorten AI timelines, by helping companies ship products and build internal developer tools faster.

  • Some ways of shortening AGI timelines can make humanity more safe.

  • There are powerful social forces and/or selection pressures to conflate important concepts in AI (e.g., “technical safety”, “alignment”, “security”, “existential safety”), so as to build powerful big-tent alliances around the resulting conflated concepts (i.e., conflationary alliances). Thus, it may take an active effort not to lose track of distinct concepts when people and institutions around you are predictably trying to conflate them.
