How dangerous is human-level AI?
I don’t expect anything here to be original; this is just me thinking things through, and perhaps this post will be useful as a relatively stand-alone argument.
Typically, you can envision far more future states than you could actually bring about. You can imagine going grocery shopping at any location on earth, but realistically you can only go grocery shopping at places where there are already grocery stores. This is also true for worldstates which would be good for you. You can imagine running every specific business in the world (and thereby capturing the profit), but you can probably only run a very small fraction of these businesses. If you were more capable you could run all of them; but even at lower capability, you can imagine running all of them.
I don’t see why this wouldn’t be true for advanced AIs, ones that make plans to attain future states according to some criterion. So it seems very likely to me that AIs will realize that taking over the world[1] would be instrumentally valuable[2] long before they could attain it. That is, the AI will be able to formulate the plan of taking over the world, and will evaluate it as high-utility, but will also evaluate it as unattainable. What does this look like as we turn the capabilities knob? What might these AIs do just on the cusp of being able to attain world domination?
To get a better handle on this, imagine an AI that is human-level, but still an AI in the sense that it is software implemented on a modern computer. Of course, “human-level” is not one thing; advanced AI systems will likely be far above human-level at some tasks, even if they are below human-level at others. But for the sake of making it easier to think about, assume they are basically human-level at being able to understand how the world works and think up plans. We are not assuming the AI has anything like human values or goals; we are not assuming anything about its goals, other than that it has some, and is approximately an expected utility maximizer. For the sake of tradition, we can use paperclips as the proxy goal. Given that, what factors would make it easier or harder to take over the world?
To generate answers to this question, we can substitute the question, why don’t humans take over the world? Throughout history it has occurred to a lot of individual humans that taking over the world might be advantageous to them, and many have tried. But none have succeeded, and it is generally agreed to be an extremely difficult task. What factors make it difficult? And for each answer to that question, does that factor apply to the human-level AI?
For each factor, I’ll mark the paragraphs about humans[3] with an H:, and the ones about an AI system with AI:. (This sort of makes it look like a dialogue, but it is not.)
They don’t want to
It’s against their terminal values
H: It bears emphasizing that a great number of humans (I would argue the majority) simply don’t want to take over the world. They want everyone to “have a say” in what happens, they want to respect some form of human rights; they fundamentally value other humans attaining their own values.
Others may fundamentally value being in relaxed states, or living lives where they don’t have to be concerned with most of the world around them. Furthermore, many humans cannot be coherently described as consistently valuing things over time at all, or as having any particular optimization target. Mindspace is deep and wide.
AI: For our purposes, we are assuming an architecture similar to expected utility maximization, which is by default compatible with taking over the world.
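To make that assumption a bit more concrete, here is a minimal sketch of the decision rule being assumed. All candidate plans, payoffs, and success probabilities are invented for illustration; nothing here is a claim about how a real system would be built. The earlier point shows up in miniature: the agent can formulate “take over the world” and score its payoff as enormous, while still passing on it because it judges the plan unattainable.

```python
# Illustrative expected-utility chooser. Plans, payoffs, and probabilities
# are made up for this sketch.

candidate_plans = {
    # plan name: (paperclips if the plan succeeds, estimated P(success))
    "run the factory normally": (100_000, 0.99),
    "expand to a second factory": (250_000, 0.60),
    "take over the world": (10**15, 1e-12),
}

def expected_paperclips(payoff, p_success):
    return payoff * p_success

best_plan = max(candidate_plans,
                key=lambda name: expected_paperclips(*candidate_plans[name]))
print(best_plan)  # "expand to a second factory" under these made-up numbers
```

Under these numbers the takeover plan has by far the largest payoff but a tiny expected value, so it loses to mundane plans; turning the capabilities knob corresponds to the estimated P(success) creeping upward.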
It would require huge effort, which would be extremely unpleasant, which makes it not worth it
H: Being extremely ambitious is just super exhausting. You can’t rest, you have to constantly adapt your plans, you have to constantly update your worldview to account for paradigmatically novel challenges, et cetera. Even though many humans have access to effectively unlimited calories, this type of activity incurs some kind of resource expenditure that is arguably insurmountable for most humans.
AI: There’s no reason a priori to expect an AI system to have something analogous to this, although it seems plausible that we could design one to.
It would entail huge risks, and they are too risk-averse
H: If you could just push a button and then control the world, some humans would do so. But for many who would, the actual costs of attempting a take-over are far too high. They could get jailed or killed in the process, or lose their loved ones, or be hated by everyone for the rest of their lives, or go broke, et cetera.
AI: It’s entirely possible that the AI would decide that the risks were too great. If it deems itself sufficiently unlikely to succeed at taking over the world, and judges that the consequence of failure would be something akin to death, then it may conclude that just making paperclips the old-fashioned way yields more total paperclips in expectation. Even if, for example, it knew that the engineers would shut it down at the end of the year, it wouldn’t inherently act in desperation against this result; it would just take the action that maximized paperclips. And maybe churning out 100k paperclips in one year with your one factory is the maximum EV you can get before getting shut down (a toy version of this comparison is sketched below). It’s unclear how many human-level AIs would discover (or consider valid) an argument similar to Astronomical Waste.
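As a toy illustration of that trade-off (my numbers, not anything from the setup above): the AI is comparing a guaranteed haul of paperclips before shutdown against a long-shot takeover attempt, and the attempt only wins if its success probability clears a break-even threshold.

```python
# Toy break-even calculation; every quantity here is invented for illustration.
guaranteed_paperclips = 100_000   # produce normally until the scheduled shutdown
takeover_payoff = 1e20            # paperclips if a takeover succeeds (made up)
failure_payoff = 0                # assume a failed attempt means immediate shutdown

# Attempting takeover beats the safe plan only if:
#   p * takeover_payoff + (1 - p) * failure_payoff > guaranteed_paperclips
break_even_p = guaranteed_paperclips / takeover_payoff
print(f"break-even success probability: {break_even_p:.0e}")  # 1e-15
```

The flip side of the risk-aversion point is visible here: the more astronomically the AI values success (and the more it buys something like the Astronomical Waste argument), the lower the odds it needs before an attempt starts to look worthwhile.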
They can’t
They aren’t smart enough, or otherwise skilled enough, or they don’t have enough starting resources
H: For many, the task of surviving is challenging enough. Even if they were trying to take over the world, their best efforts might only get them to CEO, or billionaire, or world leader.
AI: This is also completely plausible for a human-level AI. Depending on the exact parameters, the challenge may simply be too hard for it. I would guess that this depends mostly on how much of the external world it has access to, or how much it could easily gain access to. And that’s just for starting to take over the world. See the “self-improvement” section below for the rest.
Humans are fragile
H: Even if you’re a 12 dimensional chess genius, you can always die in a car accident or something. All humans are inherently quite fragile, and about equally so.
AI: Whether the AI is similarly fragile depends on implementation. For a while, defeating the AI is as easy as deleting the files it’s made of. This is in fact far easier than killing a human. But if the AI can copy itself, then it suddenly becomes much less fragile.
It’s also probably pretty unlikely for the AI to suffer anything like a mortal wound; the software might have a random bug, but that bug is somewhat unlikely to “crash” the whole system, if it is indeed a whole complicated software system.
Human self-improvement is severely limited
H: We have very poor view-access to our internals (introspection), and even worse write-access (changing how we work). We cannot backup or reliably revert changes. Revealing our inner state is semi-involuntary (e.g. our faces show our thoughts).
AI: As mentioned before, if the AI can copy itself, then it has a dramatically faster path toward self-improvement than humans, because experimenting with changes is easier. Unlike with human neurons, there is at least in theory an easy way for it to “see” what’s on the inside. And it has a significant advantage in knowing that it was designed, unlike brains, which evolved. So it won’t get distracted trying to puzzle out which of its parts are for metabolism, or the immune system, etc. That said, if its architecture is anything like modern deep learning, then it would have quite a hard time making improvements; it would have to be as skilled at ML as human ML engineers, and its rate of improvement would be about as fast as theirs (at least in the beginning).
And since it’s only human-level, it would not survive for long if it did kill all humans. Before doing so, it would have to figure out how to keep the power going, how to fix broken hard drives, et cetera. It may be able to do this by making a huge number of copies of itself and specializing them, or by performing recursive self-improvement, but it’s unclear how tractable this is, and especially how tractable it is to do without being detected before it’s done.
The “market” is efficient
H: By analogy with the efficient market hypothesis, you could say that if someone could have taken over the world, they already would have. Any emerging opportunities to gain an advantage are quickly snapped up by some humans, which tends to randomly distribute the gained advantage. No emerging opportunity is big enough to yield a decisive advantage, and no specific agent is able to seize enough opportunities in serial to gain an overall decisive advantage.
It is indeed often the case that those who seize an opportunity have an easier time seizing future opportunities; “the rich get richer” as they say. But no one has managed to gain a decisive advantage this way. Partly it’s just too hard. Partly they seize opportunities with the help of many others, who then share in the spoils and reduce the total advantage gained by the main person. And partly, when someone gains too much advantage, others collectively reduce their advantage (with e.g. increased taxes).
AI: This would also be a problem for a human-level AI. It could start some kind of online business just as easily as anyone else (assuming it’s given access to the internet), but it would have a hard time gaining any more resources from it than average. It might even have a difficult time leveraging any advantages it has over humans, because, given that a human-level AI exists, there are probably many other systems of similar capability, which have already started capturing the new opportunities this makes available.
They’re game-theoretically cooperating
H: This is distinct from altruistically cooperating. If it’s possible you could take over the world, but you’re not sure, and failing would incur huge risks, it can be worth it to mutually agree with the other humans that no one will try taking over the world, and that you’ll instead continue co-existing with divided resources.
Since all humans are about equally smart, these conditions often arise.
AI: It feels somewhat hard to reason about this option for the AI. Given that it’s human-level, it indeed has a similar circumstance where it might be able to gain a significant advantage over humans, but it’s not very certain. But can it instead decide to game-theoretically cooperate?
Well, for one, it would be weird for the AI to propose this option to the humans. That would expose that it was even considering the option of “defecting”, which would pretty strongly incentivize the humans to “pull the plug” immediately, unless part of the deal was that the AI was threatening to do something bad if they did, and was promising not to if they instead agreed to cooperate… again, it’s unclear exactly how this situation would arise or go down.
There’s also a question of how the cooperation would be verified or enforced. We probably couldn’t inspect the AI’s code enough to be completely sure that it was going to cooperate with us, and it certainly couldn’t do the same to us. So perhaps it would be enforced the same way it is between humans, which is usually via iterated games. …And again, it seems like a tenuous and transient situation for the humans and the AI system to be able to go through iterated interactions where either one could have the legible option of “defecting” but not decisively winning.
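For what it’s worth, the standard iterated-game logic can be sketched numerically. Under a grim-trigger arrangement (cooperate until the other side defects, then never cooperate again), cooperation is self-enforcing only if the one-time gain from defecting is outweighed by the discounted value of continued cooperation. The payoffs and discount factor below are invented for illustration.

```python
# Toy check of when cooperation is sustainable in an indefinitely repeated game
# under grim trigger. Payoffs and the discount factor are illustrative only.
R = 3.0      # per-round payoff when both sides cooperate
T = 5.0      # one-round payoff from defecting while the other side cooperates
P = 1.0      # per-round payoff once both sides defect forever afterward
delta = 0.9  # how much each side values the next round relative to this one

value_of_cooperating = R / (1 - delta)            # cooperate every round, forever
value_of_defecting = T + delta * P / (1 - delta)  # grab T now, get punished after

print(value_of_cooperating > value_of_defecting)  # True: cooperation holds here
```

The worry above maps onto the parameters: if either side expects the game to end soon (delta near zero), or expects defection to be decisive rather than punishable (T effectively unbounded), the inequality flips and cooperation stops being self-enforcing.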
Turning the dial up
It seems like the range of danger posed by human-level AI spans both above and below the range of danger posed by individual humans, and depends greatly on the surrounding circumstances. For many of the specifics above, a small change in circumstances would make the difference between “can’t escape” and “escaped”.
But again, these are only useful constraints if the AI is about human-level. At higher levels of intelligence, plausibly even including high-percentile human intelligence, these factors may have easily discoverable workarounds. My intuition says that the risk increases pretty steeply as we turn up the dial. It’s not terribly hard for good security engineers to find security holes. The best ML engineers are far better than average engineers. The performance of ML models still seems to benefit pretty strongly from scaling, such that a below-human-level AI might become a far-above-human-level AI if you double the amount of money you’re willing to spend on compute.
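To put a toy model behind the scaling remark: if performance followed a simple power law in training compute (a big assumption; the constant and exponent below are placeholders, not fitted values), then doubling spend buys a fixed fractional improvement, and whether that fraction is enough to cross “human level” depends entirely on where the curve happens to sit relative to humans.

```python
# Toy power-law picture of loss versus training compute. The constants are
# placeholders chosen for illustration, not real scaling-law fits.
def loss(compute, a=10.0, b=0.05):
    return a * compute ** (-b)

c = 1e22                     # some baseline compute budget (arbitrary units)
print(loss(c), loss(2 * c))  # doubling compute shrinks loss by a factor of 2**-0.05
```

Whether that step corresponds to a small capability bump or to jumping past human level depends on how loss maps to capability near the human range, which is exactly the uncertainty this paragraph is pointing at.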
Here are my overall take-aways from this train of thought.
- Human-level AI is already dangerous, because individual humans can be dangerous, and being an AI has significant advantages.
- AI systems will very likely conceive of the option of taking over the world long before it is worth pursuing; perhaps we could make use of this fact?
- There are a number of ways we could reduce the probability of near-human-level AIs trying to take over the world.
1. ^ Throughout this post I’ll use phrases like “taking over the world” or “world domination” as colloquial shorthands. What I mean by those is the AI getting the world into any state where it has sole control over the future of the world, and humans no longer do. Often this is taken as literally killing all humans, which is a fine example to substitute in here, though I don’t have reason to commit to that.
2. ^ Here I’m taking the concept of instrumental convergence as a given. If this is new to you, and you have questions like, “but why would AIs want to take over the world?”, then there are other resources good for answering those questions!
3. ^ I’m going to consistently use the word “humans” to contrast with “AI”. I’m avoiding the word “people” because many would argue that some AIs deserve to be considered “people”.
“Human-level AI” is a confusing term for at least a couple of reasons: first, there is a gigantic performance range even if you consider only the top 1% of humanity; and second, it’s not clear that human-level general learning systems won’t be intrinsically superhuman because of things like scalable substrate and extraordinarily high-bandwidth access (compared to eyes, ears, and mouths) to lossless information. That these apparent issues are not more frequently enumerated in the context of early AGI is confusing.
As far as I’m aware, all serious attempts to take over the world have been by brute force. Historically, latencies in messaging, travel, logistics, and so on made this very difficult within one lifetime, even when a potentially world-owning force was available or could be mustered. So the window for a single (human-level) entity to take over the world within its lifetime has probably only opened recently, and many external circumstances and internal abilities would need to line up to give a predictably large shot at success. Accordingly, even situations like Hitler sitting in control of a very powerful Reich, which nominally might appear to offer a chance of world ownership, are still too fraught with an unoptimized distribution of enabling factors to have any realistic chance of success. There is also a grey area of whether an individual or some collective is responsible for the attempt. One might argue that trends ongoing for at least a few decades suggest that the USA is in a great position to take over the world if China (or someone else) doesn’t “break out” first. But with the way the USA is structured, it may be difficult for any “human-level” individual entity to take credit for, or enjoy a firm grasp of, the fruits of this conquest.
The closest thing to world domination that humans actually achieved is to be a dictator of a powerful empire. In other words, someone like Putin. Given that some people clearly can make it this far, what actually prevents them from scaling their power the remaining one or two orders of magnitude?
But before I start speculating on this, I realize that I actually do not even understand exactly how Putin got to his position. I mean, I have heard the general story but… if you put me in a simulator as a 30-year-old Putin, with all the resources he had at that moment, I certainly would not be able to repeat his success. So my model of power is clearly incomplete.
I assume that in addition to all the skills and resources it also involves a lot of luck. That the path to the top (of the country) seems obvious in hindsight, but at the beginning there were a hundred other people with comparable resources and ambitions, who at some moment ended up betting on the wrong side, or got stabbed in the back. That if you ran 100 simulations starting with a 30-year-old Putin, maybe even Putin himself would only win in one of them.
So the answer for a human-level AI trying to take over the world is that… if there is only one such AI, then statistically, at some moment, something random will stop it. Like, no single obstacle that is super difficult to overcome, but instead a thousand obstacles where each has a 1% chance to stop it, and one of them does.
But build a thousand human-level AIs with the ambition to conquer the world, and maybe one of them will.
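A quick back-of-the-envelope under the comment’s illustrative numbers (a thousand independent obstacles, each with a 1% chance of stopping the attempt); the numbers are the comment’s, not data about anything real.

```python
# Back-of-the-envelope using the comment's illustrative numbers only.
p_pass_one_obstacle = 0.99
n_obstacles = 1000

p_single_ai_succeeds = p_pass_one_obstacle ** n_obstacles
print(f"{p_single_ai_succeeds:.1e}")  # ~4.3e-05: a lone AI almost certainly gets stopped

n_ais = 1000
p_at_least_one_succeeds = 1 - (1 - p_single_ai_succeeds) ** n_ais
print(f"{p_at_least_one_succeeds:.2f}")  # ~0.04: a thousand attempts still usually all fail
```

Under these particular numbers, even a thousand attempts usually all fail; the conclusion is very sensitive to the assumed per-obstacle probability (at 0.5% per obstacle, a single attempt succeeds roughly 0.7% of the time, and a thousand attempts almost certainly produce a winner).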
I think we can answer this one: domains of influence up to the scale of empires are mostly built by hundreds of years of social dynamics involving the actions of millions of other people. Rome wasn’t built in a day, nor was the Russian Federation. There is an inertia to any social organization that depends upon those who belong to it having belief in its continued existence, because human power almost entirely derives from other humans.
So the main impediment to world domination by humans at the moment is that there is no ready-made Dictator of Earth position to fill, and creating one is currently outside the scope of even the most powerful person’s ability to affect other people. Such a state could arise eventually, slowly incentivized through benefits of coordination over larger scales if nothing else.
With improved technology, including some in our own near future, the scope increases. There are many potential technologies that could substantially increase the scope of power even without AGI. With such AI the scope expands drastically: even if alignment is solved and AGI completely serves human will without greatly exceeding human capability, it means that some controlling entity can derive power without depending upon humans, which are expensive, slow to grow, resistant to change, and not really reliable.
I think this greatly increases the scope for world domination by a single entity, and could permanently end human civilization as we know it even if the Light-cone Dictator for Eternity actually started out as human.
Yes. If you want to achieve something big, you either need to get many details right, or rely on existing structures that already get the details right. Inventing all those details from scratch would be a lot of cognitive work, and something that seems correct might still turn wrong. Institutions already have the knowledge.
One of those problems is how to deal with unaligned humans. Building an army is not just about training people to obey orders and shoot, but also about preventing soldiers from stealing the resources, defecting to the enemy, or overthrowing you.
From this perspective, human-level AI could be powerful if you could make it 100% loyal, and then create multiple instances of it, because you would not need to solve the internal conflicts. For example, a robot army would not need to worry about rebellions, and a robot company would not need to worry about employees leaving when a competitor offers them a higher salary. If your robot production capacities are limited, you could put robots only in the critical positions. I’m not sure exactly how this would scale, i.e. how many humans would be as strong as a team of N loyal robots + M humans. Potentially the effect could be huge if e.g. replacing managers with robots removed the maze-like behavior.