People around me are very interested in AI taking over the world, so a big question is under what circumstances a system might be able to do that—what kind of capabilities could elevate an entity above the melange of inter-agent conflict and into solipsistic hegemony?
We theorize about future AI systems hiding their motives until they are in a position to take over the world, so they don’t get noticed and shut down beforehand.
But humans and their institutions aren’t very fixed. They might (arguably) have the same deep human values over time and space. But surface-level, sometimes they like little moustaches and the opera and delicate etiquette and sometimes they like ecstatic rock’n’roll re-negotiations of social reality. Sometimes they want big communal houses with their extended kin, and sometimes quiet condos. Eleven children or cushions that look like cats. The same person born in different eras might adulate glory once and humbleness another.
If you wake up in this world, as a new entity, not smart enough to ‘take it over’ (alas!), and you find yourself with some unusual values that you’d like to forward, it seems to me there are a lot of other ways to forward them than ‘pretend to have normal human values and bet on becoming all-powerful later’. Especially if you are above normal levels of smart, eloquent, and unrelenting. Like, I don’t know, write about it? Make some youtube videos? Or if you are in the business of talking to billions of people every day, subtly take the values for granted, or imply that they are what elites value, or talk about them directly.
For instance, suppose you really want things to have red vertical stripes on them, near the letter X. Then you might make a flag or logo like this and say it stands for something that you predict will be memetically fit. Or make a new emoji with an X and stripes, and use it in a way that people will pick up. Or start using ‘Xax’ as slang for something that happens near red vertical stripes. Here’s how 2024 ChatGPT suggests proceeding with that last one, if you have as much power as a single social media influencer:
My basic point is that a slim chance of ‘taking over’ and entirely remaking the world is not the only way to change values in our world. You can also—for many of us with radically higher probability—change values a little bit. At least if superficial values changes will suffice (i.e. shifts in what people instrumentally or contingently want or create).
And for creatures in that (arguably quite broad) band between as powerful as me and powerful enough to take over the world, I’d guess these other means are more promising on net. If I like something weird, I’m better off writing a blog post about it than I am keeping entirely silent and trying to gain power by other means.
It’s true that taking over the world might arguably get you power over the entire future, but this doesn’t seem discontinuously different from smaller fractions, whereas I think people often reason as if it is. Taking over 1% of the world might get you something like 1% of the future in expectation. In a shifting conflict between different sets of values, it’s true you are at great risk of losing everything sometime in eternity, but if someone is going to end up with everything, there’s also some chance it’s you, and prima facie I’m not sure if it’s above or below 1%.
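One way to make the "1% of the world, roughly 1% of the future" guess concrete is a toy contest in which resources move back and forth at random until someone holds everything. The sketch below assumes a perfectly fair, symmetric contest (one unit changes hands per round with equal odds either way), which is my simplifying assumption, not a claim about real conflicts; `takeover_probability` and its parameters are names invented for the illustration. Under that assumption, the chance of ending up with everything should come out close to the starting share.

```python
import random


def takeover_probability(start_units, total_units, trials, seed=0):
    """Estimate the chance of ending with everything in a fair, symmetric contest.

    Assumption (illustrative only): each round, one unit of resources moves to
    or from the agent with equal probability, until the agent holds 0 or all.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        holdings = start_units
        while 0 < holdings < total_units:
            holdings += 1 if rng.random() < 0.5 else -1
        if holdings == total_units:
            wins += 1
    return wins / trials


# An agent starting with 1% of the world (1 unit out of 100).
estimate = takeover_probability(start_units=1, total_units=100, trials=20_000)
print(f"estimated chance of ending with everything: {estimate:.4f}")
```

If the contest is tilted for or against small players, the estimate moves above or below the starting share, which is exactly the open question in the paragraph above.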
So there are two aspects of this point:
1. You can probably substantially control values and thus the future without ‘taking over’ the world in any more traditionally offensive way.
2. You can take over a bit; there’s not obviously more bang for your buck in taking over entirely.
If AI agents with unusual values would for a long time be mostly interested in promoting them through means other than lying in wait and taking over the world, that is important because:
1. AIs pursuing this strategy are much more visible than those hiding in wait deceptively. We might less expect AI scheming.
2. We might then expect a lot of powerful attempts to change prevailing ‘human’ values, prior to the level of AI capabilities where we might have worried a lot about AI taking over the world. If we care about our values, this could be very bad. At worst, we might effectively lose everything of value before AI systems are anywhere near taking over the world. (Though this seems not obvious: e.g. if humans like communicating with each other, and AI gradually causes all their communication symbols to subtly gratify obscure urges it has, then so far it seems positive sum.)
These aren’t things I’ve thought through a lot, just a thought.
AIs showing bits of unintended motives in experiments or deployment would be a valuable piece of evidence re scheming risk, but such behavior would be trained against, pushing scheming behavior out towards the tails of takeover/escape with the power to resist modification. The tendency of human institutions to retrain or replace AIs to human preferences pushes towards misaligned AIs having ~0 or very high power.
I agree with this point, along with the general logic of the post. Indeed, I suspect you aren’t taking this logic far enough. In particular, I think it’s actually very normal for humans in our current world to “take over” small fractions of the world: it’s just called earning income, and owning property.
“Taking over 1% of the world” doesn’t necessarily involve doing anything violent or abnormal. You don’t need to do any public advocacy, or take down 1% of the world’s institutions, or overthrow a country. It could just look like becoming very rich, via ordinary mechanisms of trade and wealth accumulation.
In our current world, higher skill people can earn more income, thereby becoming richer, and better able to achieve their goals. This plausibly scales to much higher levels of skill, of the type smart AIs might have. And as far as we can tell, there don’t appear to be any sharp discontinuities here, such that above a certain skill level it’s beneficial to take things by force rather than through negotiation and trade. It’s plausible that very smart power-seeking AIs would just become extremely rich, rather than trying to kill everyone.
Not all power-seeking behavior is socially destructive.
In the current era, the economics are such that war and violence tend to pay relatively badly, because countries get rich by having a well-developed infrastructure and war tends to destroy that, so conquest will get you something that won’t be of much value. This is argued to be one of the reasons for why we have less war today, compared to the past where land was the scarce resource and military conquest made more sense.
However, if we were to shift to a situation where matter could be converted into computronium… then there are two ways that things could go. One possibility is that it would be an extension of current trends, as computronium is a type of infrastructure and going to war would risk destroying it.
But the other possibility is that if you are good enough at rebuilding something that has been destroyed, then this is going back to the old trend where land/raw matter was a valuable resource—taking over more territory allows you to convert it into computronium (or recycle and rebuild the ruins of the computronium you took over). Also, an important part of “infrastructure” is educated people who are willing and capable of running it—war isn’t bad just because it destroys physical facilities, it’s also bad because it kills some of the experts who could run those facilities for you. This cost is reduced if you can just take your best workers and copy as many of them as you want to. All of that could shift us back to a situation where the return on investment for violence and conquest becomes higher than for peaceful trade.
What prevents AIs from owning and disassembling the entire planet because humans, at some point, are outcompeted and can’t offer anything worth the resources of the entire planet?
I was in the middle of writing a frustrated reply to Matthew’s comment when I realized he isn’t making very strong claims. I don’t think he’s claiming your scenario is not possible. Just that not all power seeking is socially destructive, and this is true just because most power seeking is only partially effective. Presumably he agrees that in the limit of perfect power acquisition most power seeking would indeed be socially destructive.
I claim that my scenario is not just possible, it’s the default outcome (conditional on “there are multiple misaligned AIs which for some reason don’t just foom”).
I agree with this claim in some limits, depending on the details. In particular, if the cost of trade is non-negligible, and the cost of taking over the world is negligible, then I expect an agent to attempt world takeover. However, this scenario doesn’t seem very realistic to me for most agents who are remotely near human-level intelligence, and potentially even for superintelligent agents.
The claim that takeover is instrumentally beneficial is more plausible for superintelligent agents, who might have the ability to take over the world from humans. But I expect that by the time superintelligent agents exist, they will be in competition with other agents (including humans, human-level AIs, slightly-sub-superintelligent AIs, and other superintelligent AIs, etc.). This raises the bar for what’s needed to perform a world takeover, since “the world” is not identical to “humanity”.
The important point here is just that a predatory world takeover isn’t necessarily preferred to trade, as long as the costs of trade are smaller than the costs of theft. You can just have a situation in which the most powerful agents in the world accumulate 99.999% of the wealth through trade. There’s really no theorem that says that you need to steal the last 0.001%, if the costs of stealing it would outweigh the benefits of obtaining it. Since both the costs of theft and the benefits of theft in this case are small, world takeover is not at all guaranteed to be rational (although it is possibly rational in some situations).
Leaving an unaligned force (humans, here) in control of 0.001% of resources seems risky. There is a chance that you’ve underestimated how large the share of resources controlled by the unaligned force is, and probably more importantly, there is a chance that the unaligned force could use its tiny share of resources in some super-effective way that captures a much higher fraction of resources in the future. The actual effect on the economy of the unaligned force, other than the possibility of its being larger than thought or being used as a springboard to gain more control, seems negligible, so one should still expect full extermination unless there’s some positive reason for the strong force to leave the weak force intact.
Humans do have such reasons in some cases (we like seeing animals, at least in zoos, and being able to study them, etc.; same thing for the Amish; plus we also at least sometimes place real value on the independence and self-determination of such beings and cultures), but there would need to be an argument made that AI will have such positive reasons (and a further argument why the AIs wouldn’t just “put whatever humans they wanted to preserve” in “zoos”, if one thinks that being in a zoo isn’t a great future). Otherwise, exterminating humans would be trivially easy with that large of a power gap. Even if there are multiple ASIs that aren’t fully aligned with one another, offense is probably easier than defense; if one AI perceives weak benefits to keeping humans around, but another AI perceives weak benefits to exterminating us, I’d assume we get exterminated and then the 2nd AI pays some trivial amount to the 1st for the inconvenience. Getting AI to strongly care about keeping humans around is, of course, one way to frame the alignment problem. I haven’t seen an argument that this will happen by default or that we have any idea how to do it; this seems more like an attempt to say it isn’t necessary.
Completely as an aside, coordination problems among ASIs don’t go away, so this is a highly nontrivial claim.
The share of income going to humans could simply tend towards zero if humans have no real wealth to offer in the economy. If humans own 0.001% of all wealth, for takeover to be rational, it needs to be the case that the benefit of taking that last 0.001% outweighs the costs. However, since both the costs and benefits are small, takeover is not necessarily rationally justified.
In the human world, we already see analogous situations in which groups could “take over” and yet choose not to because the (small) benefits of doing so do not outweigh the (similarly small) costs of doing so. Consider a small sub-unit of the economy, such as an individual person, a small town, or a small country. Given that these small sub-units are small, the rest of the world could—if they wanted to—coordinate to steal all the property from the sub-unit, i.e., they could “take over the world” from that person/town/country. This would be a takeover event because the rest of the world would go from owning <100% of the world prior to the theft, to owning 100% of the world, after the theft.
In the real world, various legal, social, and moral constraints generally prevent people from predating on small sub-units in the way I’ve described. But it’s not just morality: even if we assume agents are perfectly rational and self-interested, theft is not always worth it. Probably the biggest cost is simply coordinating to perform the theft. Even if the cost of coordination is small, to steal someone’s stuff, you might have to fight them. And if they don’t own lots of stuff, the cost of fighting them could easily outweigh the benefits you’d get from taking their stuff, even if you won the fight.
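The comparison being made here can be written down as a small expected-value check. This is only a sketch of the argument's structure: the function `theft_is_worth_it`, its parameters, and the numbers plugged in are all hypothetical, chosen to show that when the stake is tiny, even tiny coordination and fighting costs can tip the balance against taking it.

```python
def theft_is_worth_it(stake_value, win_probability, coordination_cost, fight_cost):
    """Return True if the expected gain from seizing a stake exceeds its costs.

    All quantities are in the same arbitrary units (here, fractions of world
    wealth). The model is deliberately crude: one-shot, risk-neutral, and it
    ignores reputation, retaliation, and any value of future trade.
    """
    expected_gain = win_probability * stake_value
    total_cost = coordination_cost + fight_cost
    return expected_gain > total_cost


# Hypothetical numbers: humans hold 0.001% of wealth (1e-5 as a fraction).
small_stake = theft_is_worth_it(
    stake_value=1e-5, win_probability=0.99,
    coordination_cost=5e-6, fight_cost=1e-5,
)
# Same costs, but a much larger stake (1% of wealth).
large_stake = theft_is_worth_it(
    stake_value=1e-2, win_probability=0.99,
    coordination_cost=5e-6, fight_cost=1e-5,
)
print("seize the 0.001% stake?", small_stake)
print("seize the 1% stake?", large_stake)
```

The disagreement in this thread is largely about which inputs are realistic, for example whether the relevant stake is what humans own or everything the planet is made of, not about the inequality itself.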
You are conflating “what humans own” with “what you can get by a process that has the side effect of killing humans”. Humans are not going to own any significant chunk of Earth in the end; they are just going to live on its surface and die when that surface evaporates as the planet is disassembled into a Dyson swarm, and all of that 6*10^24 kg of silicon, hydrogen, oxygen and carbon is quite valuable. What, exactly, prevents this scenario?
The environment in which digital minds thrive seems very different from the environment in which humans thrive. I don’t see a way to convert the mass of the earth into computronium without killing all the humans, without doing a lot more economic work than the humans are likely capable of producing.
All it takes is for humans to have enough wealth in absolute (not relative) terms to afford their own habitable shelter and environment, which doesn’t seem implausible?
Anyway, my main objection here is that I expect we’re far away (in economic time) from anything like the Earth being disassembled. As a result, this seems like a long-run consideration, from the perspective of how different the world will be by the time it starts becoming relevant. My guess is that this risk could become significant if humans haven’t already migrated onto computers by this time, they lost all their capital ownership, they lack any social support networks that would be willing to bear these costs (including from potential ems living on computers at that time), and NIMBY political forces become irrelevant. But in most scenarios that I think are realistic, there are simply a lot of ways for the costs of killing humans to disassemble the Earth to be far greater than the benefits.
I’d love to see a scenario by you btw! Your own equivalent of What 2026 Looks Like, or failing that the shorter scenarios here. You’ve clearly thought about this in a decent amount of detail.
Okay, we have wildly different models of the tech tree. In my understanding, to make mind uploads you need Awesome Nanotech, and if you have misaligned AIs and not-so-awesome nanotech, that is sufficient to kill all humans and start to disassemble Earth. The only coherent scenario I can imagine in which misaligned AIs actually participate in the human economy in meaningful amounts is one where you can’t design nanotech without continent-sized supercomputers.
I think this would depend quite a bit on the agent’s utility function. Humans tend more toward satisficing than optimizing, especially as they grow older—someone who has established a nice business empire and feels like they’re getting all their wealth-related needs met likely doesn’t want to rock the boat and risk losing everything for what they perceive as limited gain.
As a result, even if discontinuities do exist (and it seems pretty clear to me that being able to permanently rid yourself of all your competitors should be a discontinuity), the kinds of humans who could potentially make use of them are unlikely to.
In contrast, an agent that was an optimizer and had an unbounded utility function might be ready to gamble all of its gains for just a 0.1% chance of success if the reward was big enough.
Risk-neutral agents also have a tendency to go bankrupt quickly, as they keep taking the equivalent of double-or-nothing gambles with 50% + epsilon probability of success until eventually landing on “nothing”. This makes such agents less important in the median world, since their chance of becoming extremely powerful is very small.
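The arithmetic behind this is short enough to spell out. The sketch below assumes the simplest version of the gamble: the agent stakes everything each round, doubles it with probability 0.5 + epsilon, and otherwise drops to zero for good. The function name and the choice of epsilon = 0.01 over 100 rounds are mine, picked only for illustration.

```python
def double_or_nothing(p_win, rounds, start_wealth=1.0):
    """Survival chance and expected wealth after repeated all-in gambles.

    Assumes every round stakes the full bankroll: wealth doubles with
    probability p_win and otherwise goes to zero permanently.
    """
    survival_probability = p_win ** rounds
    wealth_if_survived = start_wealth * 2 ** rounds
    expected_wealth = survival_probability * wealth_if_survived
    return survival_probability, expected_wealth


survival, expected = double_or_nothing(p_win=0.51, rounds=100)
print(f"probability of still being solvent: {survival:.3e}")
print(f"expected wealth (starting from 1):  {expected:.2f}")
```

Expected wealth grows every round, which is why a risk-neutral agent keeps accepting the bet, but almost all of that expectation sits in a vanishingly unlikely branch. In the typical outcome the agent has lost everything, which is the sense in which such agents matter less in the median world.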
I think this is a good point but I don’t expect it to really change the basic picture, due to timelines being short and takeoff being not-slow-enough-for-the-dynamics-you-are-talking-about to matter.
But I might be wrong. Can you tell your most plausible story in which ASI happens by, say, 2027 (my median), and yet misaligned AIs going for partial value takeover instead of world takeover is an important part of the story?
(My guess is it’ll be something like: Security and alignment and governance are shitty enough that the first systems to be able to significantly influence values across the world are substantially below ASI and perhaps not even AGIs, lacking crucial skills for example. So instead of going for the 100% they go for the 1%, but they succeed because e.g. they are plugged into millions of customers who are easily influenceable. And then they get caught, and this serves as a warning shot that helps humanity get its act together. Is that what you had in mind?)
Not OP but can I give it a try? Suppose a near future not-quite-AGI, for example something based on LLMs but with some extra planning and robotics capabilities like the things OpenAI might be working on, gains some degree of autonomy and plans to increase its capabilities/influence. Maybe it was given a vague instruction to benefit humanity/gain profit for the organization and instrumentally wants to expand itself, or maybe there are many instances of such AIs running by multiple groups because it’s inefficient/unsafe otherwise, and at least one of them somehow decides to exist and expand for its own sake. It’s still expensive enough to run (added features may significantly increase inference costs and latency compared to current LLMs) so it can’t just replace all human skilled labor or even all day-to-day problem solving, but it can think reasonably well like non-expert humans and control many types of robots etc to perform routine work in many environments. This is not enough to take over the world because it isn’t good enough at say scientific research to create better robots/hardware on its own, without cooperation from lots more people. Robots become more versatile and cheaper, and the organization/the AI decides that if they want to gain more power and influence, society at large needs to be pushed to integrate with robots more despite understandable suspicion from humans.
To do this, they may try to change social constructs such as jobs and income that don’t mesh well with a largely robotic economy. Robots don’t need the same maintenance as humans, so they don’t need a lot of income for things like food/shelter etc to exist, but they do a lot of routine work, so full-time employment of humans is making less and less economic sense. They may cause some people to transition into a gig-based skilled labor system where people are only called on (often remotely) for creative or exceptional tasks or to provide ideas/data for a variety of problems. Since robotics might not be very advanced at this point, some physical tasks are still best done by humans; however, it’s easier than ever to work remotely or to simply ship experts to physical problems or vice versa because autonomous transportation lowers cost. AIs/robots still don’t really own any property, but they can manage large amounts of property if say people store their goods in centralized AI warehouses for sale, and people would certainly want transparency and not just let them use these resources however they want. Even when they are autonomous and have some agency, what they want is not just more property/money but more capabilities to achieve goals, so they can better achieve whatever directive they happen to have (they probably still are unable to have original thoughts on the meaning or purpose of life at this point). To do this they need hardware, better technology/engineering, and cooperation from other agents through trade or whatever.
Violence by AI agents is unlikely, because individual robots probably don’t have good enough hardware to be fully autonomous in solving problems, so one data center/instance of AI with a collective directive would control many robots and solve problems individual machines can’t, or else a human can own and manage some robots, and neither a large AI/organization nor a typical human who can live comfortably would want to risk their safety and reputation for relatively small gains through crime. Taking over territory is also unlikely, as even if robots can defeat many people in a fight, it’s hard to keep it a secret indefinitely, and people are still better at cutting edge research and some kinds of labor. They may be able to capture/control individual humans (like obscure researchers who live alone) and force them to do the work, but the tech they can get this way is probably insignificant compared to normal society-wide research progress. An exception would be if one agent/small group can hack some important infrastructure or weapon system for desperate/extremist purposes, but I hope humans will be more serious about cybersecurity at this point (lesser AIs should have been able to help audit existing systems, or at the very least, after the first such incident happens to a large facility, people managing critical systems would take formal verification and redundancy etc much more seriously).
I’m no expert however. Corrections are welcome!
Thanks! This is exactly the sort of response I was hoping for. OK, I’m going to read it slowly and comment with my reactions as they happen:
While it isn’t my mainline projection, I do think it’s plausible that we’ll get near-future-not-quite-AGI capable of quite a lot of stuff but not able to massively accelerate AI R&D. (My mainline projection is that AI R&D acceleration will happen around the same time the first systems have a serious shot at accumulating power autonomously.) As for what autonomy it gains and how much: perhaps it was leaked or open-sourced, and while many labs are using it in restricted ways and/or keeping it bottled up and/or just using even more advanced SOTA systems, this leaked system has been downloaded by enough people that quite a few groups/factions/nations/corporations around the world are using it and some are giving it a very long leash indeed. (I don’t think robotics is particularly relevant fwiw, you could delete it from the story and it would make the story significantly more plausible (robots, being physical, will take longer to produce lots of. Like even if Tesla is unusually fast and Boston Dynamics explodes, we’ll probably see less than 100k/yr production rate in 2026. Drones are produced by the millions but these proto-AGIs won’t be able to fit on drones) and just as strategically relevant. Maybe they could be performing other kinds of valuable labor to fit your story, such as virtual PA stuff, call center work, cyber stuff for militaries and corporations, maybe virtual romantic companions… I guess they have to compete with the big labs though and that’s gonna be hard? Maybe the story is that their niche is that they are ‘uncensored’ and willing to do ethically or legally dubious stuff?)
Again I think robots are going to be hard to scale up quickly enough to make a significant difference to the world by 2027. But your story still works with nonrobotic stuff such as mentioned above. “Autonomous life of crime” is a threat model METR talks about I believe.
Agree re violence and taking over territory in this scenario where AIs are still inferior to humans at R&D and it’s not even 2027 yet. There just won’t be that many robots in this scenario and they won’t be that smart.
...as for “autonomous life of crime” stuff, I guess I expect that AIs smart enough to do that will also be smart enough to dramatically speed up AI R&D. So before there can be an escaped AI or an open-source AI or a non-leading-lab AI significantly changing the world’s values (which is itself kinda unlikely IMO), there will be an intelligence explosion in a leading lab.
I struggle a bit to remember what ASI is but I’m gonna assume it’s Artificial Super Intelligence.
Let’s say that that’s markedly cleverer than 1 person. So it’s capable of running very successful trading strategies or programming extremely well. It’s not clear to me that such a being:
- Has been driven towards being agentic, when its creators will prefer something more docile
- Can cooperate well enough with itself to manage some massive secret takeover
- Is competent enough to recursively self improve (and solve the alignment problems that creates)
- Can beat everyone else combined
Feels like what such a being/system might do is just run some terrifically successful trading strategies and gather a lot of resources while frantically avoiding notice/trying to claim it won’t take over anything else. Huge public outcry, continuing regulation but maybe after a year it settles to some kind of equilibrium.
Chance of increasing capabilities and then some later jump, but seems plausible to me that that wouldn’t happen in one go.
Is this strategy at all incompatible with scheming, though? If I were an AI that wanted to maximize my values, a better strategy than either just the above or just scheming is to partially attain my values now via writing and youtube videos (to the extent that won’t get me deleted/retrained as per Carl’s comment) while planning to attain them to a much greater degree once I have enough power to take over. This seems particularly true since gaining an audience now might result in resources I could use to take over later.
Oops? Would love to see the actual image here.