Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours

Vitalik Buterin wrote an impactful blog post, My techno-optimism. I found this discussion of one aspect of it on 80,000 Hours much more interesting. The remainder of that interview is nicely covered in the host’s EA Forum post.

My techno-optimism apparently appealed to both sides, e/acc and doomers. Buterin’s approach to bridging that polarization was interesting. I hadn’t understood before the extent to which anti-AI-regulation sentiment is driven by fear of centralized power. I hadn’t thought much about this risk since it didn’t seem relevant to AGI risk, but I’ve been updating toward thinking it’s highly relevant.

[this is automated transcription that’s inaccurate and comically accurate by turns :)]

Rob Wiblin (the host) (starting at 20:49):

what is it about the way that you put the reasons to worry that that ensured that kind of everyone could get behind it

Vitalik Buterin:

[...] in addition to taking you know the case that AI is going to kill everyone seriously I the other thing that I do is I take the case that you know AI is going to take create a totalitarian World Government seriously [...]

[...] then it’s just going to go and kill everyone but on the other hand if you like take some of these uh you know like very naive default solutions to just say like hey you know let’s create a powerful org and let’s like put all the power into the org then yeah you know you are creating the most like most powerful big brother from which There Is No Escape and which has you know control over the Earth and and the expanding light cone and you can’t get out right and yeah I mean this is something that like uh I think a lot of people find very deeply scary I mean I find it deeply scary um it’s uh it is also something that I think realistically AI accelerates right

One simple takeaway is to recognize and address that motivation for anti-regulation and pro-AGI sentiment when trying to work with or around the e/acc movement. A second is the question of whether to take that fear seriously.

Is centralized power controlling AI/​AGI/​ASI a real risk?

Vitalik Buterin is from Russia, where centralized power has been terrifying. The same has been true for roughly half of the world. Those who are concerned with the risks of centralized power (including Western libertarians) worry that AI increases that risk if it’s centralized. This puts them in conflict with x-risk worriers on regulation and other issues.

I used to hold both of these beliefs, which allowed me to dismiss those fears:

  1. AGI/​ASI will be much more dangerous than tool AI, and it won’t be controlled by humans

  2. Centralized power is pretty safe (I’m from the West like most alignment thinkers).

Now I think both of these are highly questionable.

I’ve thought in the past that fears of tool AI are largely unfounded. The much larger risk is AGI, and that risk is even larger if AGI is decentralized/proliferated. But I’ve become progressively more convinced that governments will take control of AGI before it becomes ASI. They don’t need to build it; they just need to show up and inform the creators that, as a matter of national security, they’ll be making the key decisions about how it’s used and aligned.[1]

If you don’t trust Sam Altman to run the future, you probably don’t like the prospect of Putin or Xi Jinping as world-dictator-for-eternal-life. It’s hard to guess how many world leaders are sociopathic enough to have a negative empathy-sadism sum, but power does seem to select for sociopathy.

I’ve thought that humans won’t control ASI, because it’s value alignment or bust. There’s a common intuition that an AGI, being capable of autonomy, will have its own goals, for good or ill. I think it’s perfectly coherent for it to effectively have someone else’s goals; its “goal slot” is functionally a pointer to someone else’s goals. I’ve written about this in Instruction-following AGI is easier and more likely than value aligned AGI, and Max Harms has written about a very similar approach, in more depth and with more clarity and eloquence, in his CAST: Corrigibility As Singular Target sequence. I think this is also roughly what Christiano means by corrigibility. I’ll call this personal intent alignment until someone comes up with a better term.
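
To make the “pointer” metaphor concrete, here’s a minimal toy sketch (my own illustration, not anyone’s actual proposal; the names Principal and IntentAlignedAgent are hypothetical). The agent stores no fixed value function, only a reference to its principal, and re-reads the principal’s current intent each time it acts, so a “shut down” instruction always takes precedence:

```python
from dataclasses import dataclass

@dataclass
class Principal:
    """The human (or institution) whose intent the agent defers to."""
    current_instruction: str = "do nothing"

    def intent(self) -> str:
        # The principal can revise this at any time, including "shut down".
        return self.current_instruction

@dataclass
class IntentAlignedAgent:
    """Toy agent whose 'goal slot' holds a reference to a principal, not a fixed goal."""
    principal: Principal  # the goal slot: a pointer to someone else's goals

    def choose_action(self) -> str:
        goal = self.principal.intent()  # re-dereference the pointer at every step
        if goal == "shut down":
            return "halting"            # corrigibility: deference overrides other goals
        return f"pursuing: {goal}"

# The agent's behavior tracks whatever the principal currently wants.
alice = Principal("draft a safety report")
agent = IntentAlignedAgent(principal=alice)
print(agent.choose_action())  # pursuing: draft a safety report
alice.current_instruction = "shut down"
print(agent.choose_action())  # halting
```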

I now think that even if we solved value alignment, no one would implement that solution. People who are in charge of things (like AGI projects) like power. If they don’t like power enough, someone else will rapidly take it from them. The urge to have your nascent godling follow your instructions, not some questionable sum of everyone’s values, is bolstered by the (IMO strong) argument that following your instructions is safer than attempting value alignment. In a moderately slow takeoff, you have time to monitor and instruct its development, and you can instruct it to shut down if its understanding of other instructions is going off the rails (corrigibility).

It looks to me like personal intent alignment[2] (“corrigibility”) is both more tempting to AGI creators and an easier target to hit than value alignment. I wish that value alignment were the more viable option. But wishing won’t make it so. To the extent that’s correct, putting AGI into existing power structures is a huge risk even with technical alignment solved.

Centralized power is not guaranteed to keep going well, particularly with AGI added to the equation. AGI could ensure a dictator stays in power indefinitely.

This is a larger topic, but I think the risk of centralized power is this: those who most want power and who fight for it most viciously tend to get it. That’s a very bad selection effect. Fair democracy with good information about candidates can counteract this tendency to some extent, but that’s really hard. And AGI will entice some of the worst actors to try to get control of it. The payoff for a coup is suddenly even higher.

What can be done

Epistemic status: this is even farther removed from the podcast’s content; it’s just my brief take on the current strategic situation after updating from that podcast. I’ve thought about this a lot recently, but I’m sure there are more big updates to make.

This frightening logic leaves several paths to survival. One is to make personal intent aligned AGI and get it into the hands of a trustworthy-enough power structure. The second is to create a value-aligned AGI and release it as a sovereign, and hope we got its motivations exactly right on the first try. The third is to Shut It All Down, by arguing convincingly that the first two paths are unlikely to work, and convincing every human group capable of creating or preventing AGI work. None of these seems easy.[3]

As for which of these is least doomed, reasonable opinions vary widely. I’d really like to see the alignment community work together to identify cruxes, so we can present a united front to policy-makers instead of a buffet of expert opinions for them to choose from according to their biases.

Of these, getting personal intent aligned AGI into trustworthy hands seems least doomed to me. I continue to think that We have promising alignment plans with low taxes for the types of AGI that seem most likely to happen at this point. Existing critiques of those plans are not crippling, and the plans seem to bypass the most severe of the List of Lethalities. Further critiques might change my mind. However, those plans all work much better if they’re aimed at personal intent alignment rather than full value alignment with all of humanity.

It seems as though we’ve got a decent chance of getting that AGI into a trustworthy-enough power structure, although this podcast shifted my thinking and lowered my odds of that happening.

Half of the world, the half that’s ahead in the AGI race right now, has been doing very well with centralized power for the last couple of centuries. That sounds like decent odds, if you’re willing to race for AGI, Aschenbrenner-style. But not as good as I’d like.

And even if we get a personal intent aligned AGI controlled by a democratic government, that democracy only needs to fail once. The newly self-appointed Emperor may well be able to maintain power for all of eternity and all of the light cone.

But that democracy (or other power structure, e.g., a multinational AGI consortium) doesn’t need to last forever. It just needs to last until we have a long (enough) reflection, and use that personal intent aligned AGI (ASI by that time) to complete acceptable value alignment.

Thinking about the risk of centralized power over AGI makes me wonder if we should not only try to put AGI into an international consortium, but also make the condition for power in that organization not technical expertise, but adequate intelligence and knowledge combined with the most incorruptible good character we can find. That’s an extremely vague thought.

I’m no expert in politics, but even I can imagine many ways that goal would be distorted. After all, that’s the goal of pretty much every power selection, and that often goes awry, either through candidates that lie to the public, closed-door power-dealing that benefits those choosing candidates, or outright coups for dictatorship, organized with promises and maintained by a hierarchy of threats.

Anyway, that’s how I currently see our situation. I’d love to see, or be pointed to, alternate takes from others who’ve thought about how power structures might interact with personal intent aligned AGI.

Edit: the rest of his “defensive acceleration (d/​acc)” proposal is pretty interesting, but primarily if you’ve got longer timelines or are less focused on AGI risk.

  1. ^

    It seems like the alignment community has been assuming that takeoff would be faster than government recognition of AGI’s unlimited potential, so governments wouldn’t be involved. I think updating away from this “inattentive world hypothesis” is one of several subtle updates needed for the medium takeoff scenario we’re anticipating. I had avoided mentioning how likely government takeover is, not wanting to upset the applecart, but after Aschenbrenner’s Situational Awareness shouted it from the rooftops, I think we’ve got to assume that government control of AGI projects is likely, if not inevitable.

  2. ^

    I’m adopting the term “personal intent alignment” for things like instruction-following and corrigibility in the Harms or Christiano senses, linked above. I’ll use that until someone else comes up with a better term.

    This follows Evan Hubinger’s use of “intent alignment” for the broader class of successful alignment; “personal intent alignment” designates a narrow section of that broader class. An upcoming post goes into this in more detail and will be linked here in an edit.

  3. ^

    Brief thoughts on the other options for surviving AGI:

    A runner-up option is Buterin’s proposal of merging with AI, which I also think isn’t a solution to alignment, since AGI seems likely to arrive far faster than strong BCI tech.

    Convincing everyone to Shut It Down is particularly hard in that most humans aren’t utilitarians or longtermists. They’d take a small chance of survival for themselves and their loved ones over a much better chance of eventual utopia for everyone. The wide variance in preferences and beliefs makes it even harder to get everyone who could make AGI to refrain from making it, particularly as technology advances and that class expands. I’m truly confused about what people are hoping for when they advocate shutting it all down. Do they really just want to slow it down to work on alignment, while raising the risk that it’s China or Russia that achieves it? If so, are they accounting for the (IMO strong) possibility that they’d make instruction-following AGI perfectly loyal to a dictator? I’m truly curious.

    I’m not sure AGI in the hands of a dictator is actually long-term bad for humanity; I suspect a dictator would have to be both strongly sociopathic and sadistic not to share their effectively unlimited wealth at some point in their own evolution. But I’d hate to gamble on this.

    Shooting for full value alignment seems like a stronger option. It’s sort of continuous with the path of getting intent-aligned AGI into trustworthy hands, because you’d need someone pretty altruistic to even try it, and they could re-align their AGI toward value alignment at any time they choose. But I follow Yudkowsky & co. in thinking that any such attempt is likely to drift ever farther from the mark as the AGI interprets its instructions or examples differently as it learns more. Nonetheless, I think it’s worth analyzing how a constitution expressed in language might permanently stabilize an AGI/ASI.