Fear of centralized power vs. fear of misaligned AGI: Vitalik Buterin on 80,000 Hours
Vitalik Buterin wrote an impactful blog post, My techno-optimism. I found this discussion of one aspect of it on 80,000 Hours much more interesting. The remainder of that interview is nicely covered in the host’s EA Forum post.
My techno-optimism apparently appealed to both sides, e/acc and doomers. Buterin’s approach to bridging that polarization was interesting. I hadn’t understood before the extent to which anti-AI-regulation sentiment is driven by fear of centralized power. I hadn’t thought much about this risk, since it didn’t seem relevant to AGI risk, but I’ve been updating toward thinking it’s highly relevant.
[this is automated transcription that’s accurate and comically inaccurate by turns :)]
Rob Wiblin (the host) (starting at 20:49):
what is it about the way that you put the reasons to worry that that ensured that kind of everyone could get behind it
Vitalik Buterin:
[...] in addition to taking you know the case that AI is going to kill everyone seriously I the other thing that I do is I take the case that you know AI is going to take create a totalitarian World Government seriously [...]
[...] then it’s just going to go and kill everyone but on the other hand if you like take some of these uh you know like very naive default solutions to just say like hey you know let’s create a powerful org and let’s like put all the power into the org then yeah you know you are creating the most like most powerful big brother from which There Is No Escape and which has you know control over the Earth and and the expanding light cone and you can’t get out right and yeah I mean this is something that like uh I think a lot of people find very deeply scary I mean I find it deeply scary um it’s uh it is also something that I think realistically AI accelerates right
One simple takeaway is to recognize and address that motivation for anti-regulation and pro-AGI sentiment when trying to work with or around the e/acc movement. A second is to decide whether to take that fear seriously.
Is centralized power controlling AI/AGI/ASI a real risk?
Vitalik Buterin is from Russia, where centralized power has been terrifying. The same has been true for roughly half of the world. Those concerned with the risks of centralized power (including Western libertarians) worry that AI increases that risk if it’s centralized. This puts them in conflict with x-risk worriers on regulation and other issues.
I used to hold both of these beliefs, which allowed me to dismiss those fears:
AGI/ASI will be much more dangerous than tool AI, and it won’t be controlled by humans
Centralized power is pretty safe (I’m from the West like most alignment thinkers).
Now I think both of these are highly questionable.
I’ve thought in the past that fears of tool AI are largely unfounded. The much larger risk is AGI, and that risk is even larger if AGI is decentralized/proliferated. But I’ve been progressively more convinced that governments will take control of AGI before it becomes ASI. They don’t need to build it, just show up and inform the creators that, as a matter of national security, they’ll be making the key decisions about how it’s used and aligned.[1]
If you don’t trust Sam Altman to run the future, you probably don’t like the prospect of Putin or Xi Jinping as world-dictator-for-eternal-life. It’s hard to guess how many world leaders are sociopathic enough to have a negative empathy-sadism sum, but power does seem to select for sociopathy.
I’ve thought that humans won’t control ASI, because it’s value alignment or bust. There’s a common intuition that an AGI, being capable of autonomy, will have its own goals, for good or ill. I think it’s perfectly coherent for it to effectively have someone else’s goals; its “goal slot” is functionally a pointer to someone else’s goals. I’ve written about this in Instruction-following AGI is easier and more likely than value aligned AGI, and Max Harms has written about a very similar approach, in more depth and with more clarity and eloquence, in his CAST: Corrigibility As Singular Target sequence. I think this is also roughly what Christiano means by corrigibility. I’ll call this personal intent alignment until someone comes up with a better term.
I now think that even if we solved value alignment, no one would implement that solution. People who are in charge of things (like AGI projects) like power. If they don’t like power enough, someone else will rapidly take it from them. The urge to have your nascent godling follow your instructions, not some questionable sum of everyone’s values, is bolstered by the (IMO strong) argument that following your instructions is safer than attempting value alignment. In a moderately slow takeoff, you have time to monitor and instruct its development, and you can instruct it to shut down if its understanding of other instructions is going off the rails (corrigibility).
It looks to me like personal intent alignment[2] (“corrigibility”) is both more tempting to AGI creators and an easier target to hit than value alignment. I wish that value alignment were the more viable option. But wishing won’t make it so. To the extent that’s correct, putting AGI into existing power structures is a huge risk even with technical alignment solved.
Centralized power is not guaranteed to keep going well, particularly with AGI added to the equation. AGI could ensure a dictator stays in power indefinitely.
This is a larger topic, but I think the risk of centralized power is this: those who most want power and who fight for it most viciously tend to get it. That’s a very bad selection effect. Fair democracy with good information about candidates can counteract this tendency to some extent, but that’s really hard. And AGI will entice some of the worst actors to try to get control of it. The payoff for a coup is suddenly even higher.
What can be done
Epistemic status: this is even farther removed from the podcast’s content; it’s just my brief take on the current strategic situation after updating from that podcast. I’ve thought about this a lot recently, but I’m sure there are more big updates to make.
This frightening logic leaves several paths to survival. One is to make personal intent aligned AGI and get it into the hands of a trustworthy-enough power structure. The second is to create a value-aligned AGI, release it as a sovereign, and hope we got its motivations exactly right on the first try. The third is to Shut It All Down, by arguing convincingly that the first two paths are unlikely to work, and convincing every human group capable of creating AGI to stop that work. None of these seem easy.[3]
As for which of these is least doomed, reasonable opinions vary widely. I’d really like to see the alignment community work together to identify cruxes, so we can present a united front to policy-makers instead of a buffet of expert opinions for them to choose from according to their biases.
Of these, getting personal intent aligned AGI into trustworthy hands seems least doomed to me. I continue to think that We have promising alignment plans with low taxes for the types of AGI that seem most likely to happen at this point. Existing critiques of those plans are not crippling, and the plans seem to bypass the most severe of the List of Lethalities. Further critiques might change my mind. However, those plans all work much better if they’re aimed at personal intent alignment rather than full value alignment with all of humanity.
It seems as though we’ve got a decent chance of getting that AGI into a trustworthy-enough power structure, although this podcast shifted my thinking and lowered my odds of that happening.
Half of the world, the half that’s ahead in the AGI race right now, has been doing very well with centralized power for the last couple of centuries. That sounds like decent odds, if you’re willing to race for AGI, Aschenbrenner-style. But not as good as I’d like.
And even if we get a personal intent aligned AGI controlled by a democratic government, that democracy only needs to fail once. The newly self-appointed Emperor may well be able to maintain power for all of eternity and all of the light cone.
But that democracy (or other power structure, e.g., a multinational AGI consortium) doesn’t need to last forever. It just needs to last until we have a long (enough) reflection, and use that personal intent aligned AGI (ASI by that time) to complete acceptable value alignment.
Thinking about the risk of centralized power over AGI makes me wonder whether we should not only try to put AGI into an international consortium, but also make the condition for power in that organization not technical expertise, but adequate intelligence and knowledge combined with the most incorruptible good character we can find. That’s an extremely vague thought.
I’m no expert in politics, but even I can imagine many ways that goal would be distorted. After all, that’s the goal of pretty much every power selection, and that often goes awry, either through candidates that lie to the public, closed-door power-dealing that benefits those choosing candidates, or outright coups for dictatorship, organized with promises and maintained by a hierarchy of threats.
Anyway, that’s how I currently see our situation. I’d love to see, or be pointed to, alternate takes from others who’ve thought about how power structures might interact with personal intent aligned AGI.
Edit: the rest of his “defensive acceleration (d/acc)” proposal is pretty interesting, but primarily if you’ve got longer timelines or are less focused on AGI risk.
- ^
It seems like the alignment community has been assuming that takeoff would be faster than government recognition of AGI’s unlimited potential, so governments wouldn’t be involved. I think this “inattentive world hypothesis” is one of several subtle updates needed for the medium takeoff scenario we’re now anticipating. I hadn’t wanted to mention how likely government takeover is, for fear of upsetting the applecart, but after Aschenbrenner’s Situational Awareness shouted it from the rooftops, I think we’ve got to assume that government control of AGI projects is likely if not inevitable.
- ^
I’m adopting the term “personal intent alignment” for things like instruction-following and corrigibility in the Harms or Christiano senses, linked above. I’ll use that until someone else comes up with a better term.
This follows Evan Hubinger’s use of “intent alignment” for the broader class of successful alignment, and designates personal intent alignment as a narrow section of that broader class. An upcoming post goes into this in more detail and will be linked here in an edit.
- ^
Brief thoughts on the other options for surviving AGI:
A runner-up is Buterin’s proposal of merging with AI, which I also think isn’t a solution to alignment, since AGI seems likely to arrive far faster than strong BCI tech.
Convincing everyone to Shut It Down is particularly hard in that most humans aren’t utilitarians or longtermists. They’d take a small chance of survival for themselves and their loved ones over a much better chance of eventual utopia for everyone. The wide variance in preferences and beliefs makes it even harder to get everyone who could make AGI to not make it, particularly as technology advances and that class expands. I’m truly confused about what people are hoping for when they advocate shutting it all down. Do they really just want to slow it down to work on alignment, while raising the risk that it’s China or Russia that achieves it? If so, are they accounting for the (IMO strong) possibility that they’d make instruction-following AGI perfectly loyal to a dictator? I’m truly curious.
I’m not sure AGI in the hands of a dictator is actually long-term bad for humanity; I suspect a dictator would have to be both strongly sociopathic and sadistic not to share their effectively unlimited wealth at some point in their own evolution. But I’d hate to gamble on this.
Shooting for full value alignment seems like a stronger option. It’s sort of continuous with the path of getting intent-aligned AGI into trustworthy hands, because you’d need someone pretty altruistic to even try it, and they could re-align their AGI for value alignment any time they choose. But I follow Yudkowsky & co. in thinking that any such attempt is likely to drift ever farther from the mark as the AGI interprets its instructions or examples differently as it learns more. Nonetheless, I think how a constitution in language might permanently stabilize an AGI/ASI is worth thinking about.
Is there an option which is “personal intent aligned AGI, but there are 100 of them”? Maybe most governments have one, maybe some companies or rich individuals have one. Average Joes can rent a fine-tuned AGI by the token, but there are some limits on what values they can tune it to. There’s a balance of power between the AGIs similar to the balance of power between countries in 2024. Any one AGI could in theory destroy everything, except that the other 99 would oppose it, so they pre-emptively prevent the creation of any AGI that would destroy everything.
AGIs have close-to-perfect information about each other and thus mostly avoid war because they know who would win, and the weaker AGI just defers in advance. If we get the balance right, no one AGI has more than 50% of the power, hopefully none have more than 20% of the power, such that no one can dominate.
There’s a spectrum from “power is distributed equally amongst all 8 billion people in the world” to “one person or entity controls everything”, and this world might be somewhat more towards the unequal end than we have now, but still sitting somewhere along the spectrum.
I guess even if the default outcome is that the first AGI gets such a fast take-off it has an unrecoverable lead over the others, perhaps there are approaches to governance that distribute power to ensure that doesn’t happen.
I wish. I think a 100-way multipolar scenario would be too unstable to last more than a few years.
I think AGI that can self-improve produces really bad game theoretic equilibria, even when the total situation is far from zero-sum. I’m afraid that military technology favors offense over defense, and conflict won’t be limited to the infosphere. It seems to me that the nature of physics makes it way easier to blow stuff up than to make stuff that resists being blown up. Nukes produced mutually assured destruction because they’re so insanely destructive, and controlled only by nation-states with lots of soft targets. That situation isn’t likely to continue with 100 actors that can make new types of splodey weapons.
I hope I’m wrong.
This is probably worth a whole post, but here’s the basic logic. Some of this is crazy scifi stuff, but that’s the business we’re in here, I think, as AGI progresses to ASI.
If 100 people have AGI, some of those will pretty quickly become ASI: something so much smarter than humans that it can pretty quickly invent technologies that are difficult to anticipate. It seems like everyone will want to make their AGIs smarter as fast as possible, so they can do more good (by the holder’s definition) as well as serve better for defense or offense against potentially hostile actors. Even if everyone agrees not to develop weapons, it seems like the locally sane option is to at least come up with plans to quickly develop and deploy them, and probably to develop some.
A complete ban on developing weapons, and a strong agreement to split the enormous potential in some way might work—but I have no idea how we get from the current geopolitical state to there.
This hypothetical AGI-developed weaponry doesn’t have to be crazy scifi to be really dangerous. Let’s think of just drones carrying conventional explosives. Or kinetic strikes from orbit that can do about as much or as little damage as is needed for a particular target.
Now, why would anyone be shooting at each other in the first place? Hopefully they wouldn’t, but it would be very tempting to do, and kind of crazy to not build at least a few terrifying weapons to serve as deterrents.
Those 100 entities with AGI might be mostly sane and prefer the vastly larger pie (in the short term) from cooperation to getting a bigger slice of a smaller pie. But that pie will grow again very rapidly if someone decides to fight for full control of the future before someone else does. Those 100 AGI holders will have different ideologies and visions of the future, even if they’re all basically sane and well-intentioned.
Is this scenario stable? As long as you know who fired, you’ve got mutually assured destruction (sort of, but maybe not enough for total deterrence with more limited weapons and smaller conflicts). Having a way to hide or false-flag attacks disrupts that: just set two rivals on each other if they stand in your way.
The scarier, more complete solution to control the future is to eliminate this whole dangerous multipolar intelligence explosion at the source: Earth. As they say, the only way to be sure is to nuke it from orbit. It seems like one rational act is to at least research how to get a “seed” spacecraft out of the system, and look for technologies or techniques that could make the Sun go nova. Presto, peace has been achieved under the one remaining ASI. You’d want to know if your rivals could do such a thing, and now that you’ve researched it, maybe you have the capability, too. I’d assume everyone would.
I guess one way to avoid people blowing up your stuff is to just run and hide. One scenario is a diaspora, in which near-C sailships head out in every direction, with the goal of scattering and staying low-profile while building new civilizations. Some existing people might make it in the form of uploads sent with the von Neumann probes or later by laser. But the culture and values could be preserved, and similar people could be recreated in the flesh or as uploads. The distances and delays might take the pressure off, and make building your own stuff far more attractive than fighting anyone else. And there could be retaliation agreements that actually work as deterrence if they’re designed by better minds than ours.
This is as far as I’ve gotten with this scenario. Hopefully there’s some stable equilibrium, and AGI can help us navigate to it even while we’re still in charge. Maybe it’s another MAD thing, or meticulous surveillance of every piece of technology anyone develops.
But it seems way easier to achieve equilibrium on Earth the smaller the number of actors that have AGI capable of RSI.
Because they would immediately organize a merger! The most efficient negotiation outcome will line up with preference utilitarianism. War is a kind of waste; there’s always a better deal that could be made.
I also think there are plenty of indications (second section) that the mutual transparency required to carry out a fair (or “chaa”) merger is going to be pretty trivial with even slightly more advanced information technology.
I got a good way through the setup to your first link. It took a while. If you’d be so kind, it would be nice to have a summary of why you think that rather dense set of posts is so relevant here? What I read did not match your link text (“The most efficient negotiation outcome will line up with preference utilitarianism.”) closely enough for this purpose. In some cases, I can get more of my preferences with you eliminated; no negotiation necessary :).
The setup for that post was a single decision, with the failure to cooperate being pretty bad for both parties. The problem here is that that isn’t necessarily the case; the winner can almost take all, depending on their preferences. They can get their desired future in the long run, sacrificing only the short run, which is tiny if you’re really a longtermist. And the post doesn’t seem to address the iterated case; how do you know if someone’s going to renege, after some previous version of them has agreed to “fairly” split the future?
So I don’t understand how the posts you link resolve that concern. Sure with sufficient intelligence you can get “chaa” (from your linked post: “fair”, proportional to power/ability to take what you want), but what if “chaa” is everyone but the first actor dead?
If the solution is “sharing source code” as in earlier work I don’t think that’s at all applicable to network-based AGI; the three body problem of prediction applies in spades.
Hmm well I’d say it gets into that immediately, but it does so in a fairly abstract way. I’d recommend the whole lot though. It’s generally about what looks like a tendency in the math towards the unity of various bargaining systems.
A single decision can be something like “who to be, how to live, from now on”. There isn’t a strict distinction between single decision and all decisions from then on when acts of self-modification are possible, as self-modification changes all future decisions.
On reflection, I’m not sure bargaining theory undermines the point you were making; I do think it’s possible that one party or another would dominate the merger, depending on what the technologies of superintelligent war turn out to be and how much the utility functions of the participants care about near-term strife.
But the feasibility of converging towards merger seems like a relevant aspect of all of this.
Transparency aids (or suffices for) negotiation, but there won’t be much of a negotiation if, say, having nukes turns out to be a very weak bargaining chip and the power distribution is just about who gets nanotech[1] first or whatever, or if it turns out that human utility functions don’t care as much about loss of life in the near term as they do about owning the entirety of the future. I think the latter is very unlikely and the former is debatable.
I don’t exactly believe in “nanotech”. I think materials science advances continuously and practical molecularly precise manufacturing will tend to look like various iterations of synthetic biology (you need a whole lot of little printer heads in order to make a large enough quantity of stuff to matter). There may be a threshold here, though, which we could call “DNA 2.0” or something, a form of life that uses stronger things than amino acids.
It seems to me that, to whatever degree that that’s true, it’s because the “centralized” power is relatively decentralized (and relatively tightly constrained). There’s a formal power structure, but it has a lot of rules and a lot of players and some number of the ever-popular checks and balances. It’s relatively hard even for the Big Kahuna, or even the Big Kahuna and a few cronies, to make the whole power structure do much of anything, and that’s an intentional part of the design.
If you create an ASI that can and will implement somebody’s personal intent, you’re not trusting the power structure; you’re trusting that person. And if you try to make it more “constitutional” or more about collective intent, you suddenly run into a bunch of complex issues that look more and more like the ones you’d get with “value alignment”[1].
I’m also not sure that anybody actually is doing so well with centralized power structures that it’s very comfortable to trust those power structures with “one mistake and we’re all screwed forever” levels of power. It’s not like anybody in the half of the world you’re talking about has been infallible.
I still can’t bring myself to use the word “aligned” without some kind of distancing, hence the quotes.
I don’t think intent aligned AI has to be aligned to an individual—it can also be intent aligned to humanity collectively.
One thing I used to be concerned about is that collective intent alignment would be way harder than individual intent alignment, making someone validly have an excuse to steer an AI to their own personal intent. I no longer think this is the case. Most issues with collective intent I see as likely also affecting individual intent (e.g. literal instruction following vs extrapolation). I see two big issues that might make collective intent harder than individual intent. One is biased information on people’s intents and another is difficulty of weighting intents for different people. On reflection though, I see both as non-catastrophic, and an imperfect solution to them likely being better for humanity as a whole than following one person’s individual intent.
Interesting. It currently seems to me like collective intent alignment (which I think is what I’m calling value alignment? more on that below) is way harder than personal intent alignment. So I’m curious where our thinking differs.
I think people are going to want instruction following, not inferring intent from other sources, because they won’t trust the AGI to accurately infer intent that’s not explicitly stated.
I know that’s long been considered terribly dangerous; if you tell your AGI to prevent cancer, it will kill all the humans who could get cancer (or other literal-genie hijinks). I think those fears are not realistic with an early-stage AGI in a slow takeoff. With an AGI not too far from human level, it would take time to do something big like curing cancer, so you’d want to have a conversation about how it understands the goal and what methods it will use before letting it use time and resources to research and make plans, and again before they’re executed (and probably many times in the middle). And even LLMs infer intent from instructions pretty well; they know that curing cancer means not killing the host.
In that same slow takeoff scenario that seems likely, concerns about it getting complex inference wrong are much more realistic. Humanity doesn’t know what its intent is, so that machine would have to be quite competent to deduce it correctly. The first AGIs seem likely to not be that smart at launch. The critical piece is that short-term personal intent includes an explicit instruction for the AGI to shut down for re-alignment; humanity’s intent will never be that specific, except for cases where it’s obvious that an action will be disastrous; and if the AGI understands that, it wouldn’t take that action anyway. So it seems to me that personal intent alignment allows a much better chance of adjusting imperfect alignment.
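To make that concrete, here’s a minimal sketch of the check-before-acting loop I have in mind. The object and function names (agent.interpret, agent.plan, ask_human) are hypothetical placeholders of mine, not an existing API, and a real system would obviously need far more than this:

```python
# A minimal sketch of "do what I mean and check": the agent surfaces its
# interpretation and its plan, and gets explicit human confirmation, before
# spending resources or acting. All names here are hypothetical placeholders.
def dwimac_step(instruction, agent, ask_human):
    interpretation = agent.interpret(instruction)   # agent states what it thinks you mean
    if not ask_human(f"I understood this as: {interpretation}. Proceed?"):
        return "stopped: interpretation rejected"

    plan = agent.plan(interpretation)               # methods, resources, expected side effects
    if not ask_human(f"My plan: {plan}. Proceed?"):
        return "stopped: plan rejected"

    # Execution is incremental; the human can halt at any checkpoint,
    # including issuing a shutdown-for-realignment instruction.
    for step in plan.steps:
        if not ask_human(f"Next step: {step}. Continue?"):
            return "stopped mid-plan"
        agent.execute(step)
    return "done"
```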
I discuss this more in Instruction-following AGI is easier and more likely than value aligned AGI.
With regard to your “collective intent alignment”, would that be the same thing I’m calling value alignment? I don’t think humanity has a real collective short-term intent on most matters; people want very different things. They’d agree on not having humanity made extinct, but beyond that, opinions on what people would like to see happen vary broadly. So any collective intent would seem to be longer-term, and so vague as to be equivalent to values (people don’t know what they want in the long term specifically, but they have values that allow them to recognize futures they like or don’t like).
Anyway, I’m curious where and why your view differs. I take this question to be critically important, so working out whether my view is right matters a lot.
Although I guess the counterargument to its importance is: even if collective alignment was just as easy, the people building AGI would probably align it to their interests instead, just because they like their value/intent more than others’.
IMO: if an AI can trade off between different wants/values of one person, it can do so between multiple people also.
This applies to simple surface wants as well as deep values.
I had trouble figuring out how to respond to this comment at the time because I couldn’t figure out what you meant by “value alignment” despite reading your linked post. After reading your latest post, Conflating value alignment and intent alignment is causing confusion, I still don’t know exactly what you mean by “value alignment” but at least can respond.
What I mean is:
If you start with an intent-aligned AI following the most surface-level desires/commands, you will want to make it safer and more useful by having common sense, “do what I mean”, etc. As long as you surface-level want it to understand and follow your meta-level desires, it can step up that ladder, etc.
If you have a definition of “value alignment” that is different from what you get from this process, then I currently don’t think that it is likely to be better than the alignment from the above process.
In the context of collective intent alignment:
If you have an AI that only follows commands, with no common sense etc., and it’s powerful enough to take over, you die. I’m pretty sure some really bad stuff is likely to happen even if you have some “standing orders”. So, I’m assuming people would actually deploy only an AI that has some understanding of what the person(s) it’s aligned with wants, beyond the mere text of a command (though not necessarily super-sophisticated). But once you have that, you can aggregate how much people want between humans for collective intent alignment.
I’m aware people want different things, but don’t think it’s a big problem from a technical (as opposed to social) perspective—you can ask how much people want the different things. Ambiguity in how to aggregate is unlikely to cause disaster, even if people will care about it a lot socially. Self-modification will cause a convergence here, to potentially different attractors depending on the starting position. Still unlikely to cause disaster. The AI will understand what people actually want from discussions with only a subset of the world’s population, which I also see as unlikely to cause disaster, even if people care about it socially.
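For concreteness, here’s a toy sketch of the kind of aggregation I mean; the people, options, and scores are made-up illustrations, not a proposal for how aggregation should actually be done:

```python
# Toy aggregation of stated preference strengths across people. The point is
# only that asking "how much do you want X?" gives numbers that can be
# combined, and that different aggregation rules give different but similarly
# reasonable answers rather than anything catastrophic.
options = ["option_A", "option_B", "option_C"]
stated_wants = {                       # each person's reported strength of desire, 0-10
    "alice": {"option_A": 9, "option_B": 2, "option_C": 5},
    "bob":   {"option_A": 3, "option_B": 8, "option_C": 5},
    "carol": {"option_A": 6, "option_B": 7, "option_C": 4},
}

def aggregate(wants, rule):
    scores = {}
    for opt in options:
        values = sorted(person[opt] for person in wants.values())
        if rule == "mean":
            scores[opt] = sum(values) / len(values)
        elif rule == "median":
            scores[opt] = values[len(values) // 2]
    return max(scores, key=scores.get), scores

for rule in ("mean", "median"):
    choice, scores = aggregate(stated_wants, rule)
    print(rule, scores, "->", choice)
# With these made-up numbers, the mean rule picks option_A and the median rule
# picks option_B: different but both defensible, which is the kind of ambiguity
# I'd expect to be socially contentious but not catastrophic.
```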
From a social perspective, obviously a person or group who creates an AI may be tempted to create alignment to themselves only. I just don’t think collective alignment is significantly harder from a technical perspective.
“Standing orders” may be desirable initially as a sort of training wheels even with collective intent, and yes that could cause controversy as they’re likely not to originate from humanity collectively.
How would your position change if you knew how to implement AI RSI but didn’t have a way to align it and could see flaws in every alignment proposal you’d read?
I’d “play to the outs” by questioning whether there are gaps in my understanding of why alignment is sure to fail, and looking for methods that might exploit those gaps. That is in fact what I’ve done to identify the methods I mentioned. You didn’t say whether you’ve read those.
I can certainly see “flaws” in every alignment proposal I’ve ever read. But for some of them the flaws aren’t so severe that I’m anywhere near sure they’ll fail. That provides a vastly better chance than trying to Shut it All Down without a viable plan.
I suppose I’m obligated to reply after I posted what I did, so, looking at “We have promising alignment plans with low taxes”:
The system design of “Learning & Steering” has very high “taxes”. It doesn’t work with fully recursive self-improvement. With linear chains of self-improvement, looking at humans indicates that it works less well the more you do.
I’m 0% worried about AutoGPT. RSI requires self-modification requires control of (something like) training. In that case, “Internal independent review for language model agent alignment” doesn’t work.
The Natural Abstraction Hypothesis is wrong when comparing different levels of intelligence. Bees and humans both have concepts for distance, but bees don’t have a concept for electrical resistance. So “Just Retarget The Search” doesn’t work for ASI.
Thanks for taking a look! You’re not obligated to pursue this farther, although I do really want to get some skeptics to fully understand these proposals to poke holes in them.
I don’t think any of these are foolproof, but neither are they guaranteed or even likely to fail, AFAICT from the critical analysis people have offered so far. To your specific points (which I am probably misunderstanding in places, sorry for that):
I’m not following, I think? Since all capable RL systems are already actor-critic of some sort, it seems like the alignment tax for this one is very near zero. And it seems like it works with recursive self-improvement the way any primary goal does: it’s reflectively stable. Making a decision to change your goal is completely against any decision system that effectively pursues that goal and can anticipate outcomes.
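For what it’s worth, here’s a toy sketch of what I mean by the machinery already being there: a bare-bones actor-critic bandit learner where a steering term is simply added to the reward signal the critic already learns from. The bandit setup, the steering rule, and all numbers are illustrative assumptions of mine, not the design from the post under discussion:

```python
# Toy actor-critic on a 2-armed bandit, with a "steering" term added to the
# reward signal. Everything here is an illustrative assumption; the point is
# only that critic-side steering reuses machinery a capable RL system has anyway.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 2
prefs = np.zeros(n_actions)        # actor: action preferences (softmax policy)
value = 0.0                        # critic: running estimate of expected reward
alpha_actor, alpha_critic = 0.1, 0.1

def env_reward(action):
    # Arm 1 pays more on average (task reward only).
    return rng.normal(loc=[0.0, 1.0][action], scale=0.5)

def steering_signal(action):
    # Hypothetical overseer-supplied term, e.g. penalizing a disallowed action.
    return -2.0 if action == 1 else 0.0

for step in range(2000):
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()
    action = rng.choice(n_actions, p=probs)
    # The critic's training target is task reward plus the steering term,
    # i.e. the same learning update the system already performs.
    r = env_reward(action) + steering_signal(action)
    td_error = r - value
    value += alpha_critic * td_error
    # Policy-gradient-style update for the actor, using the critic as baseline.
    grad = -probs
    grad[action] += 1.0
    prefs += alpha_actor * td_error * grad

print("final policy:", np.round(probs, 3))   # leans toward arm 0 despite arm 1's higher task reward
```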
I’m also 0% worried about AutoGPT. I described the systems I am worried about in Capabilities and alignment of LLM cognitive architectures. I agree that they need to do online learning. That could be done with a better version of episodic memory, or weight retraining, or both. As above, if the same decision algorithm is used to decide what new knowledge to incorporate, and the system is smart enough to anticipate failure modes (and accurate predictions of outcomes seems necessary to be dangerous), it will avoid taking actions (including incorporating new knowledge) that change its current alignment/goals.
I don’t think the natural abstraction hypothesis has to be true for retargeting the search to work. You just need to identify a representation of goals you like. That’s harder if the system doesn’t use exactly the same abstractions you do, but it’s far from obvious that it’s impossible—or even difficult. Value is complex and fragile, but nobody has argued convincingly for just how complex and fragile.
I’m evaluating these in the context of a slow takeoff, and using the primary goal as something like “follow this guy’s instructions” (Instruction-following AGI is easier and more likely than value aligned AGI). This provides some corrigibility and the ability to use the nascent AGI as a collaborator in improving its alignment as it grows more capable, which seems to me like it should help any technical alignment approach pretty dramatically.
Let’s say a parent system is generating a more-capable child system. If the parent could perfectly predict what the child system does, it wouldn’t need to make it in the first place. Your argument here assumes that either
the parent system can perfectly predict how well a child will follow its values
or the parent system won’t have its metrics gamed if it evaluates performance in practice
But humans are evidence that the effectiveness of both predictions and evaluations is limited. And my understanding of RSI indicates that effectiveness will be limited for AI too. So the amount of self-improvement that can be controlled effectively is very limited: one or perhaps two stages. Value drift increases with the amount of RSI and can’t be prevented.
Alignment/value drift is definitely something I’m concerned about.
I wrote about it in a paper, Goal changes in intelligent agents, and a post, The alignment stability problem.
But those are more about the problem. I’ve come around to thinking that reflective stability is probably enough to counteract value drift. But it’s not guaranteed by any means.
Value drift will happen, but the question is how much? The existing agent will try to give successors the same alignment/goals it has (or preserve its own goals if it’s learning or otherwise self-modifying).
So there are two forces at work: an attempt to maintain alignment by the agent itself, and an accidental drift away from those values. The question is how much drift happens in the sum of those two forces.
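As a toy illustration of how those two forces might net out (my own construction, with made-up parameters, not a claim about real systems): suppose each generation adds random drift of size d, and the agent’s stabilization effort then pulls accumulated drift back toward the original target by a fraction c.

```python
# Toy model: accumulated value drift under repeated self-modification.
# Each step adds Gaussian drift of scale d; deliberate re-alignment then
# shrinks the accumulated drift by a fraction c. All parameters are illustrative.
import random

def simulate(steps, d, c, seed=0):
    random.seed(seed)
    drift = 0.0
    for _ in range(steps):
        drift += random.gauss(0.0, d)   # accidental drift this generation
        drift *= (1.0 - c)              # stabilization back toward the original target
    return drift

for c in (0.0, 0.1, 0.5):
    print(f"correction {c:.1f}: drift after 100 steps = {simulate(100, d=0.05, c=c):+.3f}")
# With c = 0 the drift random-walks without bound; with any steady correction it
# stays bounded. The open question is what d and c actually look like in practice.
```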
If we’re talking about successors, it’s exactly solving the alignment problem again. I’d expect AGI to be better at that than humans if it’s overall smarter/more cognitively competent. If it’s not, I wouldn’t trust it to solve that problem alone, and I’d want humans involved. That’s why I call my alignment proposal “do what I mean and check” (DWIMAC), a variant of instruction-following; I wouldn’t want a parahuman-level AGI doing important things (like aligning a successor) without consulting closely with its creators before acting.
Once it’s smarter than human, I’d expect its alignment attempts to be good enough to largely succeed, even though some small amount of drift/imperfections seems inevitable.
If we need a totally precise value alignment for success, that wouldn’t work. But it seems like there are a variety of outcomes we’d find quite good, so the match doesn’t need to be perfect; there’s room for some drift.
So this is a complex issue, but I don’t think it’s probably a showstopper. But it’s another question that deserves more thought before we launch a real AGI that learns, self-improves, and helps design successors.
I’m confident that’s wrong. (I also think you overestimate the stability of human values because you’re not considering the effect of stability of cultural environment.)
Why?
Consider how AutoGPT works. It spawns new processes that handle subtasks. But those subtasks are never perfectly aligned with the original task.
Again, it only works to a limited extent in humans.
That’s not the right way of thinking about it. There isn’t some threshold where you “solve the alignment problem” completely and then all future RSI has zero drift. All you can do is try to improve how well it’s solved under certain circumstances. As the child system gets smarter the problem is different and more difficult. That’s why you get value drift at each step.
See also this post.
I think you’re saying there will be nonzero drift, and that’s a possible problem. I agree. I just don’t think it’s likely to be a disastrous problem.
That post, on “minutes from a human alignment meeting”, is addressing something I think of as important but different from drift: value mis-specification, or equivalently, value mis-generalization. That could be a huge problem without drift playing a role, and vice versa; I think they’re pretty separable.
I wasn’t trying to say what you took me to. I just meant that when each AI creates a successor, it has to solve the alignment problem again. I don’t think there will be zero drift at any point, just little enough to count as success. If an AGI cares about following instructions from a designated human, it could quite possibly create a successor that also cares about following instructions from that human. That’s potentially good enough alignment to make humans’ lives a lot better and prevent their extinction. Each successor might have slightly different values in other areas from drift, but that would be okay if the largest core motivation stays approximately the same.
So I think the important question is how much drift there will be and how close the value match needs to be. I tried to find all of the work/thinking on the question of how close a value match needs to be. But exactly how complex and fragile? addresses it, but the discussion doesn’t get far and nobody references other work, so I think we just don’t know and need to work that out.
I used the example of following human instructions because that also provides some amount of a basin of attraction for alignment, so that close-enough-is-good-enough. But even without that, I think it’s pretty likely that reflective stability provides enough compensation for drift to essentially work and provide good-enough alignment indefinitely.
But again, it’s a question of how much drift and how much stabilization/alignment each AGI can perform.
I’m not sure where exactly your intuition on this is coming from, but you’re wrong here, and I’m afraid it’s not a matter of my opinion. But I guess we’ll have to agree to disagree.
This is why I’m expecting an international project for safe AI. The USA government isn’t going to leave powerful AI in the hands of Altman or Google, and the rest of the world isn’t going to sit idly while the USA becomes the sole AGI powerhouse.
An international project to create utopian AI is the only path I can imagine which avoids MAD. If there’s a better plan, I haven’t heard it.
This describes why I want an international consortium to work on AGI. I’m afraid I don’t expect it as a likely outcome. It’s the sensible thing to do, but governments aren’t that great at doing the sensible thing, let alone working together, on a relatively short time frame.
I do think this is probably what we should be arguing and advocating for.
If this doesn’t happen, I don’t think we even get a MAD standoff; with two or more parties having RSI-capable AGI, it’s more like a non-iterated prisoner’s dilemma: whoever shoots first wins it all. That’s even worse. But that scenario hasn’t gotten nearly enough analysis, so I’m not sure.
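To illustrate the structure I mean (the payoff numbers are made up, chosen only to show the shape of the game): if the first-strike advantage is decisive, striking strictly dominates waiting in the single-shot game, so the equilibrium is mutual striking even though mutual waiting is better for everyone.

```python
# Illustrative single-shot payoff matrix for two RSI-capable AGI holders.
# Numbers are assumptions chosen only to show the structure, not estimates.
payoffs = {
    ("wait",   "wait"):   (5, 5),    # cooperate: both share a growing pie
    ("strike", "wait"):   (10, 0),   # the striker captures the future
    ("wait",   "strike"): (0, 10),
    ("strike", "strike"): (1, 1),    # mutual damage, but the faster mover survives it
}

for my_move in ("wait", "strike"):
    vs_wait   = payoffs[(my_move, "wait")][0]
    vs_strike = payoffs[(my_move, "strike")][0]
    print(f"{my_move:>6}: payoff {vs_wait} if they wait, {vs_strike} if they strike")
# "strike" is the better response to either move by the other side, so the
# single-shot equilibrium is (strike, strike), the bad outcome described above.
```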
Nonetheless,
I noticed an incomplete sentence there from nonetheless onwards.