Thank you for the long reply. The 2017 document postulates an “acute risk period” in which people don’t know how to align, and then a “stable period” once alignment theory is mature.
So if I’m getting the gist of things, rather than focus outright on the creation of a human-friendly superhuman AI, MIRI decided to focus on developing a more general theory and practice of alignment; and then once alignment theory is sufficiently mature and correct, one can focus on applying that theory to the specific crucial case, of aligning superhuman AI with extrapolated human volition.
But what’s happened is that we’re racing towards superhuman AI while the general theory of alignment is still crude, and this is a failure for the strategy of prioritizing general theory of alignment over the specific task of CEV.
Is that vaguely what happened?
The 2017 document postulates an “acute risk period” in which people don’t know how to align, and then a “stable period” once alignment theory is mature.
“Align” is a vague term. Let’s distinguish “strawberry alignment” (where we can safely and reliably use an AGI to execute a task like “Place, onto this particular plate here, two strawberries identical down to the cellular but not molecular level.”) from “CEV alignment” (where we can safely and reliably use an AGI to carry out a CEV-like procedure.)
Strawberry alignment seems vastly easier than CEV alignment to me, and I think it’s a similar task (in both difficulty and kind) to what we’ll need AGI to do in order to prevent humanity from killing itself with other AGIs.
The “acute risk period” is the period where we’re at risk of someone immediately destroying the world once they figure out how to build AGI (or once hardware scales to the required level, or whatever).
Figuring out how to do strawberry alignment isn’t sufficient for ending the acute risk period, since humanity then has to actually apply this knowledge and build and deploy an aligned AGI to execute some pivotal act. But I do think that figuring out strawberry alignment is the main obstacle; if we knew how to do that, I think humanity would have double-digit odds of surviving and flourishing.
The “stable period” is the period between “humanity successfully makes it the case that no one can destroy the world with AGI” and “humanity figures out how to ensure the long-term future is awesome”.
This stable period is very similar to the idea of a “long reflection” posited by Toby Ord and Will MacAskill, though the lengths of time they cite sound far too long to me, at least if we’re measuring in sidereal time. (With fast-running human whole-brain emulations, I think you could complete the entire “long reflection” in just a few sidereal years, without cutting any corners or taking any serious risks.)
So if I’m getting the gist of things, rather than focus outright on the creation of a human-friendly superhuman AI
“Human-friendly” and “superhuman” are both vague—strawberry-aligned task AGI is less robustly friendly, and less broadly capable, than CEV AGI. But strawberry-aligned AGI is still superhuman in at least some respects—heck, a pocket calculator is too—and it’s still friendly enough to do some impressive things without killing us.
Alignment is a matter of degree, and more ambitious tasks can be much harder to align.
MIRI decided to focus on developing a more general theory and practice of alignment;
Strawberry alignment is more “general” in the sense that we’re not trying to impart as many human-specific values into the AGI (though we still need to impart some).
But it’s less “general” in the sense that strawberry-grade alignment is likely to be much more brittle than CEV-grade alignment, and strawberry-grade alignment is much more dependent on us carefully picking exactly the right tasks and procedures to make the alignment work.
But what’s happened is that we’re racing towards superhuman AI while the general theory of alignment is still crude, and this is a failure for the strategy of prioritizing general theory of alignment over the specific task of CEV.
No. If we’d focused on CEV-grade alignment over strawberry-grade alignment, we’d be in even worse shape if anything.
The problem is that timelines look short, so it’s looking more difficult to figure out strawberry alignment in time to prevent human extinction. We should nonetheless make strawberry alignment humanity’s top priority, and put an enormous effort into it, because there isn’t a higher-probability path to good outcomes. (AFAICT, anyway. Having at least some people try to prove me wrong here obviously seems worthwhile too.)
CEV alignment is even harder than strawberry alignment (by a large margin), so short timelines are much more of a problem for the ‘rush straight to CEV alignment’ plan than for the ‘do strawberry alignment first, then CEV afterwards’ plan.
The “stable period” is supposed to be a period in which AGI already exists, but nothing like CEV has yet been implemented, and yet “no one can destroy the world with AGI”. How would that work? How do you prevent everyone in the whole wide world from developing unsafe AGI during the stable period?
Use strawberry alignment to melt all the computing clusters containing more than 4 GPUs. (Not actually the best thing to do with strawberry alignment, IMO, but anything you can do here is outside the Overton Window, so I picked something of which I could say “Oh but I wouldn’t actually do that” if pressed.)
I think there are multiple viable options, like the toy example EY uses:
I think that after AGI becomes possible at all and then possible to scale to dangerously superhuman levels, there will be, in the best-case scenario where a lot of other social difficulties got resolved, a 3-month to 2-year period where only a very few actors have AGI, meaning that it was socially possible for those few actors to decide to not just scale it to where it automatically destroys the world.
During this step, if humanity is to survive, somebody has to perform some feat that causes the world to not be destroyed in 3 months or 2 years when too many actors have access to AGI code that will destroy the world if its intelligence dial is turned up. This requires that the first actor or actors to build AGI, be able to do something with that AGI which prevents the world from being destroyed; if it didn’t require superintelligence, we could go do that thing right now, but no such human-doable act apparently exists so far as I can tell.
So we want the least dangerous, most easily aligned thing-to-do-with-an-AGI, but it does have to be a pretty powerful act to prevent the automatic destruction of Earth after 3 months or 2 years. It has to “flip the gameboard” rather than letting the suicidal game play out. We need to align the AGI that performs this pivotal act, to perform that pivotal act without killing everybody.
Parenthetically, no act powerful enough and gameboard-flipping enough to qualify is inside the Overton Window of politics, or possibly even of effective altruism, which presents a separate social problem. I usually dodge around this problem by picking an exemplar act which is powerful enough to actually flip the gameboard, but not the most alignable act because it would require way too many aligned details: Build self-replicating open-air nanosystems and use them (only) to melt all GPUs.
Since any such nanosystems would have to operate in the full open world containing lots of complicated details, this would require tons and tons of alignment work, is not the pivotal act easiest to align, and we should do some other thing instead. But the other thing I have in mind is also outside the Overton Window, just like this is. So I use “melt all GPUs” to talk about the requisite power level and the Overton Window problem level, both of which seem around the right levels to me, but the actual thing I have in mind is more alignable; and this way, I can reply to anyone who says “How dare you?!” by saying “Don’t worry, I don’t actually plan on doing that.”
It’s obviously a super core question; there’s no point aligning your AGI if someone else just builds unaligned AGI a few months later and kills everyone. The “alignment problem” humanity has as its urgent task is exactly the problem of aligning cognitive work that can be leveraged to prevent the proliferation of tech that destroys the world. Once you solve that, humanity can afford to take as much time as it needs to solve everything else.
The “alignment problem” humanity has as its urgent task is exactly the problem of aligning cognitive work that can be leveraged to prevent the proliferation of tech that destroys the world. Once you solve that, humanity can afford to take as much time as it needs to solve everything else.
OK, I disagree very much with that strategy. You’re basically saying, your aim is not to design ethical/friendly/aligned AI, you’re saying your aim is to design AI that can take over the world without killing anyone. Then once that is accomplished, you’ll settle down to figure out how that unlimited power would best be used.
To put it another way: Your optimistic scenario is one in which the organization that first achieves AGI uses it to take over the world, install a benevolent interim regime that monopolizes access to AGI without itself making a deadly mistake, and which then eventually figures out how to implement CEV (for example); and then it’s finally safe to have autonomous AGI.
I have a different optimistic scenario: We definitively figure out the theory of how to implement CEV before AGI even arises, and then spread that knowledge widely, so that whoever it is in the world that first achieves AGI, they will already know what they should do with it.
Both these scenarios are utopian in different ways. The first one says that flawed humans can directly wield superintelligence for a protracted period without screwing things up. The second one says that flawed humans can fully figure out how to safely wield superintelligence before it even arrives.
Meanwhile, in reality, we’ve already proceeded an unknown distance up the curve towards superintelligence, but none of the organizations leading the way has much of a plan for what happens, if their creations escape their control.
In this situation, I say that people whose aim is to create ethical/friendly/aligned superintelligence, should focus on solving that problem. Leave the techno-military strategizing to the national security elites of the world. It’s not a topic that you can avoid completely, but in the end it’s not your job to figure out how mere humans can safely and humanely wield superhuman power. It’s your job to design an autonomous superhuman power that is intrinsically safe and humane. To that end we have CEV, we have June Ku’s work, and more. We should be focusing there, while remaining engaged with the developments in mainstream AI, like language models. That’s my manifesto.
You’re basically saying, your aim is not to design ethical/friendly/aligned AI [...]
My goal is an awesome, eudaimonistic long-run future. To get there, I strongly predict that you need to build AGI that is fully aligned with human values. And to build that, I strongly predict that you need to have decades of experience actually working with AGI, since early generations of systems will inevitably have bugs and limitations and it would be catastrophic to lock in the wrong future because we did a rush job.
(I’d also expect us to need the equivalent of subjective centuries of further progress on understanding stuff like “how human brains encode morality”, “how moral reasoning works”, etc.)
If it’s true that you need decades of working experience with AGI (and solutions to moral philosophy, psychology, etc.) to pull off CEV, then something clearly needs to happen to prevent humanity from destroying itself in those intervening decades.
I don’t like the characterization “your aim is not to design ethical/friendly/aligned AI”, because it’s picking an arbitrary cut-off for which parts of the plan count as my “aim”, and because it makes it sound like I’m trying to build unethical, unfriendly, unaligned AI instead. Rather, I think alignment is hard and we need a lot of time (including a lot of time with functioning AGIs) to have a hope of solving the maximal version of the problem. Which inherently requires humanity to do something about that dangerous “we can build AGI but not CEV-align it” time window.
I don’t think the best solution to that problem is for the field to throw up their hands and say “we’re scientists, it’s not our job to think about practicalities like that” and hope someone else takes care of it. We’re human beings, not science-bots; we should use our human intelligence to think about which course of action is likeliest to produce good outcomes, and do that.
[...] I have a different optimistic scenario: We definitively figure out the theory of how to implement CEV before AGI even arises, and then spread that knowledge widely, so that whoever it is in the world that first achieves AGI, they will already know what they should do with it. [...]
How long are your AGI timelines? I could imagine endorsing a plan like that if I were confident AGI is 200+ years away; but in fact I think it’s very unlikely to even be 100 years away, and my probability is mostly on scenarios like “it’s 8 years away” or “it’s 25 years away”.
I do agree that we’re likelier to see better outcomes if alignment knowledge is widespread, rather than being concentrated at a few big orgs. (All else equal, anyway. E.g., you might not want to do this if it somehow shortens timelines a bunch.)
But the kind of alignment knowledge I think matters here is primarily strawberry-grade alignment. It’s good if people widely know about things like CEV, but I wouldn’t advise a researcher to spend their 2022 working on advancing abstract CEV theory instead of advancing strawberry-grade alignment, if they’re equally interested in both problems and capable of working on either.
[...] To put it another way: Your optimistic scenario is one in which the organization that first achieves AGI uses it to take over the world, install a benevolent interim regime that monopolizes access to AGI without itself making a deadly mistake, and which then eventually figures out how to implement CEV (for example); and then it’s finally safe to have autonomous AGI. [...]
Talking about “taking over the world” strikes me as inviting a worst-argument-in-the-world style of reasoning. All the past examples of “taking over the world” weren’t cases where there’s some action A such that:
if no one does A, then all humans die and the future’s entire value is lost.
by comparison, it doesn’t matter much to anyone which actor does A; everyone stands to personally gain or lose a lot based on whether A is done, but they accrue similar value regardless of which actor does it. (Because there are vastly more than enough resources in the universe for everyone. The notion that this is a zero-sum conflict to grab a scarce pot of gold is calibrated to a very different world than the “ASI exists” world.)
doing A doesn’t necessarily mean that your idiosyncratic values will play a larger role in shaping the long-term future than anyone else’s, and in fact you’re bought into a specific plan aimed at preventing this outcome. (Because CEV, no-pot-of-gold, etc.)
I do think there are serious risks and moral hazards associated with a transition to that state of affairs. (I think this regardless of whether it’s a government or a private actor or an intergovernmental collaboration or whatever that’s running the task AGI.)
But I think it’s better for humanity to try to tackle those risks and moral hazards, than for humanity to just give up and die? And I haven’t heard a plausible-sounding plan for what humanity ought to do instead of addressing AGI proliferation somehow.
[...] you’re saying your aim is to design AI that can take over the world without killing anyone. Then once that is accomplished, you’ll settle down to figure out how that unlimited power would best be used. [...]
The ‘rush straight to CEV’ plan is exactly the same, except without the “settling down to figure out” part. Rushing straight to CEV isn’t doing any less ‘grabbing the world’s steering wheel’; it’s just taking less time to figure out which direction to go, before setting off.
This is the other reason it’s misleading to push on “taking over the world” noncentral fallacies here. Neither the rush-to-CEV plan nor the strawberries-followed-by-CEV plan is very much like people’s central prototypes for what “taking over the world” looks like (derived from the history of warfare or from Hollywood movies or what-have-you).
I’m tempted to point out that “rush-to-CEV” is more like “taking over the world” in many ways than “strawberries-followed-by-CEV” is. (Especially if “strawberries-followed-by-CEV” includes a step where the task-AGI operators engage in real, protracted debate and scholarly inquiry with the rest of the world to attempt to reach some level of consensus about whether CEV is a good idea, which version of CEV is best, etc.)
But IMO it makes more sense to just not go down the road of arguing about connotations, given that our language and intuitions aren’t calibrated to this totally-novel situation.
The first one says that flawed humans can directly wield superintelligence for a protracted period without screwing things up. The second one says that flawed humans can fully figure out how to safely wield superintelligence before it even arrives.
There’s clearly some length of time such that the cost of waiting that long to implement CEV outweighs the benefits. I think those are mostly costs of losing negentropy in the universe at large, though (as stars burn their fuel and/or move away from us via expansion), not costs like ‘the AGI operators get corrupted or make some major irreversible misstep because they waited an extra five years too long’.
I don’t know why you think the corruption/misstep risk of “waiting for an extra three years before running CEV” (for example) is larger than the ‘we might implement CEV wrong’ risk of rushing to implement CEV after zero years of tinkering with working AGI systems.
It seems like the sensible thing to do in this situation is to hope for the best, but plan for realistic outcomes that fall short of “the best”:
Realistically, there’s a strong chance (I would say: overwhelmingly strong) that we won’t be able to fully solve CEV before AGI arrives. So since our options in that case will be “strawberry-grade alignment, or just roll over and die”, let’s start by working on strawberry-grade alignment. Once we solve that problem, sure, we can shift resources into CEV. If you’re optimistic about ‘rush to CEV’, then IMO you should be even more optimistic that we can nail down strawberry alignment fast, at which point we should have made a lot of headway toward CEV alignment without gambling the whole future on our getting alignment perfect immediately and on the first try.
Likewise, realistically, there’s a strong chance (I would say overwhelming) that there will be some multi-year period where humanity can build AGI, but isn’t yet able to maximally align it. It would be good if we don’t just roll over and die in those worlds; so while we might hope for there to be no such period, we should make plans that are robust to such a period occurring.
There’s nothing about the strawberry plan that requires waiting, if it’s not net-beneficial to do so. You can in fact execute a ‘no one else can destroy the world with AGI’ pivotal act, start working on CEV, and then surprise yourself with how fast CEV falls into place and just go implement that in relatively short order.
What strawberry-ish actions do is give humanity the option of waiting. I think we’ll desperately need this option, but even if you disagree, I don’t think you should consider it net-negative to have the option available in the first place.
Meanwhile, in reality, we’ve already proceeded an unknown distance up the curve towards superintelligence, but none of the organizations leading the way has much of a plan for what happens, if their creations escape their control.
I agree with this.
In this situation, I say that people whose aim is to create ethical/friendly/aligned superintelligence, should focus on solving that problem. Leave the techno-military strategizing to the national security elites of the world. It’s not a topic that you can avoid completely, but in the end it’s not your job to figure out how mere humans can safely and humanely wield superhuman power. It’s your job to design an autonomous superhuman power that is intrinsically safe and humane.
This is the place where I say: “I don’t think the best solution to that problem is for the field to throw up their hands and say ‘we’re scientists, it’s not our job to think about practicalities like that’ and hope someone else takes care of it. We’re human beings, not science-bots; we should use our human intelligence to think about which course of action is likeliest to produce good outcomes, and do that.”
We aren’t helpless victims of our social roles, who must look at Research Path A and Research Path B and go, “Hmm, there’s a strategic consideration that says Path A is much more likely to save humanity than Path B… but thinking about practical strategic considerations is the kind of thing people wearing military uniforms do, not the kind of thing people wearing lab coats do.” You don’t have to do the non-world-saving thing just because it feels more normal or expected-of-your-role. You can actually just do the thing that makes sense.
(Again, maybe you disagree with me about what makes sense to do here. But the debate should be had at the level of ‘which factual beliefs suggest that A versus B is likeliest to produce a flourishing long-run future?’, not at the level of ‘is it improper for scientists to update on information that sounds more geopolitics-y than linear-algebra-y?’. The real world doesn’t care about literary genres and roles; it’s all just facts, and disregarding a big slice of the world’s facts will tend to produce worse decisions.)
Very short. Longer timelines are logically possible, but I wouldn’t count on them.
As for this notion that something like CEV might require decades of thought to be figured out, or even might require decades of trial and error with AGI—that’s just a guess. I may be monotonous by saying June Ku over and over again (there are others whose work I intend to study too), but metaethical.ai is an extremely promising schema. If a serious effort was made to fill out that schema, while also critically but constructively examining its assumptions from all directions, who knows how far we’d get, and how quickly?
Another argument for shorter CEV timelines, is that AI itself may help complete the theory of CEV alignment. Along with the traditional powers of computation—calculation, optimization, deduction, etc—language models, despite their highly uneven output, are giving us a glimpse of what it will be like, to have AI contributing even to discussions like this. That day isn’t far off at all.
So from my perspective, long CEV timelines don’t actually seem likely. The other thing I have great doubts about is the stability of any world order in which a handful of humans (even if it were the NSA or the UN Security Council) use tool AGI to prevent everyone else from developing unsafe AGI. Targeting just one thing like GPUs won’t work forever, because you can do computation in other ways; and there will be great temptations to use tool AGI to carry out interventions that have nothing to do with stopping unsafe AGI… Anyone in such a position becomes a kind of world government.
The problem of “world government” or “what the fate of the world should be”, is something that CEV is meant to solve comprehensively, by providing an accurate first-principles extrapolation of humanity’s true volition, etc. But here, the scenario is an AGI-powered world takeover where the problems of governance and normativity have not been figured out. I’m not at all opposed to thinking about such scenarios; the next chapter of human affairs may indeed be one in which autonomous superhuman AI does not yet exist, but there are human elites possessing tool AGI. I just think that’s a highly unstable situation; making it stable and safe for humans would be hard to do without having something like CEV figured out; and because of the input of AI itself, one shouldn’t expect figuring out CEV to take a long time. I propose that it will be done relatively quickly, or not at all.
Another argument for shorter CEV timelines, is that AI itself may help complete the theory of CEV alignment.
I agree with this part. That’s why I’ve been saying ‘maybe we can do this in a few subjective decades or centuries’ rather than ‘maybe we can do this in a few subjective millennia.’ 🙂
But I’m mostly imagining AGI helping us get CEV theory faster. Which obviously requires a lot of prior alignment just to make use of the AGI safely, and to trust its outputs.
The idea is to keep ratcheting up alignment so we can safely make use of more capabilities—and then, in at least some cases, using those new capabilities to further improve and accelerate our next ratcheting-up of alignment.
Along with the traditional powers of computation—calculation, optimization, deduction, etc—language models, despite their highly uneven output, are giving us a glimpse of what it will be like, to have AI contributing even to discussions like this. That day isn’t far off at all.
… And that makes you feel optimistic about the rush-to-CEV option? ‘Unaligned AGIs or proto-AGIs generating plausible-sounding arguments about how to do CEV’ is not a scenario that makes me update toward humanity surviving.
there will be great temptations to use tool AGI to carry out interventions that have nothing to do with stopping unsafe AGI...
I share your pessimism about any group that would feel inclined to give in to those temptations, when the entire future light cone is at stake.
The scenario where we narrowly avoid paperclips by the skin of our teeth, and now have a chance to breathe and think things through before taking any major action, is indeed a fragile one in some respects, where there are many ways to rapidly destroy all of the future’s value by overextending. (E.g., using more AGI capabilities than you can currently align, or locking in contemporary human values that shouldn’t be locked in, or hastily picking the wrong theory or implementation of ‘how to do moral progress’.)
I don’t think we necessarily disagree about anything except ‘how hard is CEV’? It sounds to me like we’d mostly have the same intuitions conditional on ‘CEV is very hard’; but I take this very much for granted, so I’m freely focusing my attention on ‘OK, how could we make things go well given that fact?’.
I don’t think we necessarily disagree about anything except ‘how hard is CEV’? It sounds to me like we’d mostly have the same intuitions conditional on ‘CEV is very hard’
I disagree on the plausibility of a stop-the-world scheme. The way things are now is as safe or stable as they will ever be. I think it’s a better plan to use the rising tide of AI capabilities to flesh out CEV. In particular, since the details of CEV depend on not-yet-understood details of human cognition, one should think about how to use near-future AI power to extract those details from the available data regarding brains and behavior. But AI can contribute everywhere else in the development of CEV theory and practice too.
But I won’t try to change your thinking. In practice, MIRI’s work on simpler forms of alignment (and its work on logical induction, decision theory paradigms, etc) is surely relevant to a “CEV now” approach too. What I am wondering is, where is the de-facto center of gravity for a “CEV now” effort? I’ve said many times that June Ku’s blueprint is the best I’ve seen, but I don’t see anyone working to refine it. And there are other people whose work seems relevant and has a promising rigor, but I’m not sure how it best fits together.
edit: I do see a value to non-CEV alignment work, distinct from figuring out how to stop the world safely: and that is to reduce the risk arising from the general usage of advanced AI systems. So it is a contribution to AI safety.
CEV seems much much more difficult than strawberry alignment and I have written it off as a potential option for a baby’s first try at constructing superintelligence.
To be clear, I also expect that strawberry alignment is too hard for these babies and we’ll just die. But things can always be even more difficult, and with targeting CEV on a first try, it sure would be.
There’s zero room, there is negative room, to give away to luxury targets like CEV. They’re not even going to be able to do strawberry alignment, and if by some miracle we were able to do strawberry alignment and so humanity survived, that miracle would not suffice to get CEV right on the first try.
I highly doubt anything in this universe could be ‘intrinsically’ safe and humane. I’m not sure what such an object would even look like or how we could verify that.
Even purpose-built systems for deep-pocketed customers, performing very simple operations in a completely controlled environment (such as a car manufacturer’s assembly line with a robot arm assigned solely to tightening nuts), are not presumed to be 100% safe. That’s why they have multiple layers of interlocks, panic buttons, etc.
And that’s in ideal conditions, for something millions of times simpler than a superhuman AGI.
Perhaps before even the strawberry, it would be interesting to consider the difficulty of the movable arm itself.
A useful distinction. Yet of the rare outcomes that follow current timelines without ending in ruin, I expect the most likely one falls into neither category. Instead it’s an AGI that behaves like a weird supersmart human that bootstrapped its humanity from language models with relatively little architecture support (for alignment), as a result of a theoretical miracle where things like that are the default outcome. Possibly from giving a language model autonomy/agency to debug its thinking while having notebooks and a working memory, tuning the model in the process. It’s not going to reliably do as it’s told, could be deceptive, yet possibly doesn’t turn everything into paperclips. Arguably it’s aligned, but only the way weird individual humans are aligned, which is noncentrally strawberry-aligned, and too-indirectly-to-use-the-term CEV-aligned.
No. If we’d focused on CEV-grade alignment over strawberry-grade alignment, we’d be in even worse shape if anything.
The problem is that timelines look short, so it’s looking more difficult to figure out strawberry alignment in time to prevent human extinction. We should nonetheless make strawberry alignment humanity’s top priority, and put an enormous effort into it, because there isn’t a higher-probability path to good outcomes. (AFAICT, anyway. Having at least some people try to prove me wrong here obviously seems worthwhile too.)
CEV alignment is even harder than strawberry alignment (by a large margin), so short timelines are much more of a problem for the ‘rush straight to CEV alignment’ plan than for the ‘do strawberry alignment first, then CEV afterwards’ plan.
The “stable period” is supposed to be a period in which AGI already exists, but nothing like CEV has yet been implemented, and yet “no one can destroy the world with AGI”. How would that work? How do you prevent everyone in the whole wide world from developing unsafe AGI during the stable period?
I think there are multiple viable options, like the toy example EY uses:
Use strawberry alignment to melt all the computing clusters containing more than 4 GPUs. (Not actually the best thing to do with strawberry alignment, IMO, but anything you can do here is outside the Overton Window, so I picked something of which I could say “Oh but I wouldn’t actually do that” if pressed.)
It’s obviously a super core question; there’s no point aligning your AGI if someone else just builds unaligned AGI a few months later and kills everyone. The “alignment problem” humanity has as its urgent task is exactly the problem of aligning cognitive work that can be leveraged to prevent the proliferation of tech that destroys the world. Once you solve that, humanity can afford to take as much time as it needs to solve everything else.
OK, I disagree very much with that strategy. You’re basically saying, your aim is not to design ethical/friendly/aligned AI, you’re saying your aim is to design AI that can take over the world without killing anyone. Then once that is accomplished, you’ll settle down to figure out how that unlimited power would best be used.
To put it another way: Your optimistic scenario is one in which the organization that first achieves AGI uses it to take over the world, install a benevolent interim regime that monopolizes access to AGI without itself making a deadly mistake, and which then eventually figures out how to implement CEV (for example); and then it’s finally safe to have autonomous AGI.
I have a different optimistic scenario: We definitively figure out the theory of how to implement CEV before AGI even arises, and then spread that knowledge widely, so that whoever it is in the world that first achieves AGI, they will already know what they should do with it.
Both these scenarios are utopian in different ways. The first one says that flawed humans can directly wield superintelligence for a protracted period without screwing things up. The second one says that flawed humans can fully figure out how to safely wield superintelligence before it even arrives.
Meanwhile, in reality, we’ve already proceeded an unknown distance up the curve towards superintelligence, but none of the organizations leading the way has much of a plan for what happens if their creations escape their control.
In this situation, I say that people whose aim is to create ethical/friendly/aligned superintelligence, should focus on solving that problem. Leave the techno-military strategizing to the national security elites of the world. It’s not a topic that you can avoid completely, but in the end it’s not your job to figure out how mere humans can safely and humanely wield superhuman power. It’s your job to design an autonomous superhuman power that is intrinsically safe and humane. To that end we have CEV, we have June Ku’s work, and more. We should be focusing there, while remaining engaged with the developments in mainstream AI, like language models. That’s my manifesto.
My goal is an awesome, eudaimonistic long-run future. To get there, I strongly predict that you need to build AGI that is fully aligned with human values. To get there, I strongly predict that you need to have decades of experience actually working with AGI, since early generations of systems will inevitably have bugs and limitations and it would be catastrophic to lock in the wrong future because we did a rush job.
(I’d also expect us to need the equivalent of subjective centuries of further progress on understanding stuff like “how human brains encode morality”, “how moral reasoning works”, etc.)
If it’s true that you need decades of working experience with AGI (and solutions to moral philosophy, psychology, etc.) to pull off CEV, then something clearly needs to happen to prevent humanity from destroying itself in those intervening decades.
I don’t like the characterization “your aim is not to design ethical/friendly/aligned AI”, because it’s picking an arbitrary cut-off for which parts of the plan count as my “aim”, and because it makes it sound like I’m trying to build unethical, unfriendly, unaligned AI instead. Rather, I think alignment is hard and we need a lot of time (including a lot of time with functioning AGIs) to have a hope of solving the maximal version of the problem. Which inherently requires humanity to do something about that dangerous “we can build AGI but not CEV-align it” time window.
I don’t think the best solution to that problem is for the field to throw up their hands and say “we’re scientists, it’s not our job to think about practicalities like that” and hope someone else takes care of it. We’re human beings, not science-bots; we should use our human intelligence to think about which course of action is likeliest to produce good outcomes, and do that.
How long are your AGI timelines? I could imagine endorsing a plan like that if I were confident AGI is 200+ years away; but in fact I think it’s very unlikely to even be 100 years away, and my probability is mostly on scenarios like “it’s 8 years away” or “it’s 25 years away”.
I do agree that we’re likelier to see better outcomes if alignment knowledge is widespread, rather than being concentrated at a few big orgs. (All else equal, anyway. E.g., you might not want to do this if it somehow shortens timelines a bunch.)
But the kind of alignment knowledge I think matters here is primarily strawberry-grade alignment. It’s good if people widely know about things like CEV, but I wouldn’t advise a researcher to spend their 2022 working on advancing abstract CEV theory instead of advancing strawberry-grade alignment, if they’re equally interested in both problems and capable of working on either.
Talking about “taking over the world” strikes me as inviting a worst-argument-in-the-world style of reasoning. All the past examples of “taking over the world” weren’t cases where there’s some action A such that:
if no one does A, then all humans die and the future’s entire value is lost.
by comparison, it doesn’t matter much to anyone who does A; everyone stands to personally gain or lose a lot based on whether A is done, but they accrue similar value regardless of which actor does A. (Because there are vastly more than enough resources in the universe for everyone. The notion that this is a zero-sum conflict to grab a scarce pot of gold is calibrated to a very different world than the “ASI exists” world.)
doing A doesn’t necessarily mean that your idiosyncratic values will play a larger role in shaping the long-term future than anyone else’s, and in fact you’re bought into a specific plan aimed at preventing this outcome. (Because CEV, no-pot-of-gold, etc.)
I do think there are serious risks and moral hazards associated with a transition to that state of affairs. (I think this regardless of whether it’s a government or a private actor or an intergovernmental collaboration or whatever that’s running the task AGI.)
But I think it’s better for humanity to try to tackle those risks and moral hazards, than for humanity to just give up and die? And I haven’t heard a plausible-sounding plan for what humanity ought to do instead of addressing AGI proliferation somehow.
The ‘rush straight to CEV’ plan is exactly the same, except without the “settling down to figure out” part. Rushing straight to CEV isn’t doing any less ‘grabbing the world’s steering wheel’; it’s just taking less time to figure out which direction to go, before setting off.
This is the other reason it’s misleading to push on “taking over the world” noncentral fallacies here. Neither the rush-to-CEV plan nor the strawberries-followed-by-CEV plan is very much like people’s central prototypes for what “taking over the world” looks like (derived from the history of warfare or from Hollywood movies or what-have-you).
I’m tempted to point out that “rush-to-CEV” is more like “taking over the world” in many ways than “strawberries-followed-by-CEV” is. (Especially if “strawberries-followed-by-CEV” includes a step where the task-AGI operators engage in real, protracted debate and scholarly inquiry with the rest of the world to attempt to reach some level of consensus about whether CEV is a good idea, which version of CEV is best, etc.)
But IMO it makes more sense to just not go down the road of arguing about connotations, given that our language and intuitions aren’t calibrated to this totally-novel situation.
There’s clearly some length of time such that the cost of waiting that long to implement CEV outweighs the benefits. I think those are mostly costs of losing negentropy in the universe at large, though (as stars burn their fuel and/or move away from us via expansion), not costs like ‘the AGI operators get corrupted or make some major irreversible misstep because they waited an extra five years too long’.
I don’t know why you think the corruption/misstep risk of “waiting for an extra three years before running CEV” (for example) is larger than the ‘we might implement CEV wrong’ risk of rushing to implement CEV after zero years of tinkering with working AGI systems.
It seems like the sensible thing to do in this situation is to hope for the best, but plan for realistic outcomes that fall short of “the best”:
Realistically, there’s a strong chance (I would say: overwhelmingly strong) that we won’t be able to fully solve CEV before AGI arrives. So since our options in that case will be “strawberry-grade alignment, or just roll over and die”, let’s start by working on strawberry-grade alignment. Once we solve that problem, sure, we can shift resources into CEV. If you’re optimistic about ‘rush to CEV’, then IMO you should be even more optimistic that we can nail down strawberry alignment fast, at which point we should have made a lot of headway toward CEV alignment without gambling the whole future on our getting alignment perfect immediately and on the first try.
Likewise, realistically, there’s a strong chance (I would say overwhelming) that there will be some multi-year period where humanity can build AGI, but isn’t yet able to maximally align it. It would be good if we don’t just roll over and die in those worlds; so while we might hope for there to be no such period, we should make plans that are robust to such a period occurring.
There’s nothing about the strawberry plan that requires waiting, if it’s not net-beneficial to do so. You can in fact execute a ‘no one else can destroy the world with AGI’ pivotal act, start working on CEV, and then surprise yourself with how fast CEV falls into place and just go implement that in relatively short order.
What strawberry-ish actions do is give humanity the option of waiting. I think we’ll desperately need this option, but even if you disagree, I don’t think you should consider it net-negative to have the option available in the first place.
I agree with this.
This is the place where I say: “I don’t think the best solution to that problem is for the field to throw up their hands and say ‘we’re scientists, it’s not our job to think about practicalities like that’ and hope someone else takes care of it. We’re human beings, not science-bots; we should use our human intelligence to think about which course of action is likeliest to produce good outcomes, and do that.”
We aren’t helpless victims of our social roles, who must look at Research Path A and Research Path B and go, “Hmm, there’s a strategic consideration that says Path A is much more likely to save humanity than Path B… but thinking about practical strategic considerations is the kind of thing people wearing military uniforms do, not the kind of thing people wearing lab coats do.” You don’t have to do the non-world-saving thing just because it feels more normal or expected-of-your-role. You can actually just do the thing that makes sense.
(Again, maybe you disagree with me about what makes sense to do here. But the debate should be had at the level of ‘which factual beliefs suggest that A versus B is likeliest to produce a flourishing long-run future?’, not at the level of ‘is it improper for scientists to update on information that sounds more geopolitics-y than linear-algebra-y?’. The real world doesn’t care about literary genres and roles; it’s all just facts, and disregarding a big slice of the world’s facts will tend to produce worse decisions.)
Very short. Longer timelines are logically possible, but I wouldn’t count on them.
As for this notion that something like CEV might require decades of thought to be figured out, or even might require decades of trial and error with AGI: that’s just a guess. I may be monotonous in saying June Ku over and over again (there are others whose work I intend to study too), but metaethical.ai is an extremely promising schema. If a serious effort were made to fill out that schema, while also critically but constructively examining its assumptions from all directions, who knows how far we’d get, and how quickly?
Another argument for shorter CEV timelines is that AI itself may help complete the theory of CEV alignment. Along with the traditional powers of computation (calculation, optimization, deduction, etc.), language models, despite their highly uneven output, are giving us a glimpse of what it will be like to have AI contributing even to discussions like this. That day isn’t far off at all.
So from my perspective, long CEV timelines don’t actually seem likely. The other thing that I have great doubts about is the stability of any world order in which a handful of humans (even if it were the NSA or the UN Security Council) use tool AGI to prevent everyone else from developing unsafe AGI. Targeting just one thing like GPUs won’t work forever because you can do computation in other ways; there will be great temptations to use tool AGI to carry out interventions that have nothing to do with stopping unsafe AGI… Anyone in such a position becomes a kind of world government.
The problem of “world government” or “what the fate of the world should be” is something that CEV is meant to solve comprehensively, by providing an accurate first-principles extrapolation of humanity’s true volition, etc. But here, the scenario is an AGI-powered world takeover where the problems of governance and normativity have not been figured out. I’m not at all opposed to thinking about such scenarios; the next chapter of human affairs may indeed be one in which autonomous superhuman AI does not yet exist, but there are human elites possessing tool AGI. I just think that’s a highly unstable situation; making it stable and safe for humans would be hard to do without having something like CEV figured out; and because of the input of AI itself, one shouldn’t expect figuring out CEV to take a long time. I propose that it will be done relatively quickly, or not at all.
I agree with this part. That’s why I’ve been saying ‘maybe we can do this in a few subjective decades or centuries’ rather than ‘maybe we can do this in a few subjective millennia.’ 🙂
But I’m mostly imagining AGI helping us get CEV theory faster. Which obviously requires a lot of prior alignment just to make use of the AGI safely, and to trust its outputs.
The idea is to keep ratcheting up alignment so we can safely make use of more capabilities—and then, in at least some cases, using those new capabilities to further improve and accelerate our next ratcheting-up of alignment.
… And that makes you feel optimistic about the rush-to-CEV option? ‘Unaligned AGIs or proto-AGIs generating plausible-sounding arguments about how to do CEV’ is not a scenario that makes me update toward humanity surviving.
I share your pessimism about any group that would feel inclined to give in to those temptations, when the entire future light cone is at stake.
The scenario where we narrowly avoid paperclips by the skin of our teeth, and now have a chance to breathe and think things through before taking any major action, is indeed a fragile one in some respects, where there are many ways to rapidly destroy all of the future’s value by overextending. (E.g., using more AGI capabilities than you can currently align, or locking in contemporary human values that shouldn’t be locked in, or hastily picking the wrong theory or implementation of ‘how to do moral progress’.)
I don’t think we necessarily disagree about anything except ‘how hard is CEV’? It sounds to me like we’d mostly have the same intuitions conditional on ‘CEV is very hard’; but I take this very much for granted, so I’m freely focusing my attention on ‘OK, how could we make things go well given that fact?’.
I disagree on the plausibility of a stop-the-world scheme. Things are now as safe or stable as they will ever be. I think it’s a better plan to use the rising tide of AI capabilities to flesh out CEV. In particular, since the details of CEV depend on not-yet-understood details of human cognition, one should think about how to use near-future AI power to extract those details from the available data regarding brains and behavior. But AI can contribute everywhere else in the development of CEV theory and practice too.
But I won’t try to change your thinking. In practice, MIRI’s work on simpler forms of alignment (and its work on logical induction, decision theory paradigms, etc) is surely relevant to a “CEV now” approach too. What I am wondering is, where is the de-facto center of gravity for a “CEV now” effort? I’ve said many times that June Ku’s blueprint is the best I’ve seen, but I don’t see anyone working to refine it. And there are other people whose work seems relevant and has a promising rigor, but I’m not sure how it best fits together.
Edit: I do see value in non-CEV alignment work, distinct from figuring out how to stop the world safely: it reduces the risk arising from the general use of advanced AI systems. So it is a contribution to AI safety.
CEV seems much much more difficult than strawberry alignment and I have written it off as a potential option for a baby’s first try at constructing superintelligence.
To be clear, I also expect that strawberry alignment is too hard for these babies and we’ll just die. But things can always be even more difficult, and with targeting CEV on a first try, it sure would be.
There’s zero room, there is negative room, to give away to luxury targets like CEV. They’re not even going to be able to do strawberry alignment, and if by some miracle we were able to do strawberry alignment and so humanity survived, that miracle would not suffice to get CEV right on the first try.
I highly doubt anything in this universe could be ‘intrinsically’ safe and humane. I’m not sure what such an object would even look like or how we could verify that.
Even purposefully designed safety systems, built for deep-pocketed customers and covering very simple operations in a completely controlled environment (such as a car manufacturer’s assembly line with a robot arm assigned solely to tightening nuts), are not presumed to be 100% safe. That’s why they have multiple layers of interlocks, panic buttons, etc.
And that’s for something millions of times simpler than a superhuman AGI in ideal conditions.
Perhaps before even the strawberry, it would be interesting to consider the difficulty of the movable arm itself.
A useful distinction. Yet of the rare outcomes that follow current timelines without ending in ruin, I expect the most likely one falls into neither category. Instead it’s an AGI that behaves like a weird supersmart human that bootstrapped its humanity from language models with relatively little architecture support (for alignment), as a result of a theoretical miracle where things like that are the default outcome. Possibly from giving a language model autonomy/agency to debug its thinking while having notebooks and a working memory, tuning the model in the process. It’s not going to reliably do as it’s told, could be deceptive, yet possibly doesn’t turn everything into paperclips. Arguably it’s aligned, but only the way weird individual humans are aligned, which is noncentrally strawberry-aligned, and too-indirectly-to-use-the-term CEV-aligned.