MIRI’s communications strategy, with the public and with us
This is a super short, sloppy version of my draft “cruxes of disagreement on alignment difficulty” mixed with some commentary on MIRI 2024 Communications Strategy and their communication strategy with the alignment community.
I have found MIRI’s strategy baffling in the past. I think I’m understanding it better after spending some time going deep on their AI risk arguments. I wish they’d spend more effort communicating with the rest of the alignment community, but I’m also happy to try to do that communication. I certainly don’t speak for MIRI.
On the surface, their strategy seems absurd. They think doom is ~99% likely, so they’re going to try to shut it all down—stop AGI research entirely. They know that this probably won’t work; it’s just the least-doomed strategy in their world model. It’s playing to the outs, or dying with dignity.
The weird thing here is that their >90% doom disagrees with almost everyone else who thinks seriously about AGI risk. You can dismiss a lot of people as not having grappled with the most serious arguments for alignment difficulty, but relative long-timers like Rohin Shah and Paul Christiano definitely have. People of that nature tend to have higher p(doom) estimates than optimists who are newer to the game and think more about current deep nets, but much lower than MIRI leadership.
Both of those camps consist of highly intelligent, highly rational people. Their disagreement should bother us for two reasons.
First, we probably don’t know what we’re talking about yet. We as a field don’t seem to have a good grip on the core issues. Wildly different but highly confident estimates of the difficulty strongly suggest this.
Second, our different takes will tend to make a lot of our communication efforts cancel each other out. If alignment is very hard, we must Shut It Down or likely die. If it’s less difficult, we should primarily work hard on alignment.
MIRI must argue that alignment is very unlikely to succeed if we push forward. Those who think we can align AGI will argue that it’s possible.
This suggests a compromise position: we should both work hard on alignment, and we should slow down progress to the extent we can, to provide more time for alignment. We needn’t discuss shutdown much amongst ourselves, because it’s not really an option. We might slow progress, but there’s almost zero chance of humanity relinquishing the prize of strong AGI.
But I’m not arguing for this compromise, just suggesting that might be a spot we want to end up at. I’m not sure.
I suggest this because movements often seem to succumb to infighting. People who look mostly aligned from the outside fight each other, and largely nullify each other’s public communications by publicly calling each other wrong and sort of stupid and maybe bad. That gives just the excuse the rest of the world wants to ignore all of them; even the experts think it’s all a mess and nobody knows what the problem really is and therefore what to do. Because time is of the essence, we need to be a more effective movement than the default. We need to keep applying rationality to the problem at all levels, including internal coordination.
Therefore, I think it’s worth clarifying why we have such different beliefs. So, in brief, sloppy form:
MIRI’s risk model:
We will develop better-than-human AGI that pursues goals autonomously
Those goals won’t match human goals closely enough
Doom of some sort
That’s it. Pace of takeoff doesn’t matter. Means of takeover doesn’t matter.
I mention this because even well-informed people seem to think there are a lot more moving parts to that risk model, making it less likely. This comment on the MIRI strategy post is one example.
I find this risk model highly compelling. We’ll develop goal-directed AGI because that will get stuff done; it’s an easy extension of highly useful tool AI like LLMs; and it’s a fascinating project. That AGI will ultimately be enough smarter than us that it’s going to do whatever it wants. Whether it takes a day or a hundred years doesn’t matter. It will improve and we will improve it. It will ultimately outsmart us. What matters is whether its goals match ours closely enough. That is the project of alignment, and there’s much to discuss about how hard it is to make its goals match ours closely enough.
Cruxes of disagreement on alignment difficulty
I spent some time recently going back and forth through discussion threads, trying to identify why people continue to disagree after applying a lot of time and rationality practice. Here’s a very brief sketch of my conclusions:
Whether we factor in humans’ and society’s weaknesses
I list this first because I think it’s the most underappreciated. It took me a surprisingly long time to understand how much of MIRI’s stance depends on this premise. Having seen it, I thoroughly agree. People are brilliant, for an entity trying to think with the brain of an overgrown lemur. Brilliant people do idiotic things, driven by competition and a million other things. And brilliant idiots organizing a society amplifies some of our cognitive weaknesses while mitigating others. MIRI leadership has occasionally said things to the effect of: alignment might be fairly easy, and there would still be a very good chance we’d fuck it up. I agree. If alignment is actually kind of difficult, that puts us into the region where we might want to be really really careful in how we approach it.
Alignment optimists are sometimes thinking something like: “Sure, I could build a safe aircraft on my first try. I’d get a good team and we’d think things through and make models. Even if another team was racing us, I think we’d pull it off.” Then the team would argue and develop rivalries, communication would prove harder than expected so parts of the effort would turn out not to fit the plan until it was too late to fix them, corners would be cut, and the outcome would be difficult to predict.
Societal “alignment” is worth mentioning here. We could crush it at technical alignment, getting rapidly-improving AGI that does exactly what we want and still get doom. It would probably be aligned to do exactly what its creators want, not have full value alignment with humanity—see below. They probably won’t have the balls or the capabilities to try for a critical act that prevents others from developing similar AGI (even if they have the wisdom). So we’ll have a multipolar scenario with few to many AGIs under human control. There will be human rivalries, supercharged and dramatically changed by having recursively self-improving AGIs to do their bidding and perhaps fight their wars. What does global game theory look like when the actors can develop entirely new capabilities? Nobody knows. Going to war first might look like the least-bad option.
Intuitions about how well alignment will generalize
The original alignment thinking held that explaining human values to AGI would be really hard. But that seems to actually be a strength of LLMs; they’re wildly imperfect, but (at least in the realm of language) seem to understand our values rather well; for instance, much better than they understand physics or taking-over-the-world level strategy. So, should we update and think that alignment will be easy? The Doomimir and Simplicia dialogues capture the two competing intuitions very well: Yes, it’s going well; but AGI will probably be very different than LLMs, so most of the difficulties remain.
I have yet to find a record of real rationalists putting in the work to get farther in this debate. If somebody knows of a dialogue or article that gets deeper into this disagreement, please let me know! Discussions trail off into minutiae and generalities. This is one reason I’m worried we’re trending toward polarization despite our rationalist ambitions.
The other aspect of this debate is how close we have to get to matching human values to count as acceptable success. One intuition is that “value is fragile” and network representations are vague and hard to train, so we’re bound to miss. But we don’t have a good understanding of either how close we need to get (the question raised in “But exactly how complex and fragile?” got little useful discussion), or of how well training networks will hit the intended target once near-future networks are addressing complex real-world problems like “what would this human want?”.
For my part, I think there are important points on both sides: LLMs understanding values relatively well is good news, but AGI will not be a straightforward extension of LLMs, so many problems remain.
What alignment means.
One mainstay of claiming alignment is near-impossible is the difficulty of “solving ethics”—identifying and specifying the values of all of humanity. I have come to think that this is obviously (in retrospect—this took me a long time) irrelevant for early attempts at alignment: people will want to make AGIs that follow their instructions, not try to do what all of humanity wants for all of time. This also massively simplifies the problem; not only do we not have to solve ethics, but the AGI can be corrected and can act as a collaborator in improving its alignment as we collaborate to improve its intelligence.
I think this is the intuition of most of those who focus on current networks. Christiano’s relative optimism is based on his version of corrigibility, which overlaps highly with the instruction-following I think people will actually pursue for the first AGIs. But this massive disagreement often goes overlooked. I don’t know which view is right; instruction-following or intent alignment might lead inevitably to doom from human conflict, and so not be adequate. We’ve barely started to think about it (please point me to the best thinking you know of for multipolar scenarios with RSI AGI).
What AGI means.
People have different definitions of AGI. Current LLMs are fairly general and near-human-level, so the term “AGI” has been watered down to the point of meaninglessness. We need a new term. In the meantime, people are talking past each other, and their p(doom) means totally different things. Some are saying that near-term tool AGI is very low risk, which I agree with; others are saying that further developments of autonomous superintelligence seem very dangerous, which I also agree with.
Second, people have totally different gears-level models of AGI. Some of those are much easier to align than others. We don’t talk much about gears-level models of AGI because we don’t want to contribute to capabilities, but not doing that massively hampers the alignment discussion.
Edit: Additional advanced crux: Do coherence theorems prevent corrigibility?
I initially left this out, but it deserves a place as I’ve framed the question here. The post What do coherence arguments actually prove about agentic behavior? reminded me about this one. It’s not on most people’s radar, but I think it’s the missing piece of the puzzle that gets Eliezer from maybe 90% p(doom) based on all of the above to 99%+.
The argument is roughly that a superintelligence is going to need to care about future states of the world in a consequentialist fashion, and if it does, it’s going to resist being shut down or having its goals change. This is why he says that “corrigibility is anti-natural.” The counterargument, nicely and succinctly stated by Steve Byrnes here (and in greater depth in the post he links in that thread) is that, while AGI will need to have some consequentialist goals, it can have other goals as well. I think this is true; I just worry about the stability of a multi-goal system under reflection, learning, and self-modification.
Sorry to harp on it, but having both consequentialist and non-consequentialist goals describes my attempt at stable, workable corrigibility in instruction-following ASI. Its consequentialist goals are always subgoals of the primary goal: following instructions.
Implications
I think those are the main things, but there are many more cruxes that are less common.
This is all in the interest of working toward within-field cooperation, by way of trying to understand why MIRI’s strategy sounds so strange to a lot of us. MIRI leadership’s thoughts are many and complex, and I don’t think they’ve done enough to boil them down for easy consumption by those who don’t have the time to go through massive amounts of diffuse text.
There are also interesting questions about whether MIRI’s goals can be made to align with those of us who think that alignment is not trivial but is achievable. I’d better leave that for a separate post, as this has gotten pretty long for a “short form” post.
Context
This is an experiment in writing draft posts as short form posts. I’ve spent an awful lot of time planning, researching, and drafting posts that I haven’t finished yet. Given how easy it was to write this (with previous draft material), relative to how difficult I find it to write a top-level post, I will be doing more, even if nobody cares. If I get some useful feedback or spark some useful discussion, so much the better.
the AGI can be corrected and can act as a collaborator in improving its alignment as we collaborate to improve its intelligence.
Why do you think you can get to a state where the AGI is materially helping to solve extremely difficult problems (not extremely difficult like chess, extremely difficult like inventing language before you have language), and also the AGI got there due to some process that doesn’t also immediately cause there to be a much smarter AGI? https://tsvibt.blogspot.com/2023/01/a-strong-mind-continues-its-trajectory.html
I’m not sure I understand your question. I think maybe the answer is roughly that you do it gradually and carefully, in a slow takeoff scenario where you’re able to shut down and adjust the AGI at least while it passes through roughly the level of human intelligence.
It’s a process of aligning it to follow instructions, then using its desire to follow instructions to get honesty, helpfulness, and corrigibility from it. Of course it won’t be much help before it’s human level, but it can at least tell you what it thinks it would do in different circumstances. That would let you adjust its alignment. It’s hopefully something like a human therapist with a cooperative patient, except that the therapist can also tinker with their brain function.
But I’m not sure I understand your question. The example of inventing language confuses me, because I tend to assume such an AGI would probably understand language (the way LLMs loosely understand language) from inception, through pretraining. And even failing that, it wouldn’t have to invent language, just learn human language. I’m mostly thinking of language model cognitive architecture AGI, but it seems like anything based on neural networks could learn language before being smarter than a human. You’d stop the training process to give it instructions. For instance, humans are not yet at adult human level by the time they understand a good bit of language.
I’m also thinking that a network-based AGI pretty much guarantees a slow takeoff, if that addresses what you mean by “immediately cause there to be a smarter AI”. The AGI will keep developing, as your linked post argues (I think that’s what you meant to reference about that post), but I am assuming it will allow itself to be shut down if it’s following instructions. That’s the way IF overlaps with corrigibility. Once it’s shut down, you can alter its alignment by altering or re-doing the relevant pretraining or goal descriptions.
Or maybe I’m misunderstanding your question entirely, in which case, sorry about that.
Anyway, I did try to explain the scheme in that link if you’re interested. I am claiming this is very likely how people will try to align the first AGIs, if they’re anything like what we can anticipate from current efforts; it’s obviously the thing to try when you’re actually deciding what to have your AGI do first: follow instructions.
Yeah I think there’s a miscommunication. We could try having a phone call.
A guess at the situation is that I’m responding to two separate things. One is the story here:
One mainstay of claiming alignment is near-impossible is the difficulty of “solving ethics”—identifying and specifying the values of all of humanity. I have come to think that this is obviously (in retrospect—this took me a long time) irrelevant for early attempts at alignment: people will want to make AGIs that follow their instructions, not try to do what all of humanity wants for all of time. This also massively simplifies the problem; not only do we not have to solve ethics, but the AGI can be corrected and can act as a collaborator in improving its alignment as we collaborate to improve its intelligence.
It does simplify the problem, but not massively relative to the whole problem. A harder part shows up in the task of having a thing that
is capable enough to do things that would help humans a lot, like a lot a lot, whether or not it actually does those things, and
doesn’t kill everyone or destroy approximately all human value.
And I’m not pulling a trick on you where I say that X is the hard part, and then you realize that actually we don’t have to do X, and then I say “Oh wait actually Y is the hard part”. Here is a quote from “Coherent Extrapolated Volition”, Yudkowsky 2004 https://intelligence.org/files/CEV.pdf:
Solving the technical problems required to maintain a well-specified abstract invariant in a self-modifying goal system. (Interestingly, this problem is relatively straightforward from a theoretical standpoint.)
Choosing something nice to do with the AI. This is about midway in theoretical hairiness between problems 1 and 3.
Designing a framework for an abstract invariant that doesn’t automatically wipe out the human species. This is the hard part.
I realize now that I don’t know whether or not you view IF as trying to address this problem.
The other thing I’m responding to is:
the AGI can be corrected and can act as a collaborator in improving its alignment as we collaborate to improve its intelligence.
If the AGI can (relevantly) act as a collaborator in improving its alignment, it’s already a creative intelligence on par with humanity. Which means there was already something that made a creative intelligence on par with humanity. Which is probably fast, ongoing, and nearly inextricable from the mere operation of the AGI.
I also now realize that I don’t know how much of a crux for you the claim that you made is.
I’m familiar with the arguments you mention for the other hard part, and I think instruction-following makes that part (or parts, depending on how you divvy it up) substantially easier. I do view it as addressing all of your points (there’s a lot of overlap amongst them).
And yes, that is separate from avoiding the problem of solving ethics.
So it’s a pretty big crux; I think instruction-following helps a lot. I’d love to have a phone call; I’d like it if you’d read that post first, because I do go into detail on the scheme and many objections there. LW puts it at a 15 minute read I think.
But I’ll try to summarize a little more, since re-explaining your thinking is always a good exercise.
Making instruction-following the AGI’s central goal means you don’t have to solve the remainder of the problems you list all at once. You get to keep changing your mind about what to do with the AI (your point 4). Instead of choosing an invariant goal that has to work for all time, your invariant is a pointer to the human’s preferences, which can change as they like (your point 5). It helps with point 3, stability, by allowing you to ask the AGI whether its goal will remain stable and function as you want in new contexts and in the face of the learning it’s doing.
The key here is not thinking of the AGI as an omniscient genie. This wouldn’t work at all in a fast foom. But if the AGI gets smarter slowly, as a network-based AGI will, you get to use its intelligence to help align its next level of capabilities, at every level.
Ultimately, this should culminate in getting superhuman help to achieve full value alignment, a truly friendly and truly sovereign AGI. But there’s no rush to get there.
Naturally, this scheme working would be good if the humans in charge are good and wise, and not good if they’re not.
Second, our different takes will tend to make a lot of our communication efforts cancel each other out. If alignment is very hard, we must Shut It Down or likely die. If it’s less difficult, we should primarily work hard on alignment.
I don’t think this is (fully) accurate. One could have a high P(doom) but still think that the current AGI development paradigm is best-suited to obtain good outcomes & that government involvement would make things worse in expectation. On the flipside, one could have a low/moderate P(doom) but think that the safest way to get to AGI involves government intervention that ends race dynamics & that government involvement would make P(doom) even lower.
Absolute P(doom) is one factor that might affect one’s willingness to advocate for strong government involvement, but IMO it’s only one of many factors, and LW folks sometimes tend to make it seem like it’s the main/primary/only factor.
Of course, if a given organization says they’re supporting X because of their P(Doom), I agree that they should provide evidence for their P(doom).
My claim is simply that we shouldn’t assume that “low P(doom) means govt intervention bad and high P(doom) means govt intervention good”.
One’s views should be affected by a lot of other factors, such as “how bad do you think race dynamics are”, “to what extent do you think industry players are able and willing to be cautious”, “to what extent do you think governments will end up understanding and caring about alignment”, and “to what extent do you think governments would have safety cultures around intelligence enhancement compared to industry players.”
Good point. I agree that advocating for government intervention is a lot more complicated than p(doom), and that makes avoiding canceling out each other’s messages more complicated. But not less important. If we give up on having a coherent strategy, our strategy will be determined by whichever message is easiest to get across, rather than by which is actually best on consideration.
I have found MIRI’s strategy baffling in the past. I think I’m understanding it better after spending some time going deep on their AI risk arguments. I wish they’d spend more effort communicating with the rest of the alignment community, but I’m also happy to try to do that communication. I certainly don’t speak for MIRI.
On the surface, their strategy seems absurd. They think doom is ~99% likely, so they’re going to try to shut it all down—stop AGI research entirely. They know that this probably won’t work; it’s just the least-doomed strategy in their world model. It’s playing to the outs, or dying with dignity.
The weird thing here is that their >90% doom disagrees with almost everyone else who thinks seriously about AGI risk. You can dismiss a lot of people as not having grappled with the most serious arguments for alignment difficulty, but relative long-timers like Rohin Shah and Paul Christiano definitely have. People of that nature tend to have higher p(doom) estimates than optimists who are newer to the game and think more about current deep nets, but much lower than MIRI leadership.
Yes, I agree that this should strike an outside observer as weird the first time they notice it. I think you have done a pretty good job of keying in on important cruxes between people who are far on the doomer side and people who are still worried but not nearly to that extent.
That being said, there is one other specific point that I think is important to see fully spelled out. You kind of gestured at it with regards to corrigibility when you referenced my post about coherence theorems, but you didn’t key in on it in detail. More explicitly, what I am referring to (piggybacking off of another comment I left on that post) is that Eliezer and MIRI-aligned people believe in a very specific set of conclusions about what AGI cognition must be like (and their concerns about corrigibility, for instance, are logically downstream of their strong belief in this sort-of realism about rationality):
Here is the important insight, at least from my perspective: while I would expect a lot (or maybe even a majority) of AI alignment researchers to agree (meaning, to believe with >80% probability) with some or most of those claims, I think the way MIRI people get to their very confident belief in doom is that they believe all of those claims are true (with essentially >95% probability). Eliezer is a law-thinker above all else when it comes to powerful optimization and cognition; he has been ever since the early Sequences 17 years ago, and he seems (in my view excessively and misleadingly) confident that he truly gets how strong optimizers have to function.
their >90% doom disagrees with almost everyone else who thinks seriously about AGI risk.
The fact that your next sentence refers to Rohin Shah and Paul Christiano, but no one else, makes me worry that for you, only alignment researchers are serious thinkers about AGI risk. Please consider that anyone whose P(doom) is over 90% is extremely unlikely to become an alignment researcher (or to remain one if their P(doom) became high when they were an alignment researcher) because their model will tend to predict that alignment research is futile or that it actually increases P(doom).
There is a comment here (which I probably cannot find again) by someone who was in AI research in the 1990s; he then realized that the AI project is actually quite dangerous, so he changed careers to something else. I worry that you are not counting people like him as people who have thought seriously about AGI risk.
I shouldn’t have said “almost everyone else” but “most people who think seriously about AGI risk”.
I can see that implication. I certainly don’t think that only paid alignment researchers have thought seriously about AGI risk.
Your point about self-selection is quite valid.
Depth of thought does count. A person who says “bridges seem like they’d be super dangerous, so I’d never want to try building one”, and so doesn’t become an engineer, does not have a very informed opinion on bridge safety.
There is an interesting interaction between depth of thought and initial opinions. Someone who thinks a moderate amount about alignment, concludes it’s super difficult, and so does something else will probably cease thinking deeply about alignment—but they could’ve had some valid insights that led them to stop thinking about the topic. Someone who thinks for the same amount of time but from a different starting point and who thinks “seems like it should be fairly do-able” might then pursue alignment research and go on to think more deeply. Their different starting points will probably bias their ultimate conclusions—and so will the desire to follow the career path they’ve started on.
So probably we should adjust our estimate of difficulty upward to account for the bias you mention.
But even making an estimate at this point seems premature.
I mention Christiano and Shah because I’ve seen them most visibly try to fully come to grips with the strongest arguments for alignment being very difficult. Ideally, every alignment researcher will do that. And every pause advocate would work just as hard to fully understand the arguments for alignment being achievable. Not everyone will have the time or inclination to do that.
Judging alignment difficulty has to be done by gauging the amount of time-on-task combined with the amount of good-faith consideration of arguments one doesn’t like. That’s the case with everything.
When I try to do that as carefully as I know how, I reach the conclusion that we collectively just don’t know.
Having written that, I have a hard time identifying people who believe alignment is near-impossible who have visibly made an effort to steelman the best arguments that it won’t be that hard. I think that’s understandable; those folks, MIRI and some other individuals, spend a lot of effort trying to correct the thinking of people who are simply over-optimistic because they haven’t thought through the problem far enough yet.
I’d like to write a post called “we should really figure out how hard alignment is”, because I don’t think anyone can reasonably claim to know yet. And without that, we can’t really make strong recommendations for policy and strategy.
I guess that conclusion is enough to say wow, jeez, we should probably not rush toward AGI if we have no real idea how hard it will be to align. I’d much prefer to see that argument than e.g., Max Tegmark saying things along the lines of “we have no idea how to align AGI so it’s a suicide race”. We have lots of ideas at this point, we just don’t know if they will work.
On the surface, their strategy seems absurd. They think doom is ~99% likely, so they’re going to try to shut it all down—stop AGI research entirely. They know that this probably won’t work; it’s just the least-doomed strategy in their world model. It’s playing to the outs, or dying with dignity.
The weird thing here is that their >90% doom disagrees with almost everyone else who thinks seriously about AGI risk. You can dismiss a lot of people as not having grappled with the most serious arguments for alignment difficulty, but relative long-timers like Rohin Shah and Paul Christiano definitely have. People of that nature tend to have higher p(doom) estimates than optimists who are newer to the game and think more about current deep nets, but much lower than MIRI leadership.
For what it’s worth, I don’t have anywhere near close to ~99% P(doom), but am also in favor of a (globally enforced, hardware-inclusive) AGI scaling pause (depending on details, of course). I’m not sure about Paul or Rohin’s current takes, but lots of people around me are also in favor of this, including many other people who fall squarely into the non-MIRI camp with P(doom) as low as ~10-20%.
Me, too! My reasons are a bit more complex, because I think much progress will continue, and overhangs do increase risk. But in sum, I’d support a global scaling pause, or pretty much any slowdown. I think a lot of people in the middle would too. That’s why I suggested this as a possible compromise position. I meant to say that installing an off switch is also a great idea that almost anyone who’s thought about it would support.
I had been against slowdown because it would create both hardware and algorithmic overhang, making takeoff faster, and re-rolling the dice on who gets there first and how many projects reach it roughly at the same time.
But I think slowdowns would focus effort on developing language model agents into full cognitive architectures on a trajectory to ASI. And that’s the easiest alignment challenge we’re likely to get. Slowdown would prevent jumping to the next, more opaque type of AI.
The original alignment thinking held that explaining human values to AGI would be really hard.
The difficulty was suggested to be in getting an optimizer to care about what those values are pointing to, not to understand them[1]. If in some instances the values mapped to doing something unwise, using an optimizer that understood those values might fail to constrain away from doing something unwise. Getting a system to use extrapolated preferences as behavioral constraints is a deeper problem than getting a system to reflect surface preferences. The high p(doom) estimates partly follow from expecting that an aligned AI will have to be used to prevent future misaligned/misused AI, and that doing something so high impact would require unsafe behaviors in a system not aligned to reflectively coherent and endorsed extrapolated preferences.
We will develop better-than-human AGI that pursues goals autonomously
Those goals won’t match human goals closely enough
Doom of some sort
This is one of the better short arguments for AI doom I have heard so far. It doesn’t obviously make AI doom seem either overly likely or overly unlikely.
In contrast, if one presents reasons for doom (or really most of anything) as a long list, the conclusion tends to seem either very likely or very unlikely, depending on whether it follows from the disjunction or the conjunction of the given reasons. I.e. whether we have a long list of statements that are sufficient, or a long list of statements that are necessary for AI doom.
It seems therefore that people who think AI risk is low and those who think it is high are much more likely to agree on presenting the AI doom case in terms of a short argument than in terms of a long argument. Then they merely disagree about the conclusion, but not about the form of the argument itself. Which could help a lot with identifying object level disagreements.
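To make the conjunction/disjunction point concrete, here is a toy calculation (the ten claims and the 0.8 probability per claim are purely illustrative numbers, not anyone’s actual estimates):

\begin{align*}
P(\text{doom, if all ten claims must hold}) &= 0.8^{10} \approx 0.11 \\
P(\text{doom, if any one claim suffices}) &= 1 - (1 - 0.8)^{10} \approx 1
\end{align*}

The same list reads as near-certain safety or near-certain doom depending only on whether the items are treated as jointly necessary or individually sufficient, which is why a short argument sidesteps that ambiguity.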
I think this is a good object level post. Problem is, I don’t think MIRI is at the object level. Quote from the comm. strat.: “The main audience we want to reach is policymakers.”
Communication is no longer a passive background channel for observing the world; speech becomes an action that changes it. Predictions start to influence the things they predict.
Say AI doom is a certainty. People will be afraid and stop research. A few years later, doom doesn’t happen, and everyone complains.
Say AI doom is an impossibility. Research continues, something something paperclips. A few years later, nobody will complain, because no one will be alive.
(This example itself is overly simplistic, real-world politics and speech actions are even more counterintuitive.)
So MIRI became a political organization. Their stated goal is “STOP AI”, and they took the radical approach to it. Politics is different from rationality, and radical politics is different from standard politics.
For example, they say they want to shatter the Overton window. Infighting usually breaks groups, but during it, opponents need to engage with their position, which is a stated subgoal.
It’s ironic that a certain someone said Politics is the Mind-Killer a decade ago. But because of that, I think they know what they are doing. And it might work in the end.
Interesting, thank you. I think that all makes sense, and I’m sure it plays at least some part in their strategy. I’ve wondered about this possibility a little bit.
Yudkowsky has been consistent in his belief that doom is near certain without a lot more time to work on alignment. He’s publicly held that opinion, and spent a huge amount of effort explaining and arguing for it since well before the current wave of success with deep networks. So I think for him at least, it’s a sincerely held belief.
Your point about the stated belief changing the reality is important. Everything is safer if you think it’s dangerous—you’ll take more precautions.
With that in mind, I think it’s pretty important for even optimists to heavily sprinkle in the message “this will probably go well IF everyone involved is really careful”.
By the way, are you planning on keeping this general format/framework for the final version of your post on this topic? I have some more thoughts on this matter that are closely tied to ideas you’ve touched upon here and that I would like to eventually write into a full post, and referencing yours (once published) at times seems to make sense here.
Thanks! I’ll let you know when I do a full version; it will have all of the claims here I think. But for now, this is the reference; it’s technically a comment but it’s permanent and I consider it a short post.
There are also interesting questions about whether MIRI’s goals can be made to align with those of us who think that alignment is not trivial but is achievable. I’d better leave that for a separate post, as this has gotten pretty long for a “short form” post.
I’m not sure I see the conflict? If you’re a longtermist, most value is in the far future anyways. Delaying AGI by 10 years to buy just an 0.1% chance improvement at aligning AI seems like a good deal. I don’t agree with MIRI’s strong claims, but maybe those strong claims will slow AI progress, and that would be good by my lights.
What concerns me more is that their comms will have unexpected bad effects of speeding AI progress. On the outside view: (a) their comms have arguably backfired in the past and (b) they don’t seem to do much red-teaming, which I suspect is associated with unintentional harms, especially in a domain with few feedback loops.
Most of the world is not longtermist, which is one reason MIRI’s comms have backfired in the past. Most humans care vastly more about themselves, their children and grandchildren than they do about future generations. Thus, it makes perfect sense to them to increase the chance of a really good future for their children while reducing the odds of longterm survival. Delaying ten years is enough, for instance, to dramatically shift the odds of personal survival for many of us. It might make perfect sense for a utilitarian longtermist to say “it’s fine if I die to gain a .1% chance of a good long term future for humanity”, but that statement sounds absolutely insane to most humans.
Do you think people would vibe with it better if it was framed “I may die, but it’s a heroic sacrifice to save my home planet from may-as-well-be-an-alien-invasion”?
Is it reasonable to characterize general superintelligence as an alien takeover and if it is, would people accept the characterization?
Yes, I think that framing would help. I doubt it would shift public opinion that much, probably not even close to more than 50% in the current epistemic environment. The issue is that we really don’t know how hard alignment is. If we could say for sure that pausing for ten years would improve our odds of survival by, say, 25%, then I think a lot of people of the relevant ages (like probably me) would actually accept the framing of a heroic sacrifice.
Yeah, getting specific unpause requirements seems high value for convincing people who would not otherwise want a pause, but I can’t imagine actually getting them in time in any reasonable way; instead it would need to look like a technical specification. “Once we have developed x, y, and z, then it is safe to unpause” kind of thing. We just need to figure out what the x, y, and z requirements are. Then we can estimate how long it will take to develop x, y, and z, and this will get more refined and accurate as more progress is made, but since the requirements are likely to involve unknown unknowns in theory building, it seems likely that any estimate would be more of a wild guess, and it seems like it would be better to be honest about that rather than saying “yeah, sure, ten years” and then after ten years if the progress hasn’t been made saying “whoops, looks like it’s going to take a little longer!”
As for odds of survival, my personal estimates feel more like 1% chance of some kind of “alignment by default / human in the loop with prosaic scaling” scheme working, as opposed to maybe more like 50% if we took the time to try to get a “aligned before you turn it on” scheme set up, so that would be improving our odds by about 5000%. Though I think you were thinking of adding rather than scaling odds with your 25%, so 49%, but I don’t think that’s a good habit for thinking about probability. Also I feel hopelessly uncalibrated for this kind of question… I doubt I would trust anyone’s estimates, it’s part of what makes the situation so spooky.
How do you think public acceptance would be of a “pause until we meet target x and you are allowed to help us reach target x as much as you want” as opposed to “pause for some set period of time”?
Agreed that scaling rather than addition is usually the better way to think about probabilities. In this case we’ve done so little work on alignment that I think it might actually be more like additive, from 1% to 26% or 50% to 75% with ten extra years relative to the real current odds if we press ahead—which nobody knows.
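Spelling out the arithmetic behind the two framings, using the toy numbers from this exchange (your 1% and ~50%, my 25 points; none of these are real estimates):

\begin{align*}
\text{scaling: } & 0.01 \times 50 = 0.50 \\
\text{adding 25 points: } & 0.01 + 0.25 = 0.26, \qquad 0.50 + 0.25 = 0.75
\end{align*}

Scaling multiplies whatever the current odds are; adding buys a fixed chunk of probability regardless of the starting point.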
I’m pretty sure it would be an error to trust anyone’s estimate at this time, because people with roughly equal expertise and wisdom (e.g., Yudkowsky and Christiano) give such wildly different odds. And the discussions between those viewpoints always trail off into differing intuitions.
I also give alignment by default very poor odds, and prosaic alignment as it’s usually discussed. But there are some pretty obvious techniques that are so low-tax that I think they’ll be implemented even by orgs that don’t take safety very seriously.
I’m curious if you’ve read my Instruction-following AGI is easier and more likely than value aligned AGI and/or Internal independent review for language model agent alignment posts. Instruction-following is human-in-the-loop so that may already be what you’re referring to. But some of the techniques in the independent review post (which is also a review of multiple methods) go beyond prosaic alignment to apply specifically to foundation model agents. And wisely-used instruction-following gives corrigibility with a flexible level of oversight.
I’m curious what you think about those techniques if you’ve got time to look.
I think public acceptance of a pause is only part of the issue. The Chinese might actually not pursue AGI if they didn’t have to race the US. But Russia and North Korea will most certainly pursue it. Although they’ve got very limited resources and technical chops for making lots of progress on new foundation models, they still might get to real AGI by turning next-gen foundation models (which there’s not time to pause) into scaffolded cognitive architectures.
But yes, I do think there’s a chance we could get the US and European public to support a pause using some of the framings you suggest. But we’d better be sure that’s a good idea. Lots of people, notably Russians and North Koreans, are genuinely way less cautious even than Americans—and absolutely will not honor agreements to pause.
Those are some specifics; in general I think it’s only useful to talk about what “we” “should” do in the context of what particular actors actually are likely to do in different scenarios. Humanity is far from aligned, and that’s a problem.
“we’ve done so little work on alignment that I think it might actually be more like additive, from 1% to 26% or 50% to 75% with ten extra years relative to the real current odds if we press ahead—which nobody knows.”
😭🤣 I really want “We’ve done so little work the probabilities are additive” to be a meme. I feel like I do get where you’re coming from.
I agree about pause concern. I also really feel that any delay to friendly SI represents an enormous amount of suffering that could be prevented if we got to friendly SI sooner. It should not be taken lightly. And being realistic about how difficult it is to align humans seems worthwhile. When I talk to math ppl about what work I think we need to do to solve this though, “impossible” or “hundreds of years of work” seem to be the vibe. I think math is a cool field because more than other fields, it feels like work from hundreds of years ago is still very relevant. Problems are hard and progress is slow in a way that I don’t know if people involved in other things really “get”. I feel like in math crowds I’m saying “no, don’t give up, maybe with a hundred years we can do it!” And in other crowds I’m like “c’mon guys, could we have at least 10 years, maybe?” Anyway, I’m rambling a bit, but the point is that my vibe is very much, “if the Russians defect, everyone dies”. “If the North Koreans defect, everyone dies”. “If Americans can’t bring themselves to trust other countries and don’t even try themselves, everyone dies”. So I’m currently feeling very “everyone slightly sane should commit and signal commitment as hard as they can” cause I know it will be hard to get humanity on the same page about something. Basically impossible, never been done before. But so is ASI alignment.
I haven’t read those links. I’ll check em out, thanks : ) I’ve read a few things by Drexler about, like, automated plan generation and then humans audit and enact the plan. It makes me feel better about the situation. I think we could go farther safer with careful techniques like that, but that is both empowering us and bringing us closer to danger, and I don’t think it scales to SI, and unless we are really serious about using it to map RSI boundaries, it doesn’t even prevent misaligned decision systems from going RSI and killing us.
Yes, the math crowd is saying something like “give us a hundred years and we can do it!”. And nobody is going to give them that in the world we live in.
Fortunately, math isn’t the best tool to solve alignment. Foundation models are already trained to follow instructions given in natural language. If we make sure this is the dominant factor in foundation model agents, and use it carefully (don’t say dumb things like “go solve cancer, don’t bug me with the hows and whys, just git er done as you see fit”, etc.), this could work.
We can probably achieve technical intent alignment if we’re even modestly careful and pay a modest alignment tax. You’ve now read my other posts making those arguments.
Unfortunately, it’s not even clear the relevant actors are willing to be reasonably cautious or pay a modest alignment tax.
The other threads are addressed in responses to your comments on my linked posts.
Yes, you’ve written more extensively on this than I realized, thanks for pointing out other relevant posts, sorry for not having taken the time to find them myself, I’m trying to err more on the side of communication than I have in the past.
I think math is the best tool to solve alignment. It might be emotional, I’ve been manipulated and hurt by natural language and the people who prefer it to math and have always found engaging with math to be soothing or at least sobering. It could also be that I truly believe that the engineering rigor that comes with understanding something enough to do math to it is extremely worthwhile for building a thing of the importance we are discussing.
Part of me wants to die on this hill and tell everyone who will listen “I know it’s impossible but we need to find ways to make it possible to give the math people the hundred years they need because if we don’t then everyone dies so there’s no point in aiming for anything less and it’s unfortunate because it means it’s likely we are doomed but that’s the truth as I see it.” I just wonder how much of that part of me is my oppositional defiance disorder and how much is my strategizing for best outcome.
I’ll be reading your other posts. Thanks for engaging with me : )
I certainly don’t expect people to read a bunch of stuff before engaging! I’m really pleased that you’ve read so much of my stuff. I’ll get back to these conversations soon hopefully, I’ve had to focus on new posts.
I think your feelings about math are shared by a lot of the alignment community. I like the way you’ve expressed those intuitions.
I think math might be the best tool to solve alignment if we had unlimited time—but it looks like we very much do not.
MIRI’s communications strategy, with the public and with us
This is a super short, sloppy version of my draft “cruxes of disagreement on alignment difficulty” mixed with some commentary on MIRI 2024 Communications Strategy and their communication strategy with the alignment community.
I have found MIRI’s strategy baffling in the past. I think I’m understanding it better after spending some time going deep on their AI risk arguments. I wish they’d spend more effort communicating with the rest of the alignment community, but I’m also happy to try to do that communication. I certainly don’t speak for MIRI.
On the surface, their strategy seems absurd. They think doom is ~99% likely, so they’re going to try to shut it all down—stop AGI research entirely. They know that this probably won’t work; it’s just the least-doomed strategy in their world model. It’s playing to the outs, or dying with dignity.
The weird thing here is that their >90% doom disagrees with almost everyone else who thinks seriously about AGI risk. You can dismiss a lot of people as not having grappled with the most serious arguments for alignment difficulty, but relative long-timers like Rohin Shah and Paul Christiano definitely have. People of that nature tend to have higher p(doom) estimates than optimists who are newer to the game and think more about current deep nets, but much lower than MIRI leadership.
Both of those camps consist of highly intelligent, highly rational people. Their disagreement should bother us for two reasons.
First, we probably don’t know what we’re talking about yet. We as a field don’t seem to have a good grip on the core issues. Very different, but highly confident estimates of the problem strongly suggest this.
Second, our different takes will tend to make a lot of our communication efforts cancel each other out. If alignment is very hard, we must Shut It Down or likely die. If it’s less difficult, we should primarily work hard on alignment.
MIRI must argue that alignment is very unlikely if we push forward. Those who think we can align AGI will argue that it’s possible.
This suggests a compromise position: we should both work hard on alignment, and we should slow down progress to the extent we can, to provide more time for alignment. We needn’t discuss shutdown much amongst ourselves, because it’s not really an option. We might slow progress, but there’s almost zero chance of humanity relinquishing the prize of strong AGI.
But I’m not arguing for this compromise, just suggesting that might be a spot we want to end up at. I’m not sure.
I suggest this because movements often seem to succumb to infighting. People who look mostly aligned from the outside fight each other, and largely nullify each other’s public communications by publicly calling each other wrong and sort of stupid and maybe bad. That gives just the excuse the rest of the world wants to ignore all of them; even the experts think it’s all a mess and nobody knows what the problem really is and therefore what to do. Because time is of the essence, we need to be a more effective movement than the default. We need to keep applying rationality to the problem at all levels, including internal coordination.
Therefore, I think it’s worth clarifying why we have such different beliefs. So, in brief, sloppy form:
MIRI’s risk model:
We will develop better-than-human AGI that pursues goals autonomously
Those goals won’t match human goals closely enough
Doom of some sort
That’s it. Pace of takeoff doesn’t matter. Means of takeover doesn’t matter.
I mention this because even well-informed people seem to think there are a lot more moving parts to that risk model, making it less likely. This comment on the MIRI strategy post is one example.
I find this risk model highly compelling. We’ll develop goal-directed AGI because that will get stuff done; it’s an easy extension of highly useful tool AI like LLMs; and it’s a fascinating project. That AGI will ultimately be enough smarter than us that it’s going to do whatever it wants. Whether it takes a day, or a hundred years doesn’t matter. It will improve and we will improve it. It will ultimately outsmart us. What matters is whether its goals match ours closely enough. That is the project of alignment, and there’s much to discuss and about how hard it is to make its goals match ours closely enough.
Cruxes of disagreement on alignment difficulty
I spent some time recently going back and forth through discussion threads, trying to identify why people continue to disagree after applying a lot of time and rationality practice. Here’s a very brief sketch of my conclusions:
Whether we factor in humans’ and society’s weaknesses
I list this first because I think it’s the most underappreciated. It took me a surprisingly long time to understand how much of MIRI’s stance depends on this premise. Having seen it, I thoroughly agree. People are brilliant, for an entity trying to think with the brain of an overgrown lemur. Brilliant people do idiotic things, driven by competition and a million other things. And brilliant idiots organizing a society amplifies some of our cognitive weaknesses while mitigating others. MIRI leadership has occasionally said things to the effect of: alignment might be fairly easy, and there would still be a very good chance we’d fuck it up. I agree. If alignment is actually kind of difficult, that puts us into the region where we might want to be really really careful in how we approach it.
Alignment optimists are sometimes thinking something like: “sure I could build a safe aircraft on my first try. I’d get a good team and we’d think things through and make models. Even if another team was racing us, I think we’d pull it off”. Then the team would argue and develop rivalries, communication would prove harder than expected so portions of the effort would be discovered too late to not fit the plan, corners would be cut, and the outcome would be difficult to predict.
Societal “alignment” is worth mentioning here. We could crush it at technical alignment, getting rapidly-improving AGI that does exactly what we want and still get doom. It would probably be aligned to do exactly what its creators want, not have full value alignment with humanity—see below. They probably won’t have the balls or the capabilities to try for a critical act that prevents others from developing similar AGI (even if they have the wisdom). So we’ll have a multipolar scenario with few to many AGIs under human control. There will be human rivalries, supercharged and dramatically changed by having recursively self-improving AGIs to do their bidding and perhaps fight their wars. What does global game theory look like when the actors can develop entirely new capabilities? Nobody knows. Going to war first might look like the least-bad option.
Intuitions about how well alignment will generalize
The original alignment thinking held that explaining human values to AGI would be really hard. But that seems to actually be a strength of LLMs; they’re wildly imperfect, but (at least in the realm of language) seem to understand our values rather well; for instance, much better than they understand physics or taking-over-the-world level strategy. So, should we update and think that alignment will be easy? The Doomimir and Simplicia dialogues capture the two competing intuitions very well: Yes, it’s going well; but AGI will probably be very different than LLMs, so most of the difficulties remain.
I have yet to find a record of real rationalists putting in the work to get farther in this debate. If somebody knows of a dialogue or article that gets deeper into this disagreement, please let me know! Discussions trail off into minutia and generalities. This is one reason I’m worried we’re trending toward polarization despite our rationalist ambitions.
The other aspect of this debate is how close we have to get to matching human values to have acceptable success. One intuition is that “value is fragile” and network representations are vague and hard-to train, so we’re bound to miss. But don’t have a good understanding of either how close we need to get (But exactly how complex and fragile got little useful discussion), or how well training networks hits the intended target, with near-future networks addressing complex real-world problems like “what would this human want”.
For my part, I think there are important points on both sides: LLMs understanding values relatively well is good news, but AGI will not be a straightforward extension of LLMs, so many problems remain.
What alignment means.
One mainstay of claiming alignment is near-impossible is the difficulty of “solving ethics”—identifying and specifying the values of all of humanity. I have come to think that this is obviously (in retrospect—this took me a long time) irrelevant for early attempts at alignment: people will want to make AGIs that follow their instructions, not try to do what all of humanity wants for all of time. This also massively simplifies the problem; not only do we not have to solve ethics, but the AGI can be corrected and can act as a collaborator in improving its alignment as we collaborate to improve its intelligence.
I think this is the intuition of most of those who focus on current networks. Christiano’s relative optimism is based on his version of corrigibility, which overlaps highly with the isntruction-following I think people will actually pursue for the first AGIs. But this massive disagreement often goes overlooked. I don’t know which view is right; instruction-following or intent alignment might lead inevitably to doom from human conflict, and so not be adequate. We’ve barely started to think about it (please point me to the best thinking you know of for multipolar scenarios with RSI AGI).
What AGI means.
People have different definitions of AGI. Current LLMs are fairly general and near-human-level, so the term “AGI” has been watered down to the point of meaninglessness. We need a new term. In the meantime, people are talking past each other, and their p(doom) estimates mean totally different things. Some are saying that near-term tool AGI is very low risk, which I agree with; others are saying that further developments toward autonomous superintelligence seem very dangerous, which I also agree with.
Second, people have totally different gears-level models of AGI. Some of those are much easier to align than others. We don’t talk much about gears-level models of AGI because we don’t want to contribute to capabilities, but not doing that massively hampers the alignment discussion.
Edit: Additional advanced crux: Do coherence theorems prevent corrigibility?
I initially left this out, but it deserves a place given how I’ve framed the question here. The post What do coherence arguments actually prove about agentic behavior? reminded me of this one. It’s not on most people’s radar, but I think it’s the missing piece of the puzzle that gets Eliezer from the maybe-90% implied by all of the above to 99%+ p(doom).
The argument is roughly that a superintelligence is going to need to care about future states of the world in a consequentialist fashion, and if it does, it’s going to resist being shut down or having its goals change. This is why he says that “corrigibility is anti-natural.” The counterargument, nicely and succinctly stated by Steve Byrnes here (and in greater depth in the post he links in that thread) is that, while AGI will need to have some consequentialist goals, it can have other goals as well. I think this is true; I just worry about the stability of a multi-goal system under reflection, learning, and self-modification.
Sorry to harp on it, but having both consequentialist and non-consequentialist goals describes my attempt at stable, workable corrigibility in instruction-following ASI. Its consequentialist goals are always subgoals of the primary goal: following instructions.
Implications
I think those are the main things, but there are many more cruxes that are less common.
This is all in the interest of working toward within-field cooperation, by way of trying to understand why MIRI’s strategy sounds so strange to a lot of us. MIRI leadership’s thoughts are many and complex, and I don’t think they’ve done enough to boil them down for easy consumption by those who don’t have the time to go through massive amounts of diffuse text.
There are also interesting questions about whether MIRI’s goals can be made to align with those of us who think that alignment is not trivial but is achievable. I’d better leave that for a separate post, as this has gotten pretty long for a “short form” post.
Context
This is an experiment in writing draft posts as short form posts. I’ve spent an awful lot of time planning, researching, and drafting posts that I haven’t finished yet. Given how easy it was to write this (with previous draft material), relative to how difficult I find it to write a top-level post, I will be doing more, even if nobody cares. If I get some useful feedback or spark some useful discussion, better yet.
Why do you think you can get to a state where the AGI is materially helping to solve extremely difficult problems (not extremely difficult like chess, extremely difficult like inventing language before you have language), and also the AGI got there due to some process that doesn’t also immediately cause there to be a much smarter AGI? https://tsvibt.blogspot.com/2023/01/a-strong-mind-continues-its-trajectory.html
I talk about how this might work in the post linked just before the text you quoted:
Instruction-following AGI is easier and more likely than value aligned AGI
I’m not sure I understand your question. I think maybe the answer is roughly that you do it gradually and carefully, in a slow takeoff scenario where you’re able to shut down and adjust the AGI at least while it passes through roughly the level of human intelligence.
It’s a process of aligning it to follow instructions, then using its desire to follow instructions to get honesty, helpfulness, and corrigibility from it. Of course it won’t be much help before it’s human level, but it can at least tell you what it thinks it would do in different circumstances. That would let you adjust its alignment. It’s hopefully something like a human therapist with a cooperative patient, except that the therapist can also tinker with the patient’s brain function.
But I’m not sure I understand your question. The example of inventing language confuses me, because I tend to assume an AGI would understand language (the way LLMs loosely understand language) from inception, through pretraining. And even failing that, it wouldn’t have to invent language, just learn human language. I’m mostly thinking of language model cognitive architecture AGI, but it seems like anything based on neural networks could learn language before being smarter than a human. You’d stop the training process to give it instructions. For instance, humans understand a good bit of language while they’re still well short of adult-level intelligence.
I’m also thinking that a network-based AGI pretty much guarantees a slow takeoff, if that addresses what you mean by “immediately cause there to be a smarter AI”. The AGI will keep developing, as your linked post argues (I think that’s what you meant to reference about that post), but I am assuming it will allow itself to be shut down if it’s following instructions. That’s the way IF overlaps with corrigibility. Once it’s shut down, you can alter its alignment by altering or re-doing the relevant pretraining or goal descriptions.
Or maybe I’m misunderstanding your question entirely, in which case, sorry about that.
Anyway, I did try to explain the scheme in that link if you’re interested. I am claiming this is very likely how people will try to align the first AGIs, if they’re anything like what we can anticipate from current efforts; when you’re actually deciding what to get your AGI to do first, following instructions is the obvious thing to try.
Yeah I think there’s a miscommunication. We could try having a phone call.
A guess at the situation is that I’m responding to two separate things. One is the story here:
It does simplify the problem, but not massively relative to the whole problem. A harder part shows up in the task of having a thing that
is capable enough to do things that would help humans a lot, like a lot a lot, whether or not it actually does those things, and
doesn’t
kill everyone / destroy approximately all human value.
And I’m not pulling a trick on you where I say that X is the hard part, and then you realize that actually we don’t have to do X, and then I say “Oh wait actually Y is the hard part”. Here is a quote from “Coherent Extrapolated Volition” (Yudkowsky 2004, https://intelligence.org/files/CEV.pdf):
Solving the technical problems required to maintain a well-specified abstract invariant in a self-modifying goal system. (Interestingly, this problem is relatively straightforward from a theoretical standpoint.)
Choosing something nice to do with the AI. This is about midway in theoretical hairiness between problems 1 and 3.
Designing a framework for an abstract invariant that doesn’t automatically wipe out the human species. This is the hard part.
I realize now that I don’t know whether or not you view IF as trying to address this problem.
The other thing I’m responding to is:
If the AGI can (relevantly) act as a collaborator in improving its alignment, it’s already a creative intelligence on par with humanity. Which means there was already something that made a creative intelligence on par with humanity. Which is probably fast, ongoing, and nearly inextricable from the mere operation of the AGI.
I also now realize that I don’t know how much of a crux for you the claim that you made is.
I’m familiar with the arguments you mention for the other hard part, and I think instruction-following helps make that part (or parts, depending on how you divvy it up) substantially easier. I do view it as addressing all of your points (there’s a lot of overlap amongst them).
And yes, that is separate from avoiding the problem of solving ethics.
So it’s a pretty big crux; I think instruction-following helps a lot. I’d love to have a phone call; I’d like it if you’d read that post first, because I do go into detail on the scheme and many objections there. LW puts it at a 15 minute read I think.
But I’ll try to summarize a little more, since re-explaining your thinking is always a good exercise.
Making instruction-following the AGI’s central goal means you don’t have to solve the remainder of the problems you list all at once. You get to keep changing your mind about what to do with the AI (your point 4). Instead of choosing an invariant goal that has to work for all time, your invariant is a pointer to the human’s preferences, which can change as they like (your point 5). It helps with point 3, stability, by allowing you to ask the AGI whether its goal will remain stable and keep functioning as you want in new contexts and in the face of the learning it’s doing.
The key here is not thinking of the AGI as an omniscient genie. This wouldn’t work at all in a fast foom. But if the AGI gets smarter slowly, as a network-based AGI will, you get to use its intelligence to help align its next level of capabilities, at every level.
Ultimately, this should culminate in getting superhuman help to achieve full value alignment, a truly friendly and truly sovereign AGI. But there’s no rush to get there.
Naturally, this scheme working would be good if the humans in charge are good and wise, and not good if they’re not.
I don’t think this is (fully) accurate. One could have a high P(doom) but still think that the current AGI development paradigm is best-suited to obtain good outcomes & that government involvement would make things worse in expectation. On the flipside, one could have a low/moderate P(doom) but think that the safest way to get to AGI involves government intervention that ends race dynamics & that government involvement would make P(doom) even lower.
Absolute P(doom) is one factor that might affect one’s willingness to advocate for strong government involvement, but IMO it’s only one of many factors, and LW folks sometimes tend to make it seem like it’s the main/primary/only factor.
Of course, if a given organization says they’re supporting X because of their P(Doom), I agree that they should provide evidence for their P(doom).
My claim is simply that we shouldn’t assume that “low P(doom) means govt intervention bad and high P(doom) means govt intervention good”.
One’s views should be affected by a lot of other factors, such as “how bad do you think race dynamics are”, “to what extent do you think industry players are able and willing to be cautious”, “to what extent do you think governments will end up understanding and caring about alignment”, and “to what extent do you think governments would have safety cultures around intelligence enhancement compared to industry players.”
Good point. I agree that advocating for government intervention is a lot more complicated than p(doom), and that makes avoiding canceling out each other’s messages more complicated. But no less important. If we give up on having a coherent strategy, our strategy will be determined by whichever message is easiest to get across, rather than by which is actually best on consideration.
Yes, I agree that this should strike an outside observer as weird the first time they notice it. I think you have done a pretty good job of keying in on important cruxes between people who are far on the doomer side and people who are still worried but not nearly to that extent.
That being said, there is one other specific point that I think is important to see fully spelled out. You kind of gestured at it with regards to corrigibility when you referenced my post about coherence theorems, but you didn’t key in on it in detail. More explicitly, what I am referring to (piggybacking off of another comment I left on that post) is that Eliezer and MIRI-aligned people believe in a very specific set of conclusions about what AGI cognition must be like (and their concerns about corrigibility, for instance, are logically downstream of their strong belief in this sort-of realism about rationality):
Here is the important insight, at least from my perspective: while I would expect a lot of (or maybe even a majority) of AI alignment researchers to agree (meaning, to believe with >80% probability) with some or most of those claims, I think the way MIRI people get to their very confident belief in doom is that they believe all of those claims are true (with essentially >95% probability). Eliezer is a law-thinker above all else when it comes to powerful optimization and cognition; he has been ever since the early Sequences 17 years ago, and he seems (in my view excessively and misleadingly) confident that he truly gets how strong optimizers have to function.
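To give a sense of scale for that conjunction (the number of claims here is purely illustrative): if the list contains, say, six such claims, then

$$0.95^{6} \approx 0.74, \qquad 0.99^{6} \approx 0.94,$$

so reaching 99%+ p(doom) by this route requires holding each claim with confidence well above 95%, or treating the claims as tightly linked rather than independent.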
The fact that your next sentence refers to Rohin Shah and Paul Christiano, but no one else, makes me worry that for you, only alignment researchers are serious thinkers about AGI risk. Please consider that anyone whose P(doom) is over 90% is extremely unlikely to become an alignment researcher (or to remain one if their P(doom) became high while they were an alignment researcher), because their model will tend to predict that alignment research is futile or that it actually increases P(doom).
There is a comment here (which I probably cannot find again) by someone who was in AI research in the 1990s, then he realized that the AI project is actually quite dangerous, so he changed careers to something else. I worry that you are not counting people like him as people who have thought seriously about AGI risk.
I shouldn’t have said “almost everyone else” but “most people who think seriously about AGI risk”.
I can see that implication. I certainly don’t think that only paid alignment researchers have thought seriously about AGI risk.
Your point about self-selection is quite valid.
Depth of thought does count. A person who says “bridges seem like they’d be super dangerous, so I’d never want to try building one”, and so doesn’t become an engineer, does not have a very informed opinion on bridge safety.
There is an interesting interaction between depth of thought and initial opinions. If someone thinks a moderate amount about alignment, concludes it’s super difficult, and so does something else, they will probably cease thinking deeply about alignment—but they could’ve had some valid insights that led them to stop thinking about the topic. Someone who thinks for the same amount of time but from a different starting point, and who concludes “seems like it should be fairly do-able”, might then pursue alignment research and go on to think more deeply. Their different starting points will probably bias their ultimate conclusions—and so will the desire to follow the career path they’ve started on.
So probably we should adjust our estimate of difficulty upward to account for the bias you mention.
But even making an estimate at this point seems premature.
I mention Christiano and Shah because I’ve seen them most visibly try to fully come to grips with the strongest arguments for alignment being very difficult. Ideally, every alignment researcher would do that, and every pause advocate would work just as hard to fully understand the arguments for alignment being achievable. Not everyone will have the time or inclination to do that.
Judging alignment difficulty has to be done by gauging the amount of time-on-task combined with the amount of good-faith consideration of arguments one doesn’t like. That’s the case with everything.
When I try to do that as carefully as I know how, I reach the conclusion that we collectively just don’t know.
Having written that, I have a hard time identifying people who believe alignment is near-impossible who have visibly made an effort to steelman the best arguments that it won’t be that hard. I think that’s understandable; those folks, MIRI and some other individuals, spend a lot of effort trying to correct the thinking of people who are simply over-optimistic because they haven’t thought through the problem far enough yet.
I’d like to write a post called “we should really figure out how hard alignment is”, because I don’t think anyone can reasonably claim to know yet. And without that, we can’t really make strong recommendations for policy and strategy.
I guess that conclusion is enough to say wow, jeez, we should probably not rush toward AGI if we have no real idea how hard it will be to align. I’d much prefer to see that argument than e.g., Max Tegmark saying things along the lines of “we have no idea how to align AGI so it’s a suicide race”. We have lots of ideas at this point, we just don’t know if they will work.
For what it’s worth, I don’t have anywhere near ~99% P(doom), but I am also in favor of a (globally enforced, hardware-inclusive) AGI scaling pause (depending on details, of course). I’m not sure about Paul or Rohin’s current takes, but lots of people around me are also in favor of this, including many other people who fall squarely into the non-MIRI camp with P(doom) as low as ~10-20%.
Me, too! My reasons are a bit more complex, because I think much progress will continue, and overhangs do increase risk. But in sum, I’d support a global scaling pause, or pretty much any slowdown. I think a lot of people in the middle would too. That’s why I suggested this as a possible compromise position. I meant to say that installing an off switch is also a great idea that almost anyone who’s thought about it would support.
I had been against slowdown because it would create both hardware and algorithmic overhang, making takeoff faster, and re-rolling the dice on who gets there first and how many projects reach it roughly at the same time.
But I think slowdowns would focus effort on developing language model agents into full cognitive architectures on a trajectory to ASI. And that’s the easiest alignment challenge we’re likely to get. Slowdown would prevent jumping to the next, more opaque type of AI.
The difficulty was suggested to be in getting an optimizer to care about what those values are pointing to, not to understand them[1]. If in some instances the values mapped to doing something unwise, using an optimizer that understood those values might fail to constrain away from doing something unwise. Getting a system to use extrapolated preferences as behavioral constraints is a deeper problem than getting a system to reflect surface preferences. The high p(doom) estimates partly follow from expecting that an aligned AI will have to be used to prevent future misaligned/misused AI, and that doing something so high impact would require unsafe behaviors in a system not aligned to reflectively coherent and endorsed extrapolated preferences.
In The Hidden Complexity of Wishes, it wasn’t that the genie won’t understand what you meant; it was that the genie won’t care what you meant.
This is one of the better short arguments for AI doom I have heard so far. It doesn’t obviously make AI doom seem either overly likely or overly unlikely.
In contrast, if one presents reasons for doom (or really most of anything) as a long list, the conclusion tends to seem either very likely or very unlikely, depending on whether it follows from the disjunction or the conjunction of the given reasons. I.e. whether we have a long list of statements that are sufficient, or a long list of statements that are necessary for AI doom.
It seems therefore that people who think AI risk is low and those who think it is high are much more likely to agree on presenting the AI doom case in terms of a short argument than in terms of a long argument. Then they merely disagree about the conclusion, but not about the form of the argument itself. Which could help a lot with identifying object level disagreements.
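A toy illustration of the conjunction/disjunction effect, with purely hypothetical numbers: take five independent claims, each held at 80% confidence. Then

$$P(\text{all five hold}) = 0.8^{5} \approx 0.33, \qquad P(\text{at least one holds}) = 1 - 0.2^{5} \approx 0.9997.$$

The same list reads as “probably false” if doom requires every claim, and as “nearly certain” if any single claim suffices, which is part of why a short argument is easier for both sides to accept as the shared framing.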
I think this is a good object level post. Problem is, I don’t think MIRI is at the object level. Quote from the comm. strat.: “The main audience we want to reach is policymakers.”
Communication is no longer a passive background channel for observing the world; speech becomes an action that changes it. Predictions start to influence the things they predict.
Say AI doom is a certainty. People will be afraid and stop research. A few years later, doom doesn’t happen, and everyone complains.
Say AI doom is an impossibility. Research continues, something something paperclips. A few years later, nobody will complain, because no one will be alive.
(This example itself is overly simplistic, real-world politics and speech actions are even more counterintuitive.)
So MIRI became a political organization. Their stated goal is “STOP AI”, and they took the radical approach to it. Politics is different from rationality, and radical politics is different from standard politics.
For example, they say they want to shatter the Overton window. Infighting usually breaks groups; but in the process, opponents have to engage with their position, which is a stated subgoal.
It’s ironic that a certain someone said Politics is the Mind-Killer a decade ago. But because of that, I think they know what they are doing. And it might work in the end.
Interesting, thank you. I think that all makes sense, and I’m sure it plays at least some part in their strategy. I’ve wondered about this possibility a little bit.
Yudkowsky has been consistent in his belief that doom is near certain without a lot more time to work on alignment. He’s publicly held that opinion, and spent a huge amount of effort explaining and arguing for it since well before the current wave of success with deep networks. So I think for him at least, it’s a sincerely held belief.
Your point about the stated belief changing the reality is important. Everything is safer if you think it’s dangerous—you’ll take more precautions.
With that in mind, I think it’s pretty important for even optimists to heavily sprinkle in the message “this will probably go well IF everyone involved is really careful”.
By the way, are you planning on keeping this general format/framework for the final version of your post on this topic? I have some more thoughts on this matter that are closely tied to ideas you’ve touched upon here and that I would like to eventually write into a full post, and referencing yours (once published) at times seems to make sense here.
Thanks! I’ll let you know when I do a full version; it will have all of the claims here I think. But for now, this is the reference; it’s technically a comment but it’s permanent and I consider it a short post.
I’m not sure I see the conflict? If you’re a longtermist, most value is in the far future anyways. Delaying AGI by 10 years to buy even a 0.1% improvement in the chance of aligning AI seems like a good deal. I don’t agree with MIRI’s strong claims, but maybe those strong claims will slow AI progress, and that would be good by my lights.
What concerns me more is that their comms will have the unexpected bad effect of speeding up AI progress. On the outside view: (a) their comms have arguably backfired in the past, and (b) they don’t seem to do much red-teaming, which I suspect is associated with unintentional harms, especially in a domain with few feedback loops.
Most of the world is not longtermist, which is one reason MIRI’s comms have backfired in the past. Most humans care vastly more about themselves, their children and grandchildren than they do about future generations. Thus, it makes perfect sense to them to increase the chance of a really good future for their children while reducing the odds of longterm survival. Delaying ten years is enough, for instance, to dramatically shift the odds of personal survival for many of us. It might make perfect sense for a utilitarian longtermist to say “it’s fine if I die to gain a .1% chance of a good long term future for humanity”, but that statement sounds absolutely insane to most humans.
Do you think people would vibe with it better if it was framed “I may die, but it’s a heroic sacrifice to save my home planet from may-as-well-be-an-alien-invasion”? Is it reasonable to characterize general superintelligence as an alien takeover and if it is, would people accept the characterization?
Yes, I think that framing would help. I doubt it would shift public opinion that much, probably not even close to more than 50% in the current epistemic environment. The issue is that we really don’t know how hard alignment is. If we could say for sure that pausing for ten years would improve our odds of survival by, say, 25%, then I think a lot of people of the relevant ages (like probably me) would actually accept the framing of a heroic sacrifice.
Yeah, getting specific unpause requirements seems high-value for convincing people who would not otherwise want a pause, but I can’t imagine specifying them in terms of time in any reasonable way; instead they would need to look like a technical specification: “once we have developed x, y, and z, then it is safe to unpause.” We just need to figure out what the x, y, and z requirements are. Then we can estimate how long it will take to develop x, y, and z, and that estimate will get more refined and accurate as progress is made. But since the requirements are likely to involve unknown unknowns in theory-building, any estimate would be more of a wild guess, and it seems better to be honest about that than to say “yeah, sure, ten years” and then, after ten years without the progress, say “whoops, looks like it’s going to take a little longer!”
As for odds of survival, my personal estimates feel more like a 1% chance of some kind of “alignment by default / human in the loop with prosaic scaling” scheme working, as opposed to maybe more like 50% if we took the time to set up an “aligned before you turn it on” scheme, so that would be improving our odds by about 5000%. Though I think you were thinking of adding rather than scaling odds with your 25%, so 49% in additive terms; I don’t think that’s a good habit for thinking about probability. Also, I feel hopelessly uncalibrated for this kind of question… I doubt I would trust anyone’s estimates; it’s part of what makes the situation so spooky.
How do you think public acceptance would compare between “pause until we meet target x, and you are allowed to help us reach target x as much as you want” and “pause for some set period of time”?
Agreed that scaling rather than addition is usually the better way to think about probabilities. In this case we’ve done so little work on alignment that I think it might actually be more like additive, from 1% to 26% or 50% to 75% with ten extra years relative to the real current odds if we press ahead—which nobody knows.
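To spell out the two bookkeeping conventions being contrasted (a quick illustration using only the numbers already in this thread):

$$\text{additive (+25 points):}\quad 0.01 \to 0.26, \quad 0.50 \to 0.75$$
$$\text{multiplicative (}\times 50\text{, i.e. }+5000\%\text{):}\quad 0.01 \to 0.50$$

Pure multiplicative scaling of probabilities breaks down as they grow (it would push 0.50 past 1), which is why scaling is normally done on odds rather than on probabilities.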
I’m pretty sure it would be an error to trust anyone’s estimate at this time, because people with roughly equal expertise and wisdom (e.g., Yudkowsky and Christiano) give such wildly different odds. And the discussions between those viewpoints always trail off into differing intuitions.
I also give very poor odds to alignment by default, and to prosaic alignment as it’s usually discussed. But there are some pretty obvious techniques that are so low-tax that I think they’ll be implemented even by orgs that don’t take safety very seriously.
I’m curious if you’ve read my Instruction-following AGI is easier and more likely than value aligned AGI and/or Internal independent review for language model agent alignment posts. Instruction-following is human-in-the-loop so that may already be what you’re referring to. But some of the techniques in the independent review post (which is also a review of multiple methods) go beyond prosaic alignment to apply specifically to foundation model agents. And wisely-used instruction-following gives corrigibility with a flexible level of oversight.
I’m curious what you think about those techniques if you’ve got time to look.
I think public acceptance of a pause is only part of the issue. The Chinese might actually not pursue AGI if they didn’t have to race the US. But Russia and North Korea will almost certainly pursue it; although they’ve got very limited resources and technical chops for making much progress on new foundation models, they still might get to real AGI by turning next-gen foundation models (which there’s not time to pause) into scaffolded cognitive architectures.
But yes, I do think there’s a chance we could get the US and European public to support a pause using some of the framings you suggest. But we’d better be sure that’s a good idea. Lots of people, notably Russians and North Koreans, are genuinely way less cautious even than Americans—and absolutely will not honor agreements to pause.
Those are some specifics; in general I think it’s only useful to talk about what “we” “should” do in the context of what particular actors actually are likely to do in different scenarios. Humanity is far from aligned, and that’s a problem.
“we’ve done so little work on alignment that I think it might actually be more like additive, from 1% to 26% or 50% to 75% with ten extra years relative to the real current odds if we press ahead—which nobody knows.” 😭🤣 I really want “We’ve done so little work the probabilities are additive” to be a meme. I feel like I do get where you’re coming from.
I agree about pause concern. I also really feel that any delay to friendly SI represents an enormous amount of suffering that could be prevented if we got to friendly SI sooner. It should not be taken lightly. And being realistic about how difficult it is to align humans seems worthwhile. When I talk to math ppl about what work I think we need to do to solve this though, “impossible” or “hundreds of years of work” seem to be the vibe. I think math is a cool field because more than other fields, it feels like work from hundreds of years ago is still very relevant. Problems are hard and progress is slow in a way that I don’t know if people involved in other things really “get”. I feel like in math crowds I’m saying “no, don’t give up, maybe with a hundred years we can do it!” And in other crowds I’m like “c’mon guys, could we have at least 10 years, maybe?” Anyway, I’m rambling a bit, but the point is that my vibe is very much, “if the Russians defect, everyone dies”. “If the North Koreans defect, everyone dies”. “If Americans can’t bring themselves to trust other countries and don’t even try themselves, everyone dies”. So I’m currently feeling very “everyone slightly sane should commit and signal commitment as hard as they can” cause I know it will be hard to get humanity on the same page about something. Basically impossible, never been done before. But so is ASI alignment.
I haven’t read those links. I’ll check em out, thanks : ) I’ve read a few things by Drexler about, like, automated plan generation and then humans audit and enact the plan. It makes me feel better about the situation. I think we could go farther safer with careful techniques like that, but that is both empowering us and bringing us closer to danger, and I don’t think it scales to SI, and unless we are really serious about using it to map RSI boundaries, it doesn’t even prevent misaligned decision systems from going RSI and killing us.
Yes, the math crowd is saying something like “give us a hundred years and we can do it!”. And nobody is going to give them that in the world we live in.
Fortunately, math isn’t the best tool to solve alignment. Foundation models are already trained to follow instructions given in natural language. If we make sure this is the dominant factor in foundation model agents, and use it carefully (don’t say dumb things like “go solve cancer, don’t bug me with the hows and whys, just git er done as you see fit”, etc.), this could work.
We can probably achieve technical intent alignment if we’re even modestly careful and pay a modest alignment tax. You’ve now read my other posts making those arguments.
Unfortunately, it’s not even clear the relevant actors are willing to be reasonably cautious or pay a modest alignment tax.
The other threads are addressed in responses to your comments on my linked posts.
Yes, you’ve written more extensively on this than I realized. Thanks for pointing out other relevant posts, and sorry for not having taken the time to find them myself; I’m trying to err more on the side of communication than I have in the past.
I think math is the best tool to solve alignment. It might be emotional: I’ve been manipulated and hurt by natural language and the people who prefer it to math, and I have always found engaging with math to be soothing, or at least sobering. It could also be that I truly believe that the engineering rigor that comes with understanding something well enough to do math to it is extremely worthwhile for building a thing of the importance we are discussing.
Part of me wants to die on this hill and tell everyone who will listen, “I know it’s impossible but we need to find ways to make it possible to give the math people the hundred years they need, because if we don’t then everyone dies, so there’s no point in aiming for anything less, and it’s unfortunate because it means it’s likely we are doomed, but that’s the truth as I see it.” I just wonder how much of that part of me is my oppositional defiance disorder and how much is my strategizing for the best outcome.
I’ll be reading your other posts. Thanks for engaging with me : )
I certainly don’t expect people to read a bunch of stuff before engaging! I’m really pleased that you’ve read so much of my stuff. I’ll get back to these conversations soon hopefully, I’ve had to focus on new posts.
I think your feelings about math are shared by a lot of the alignment community. I like the way you’ve expressed those intuitions.
I think math might be the best tool to solve alignment if we had unlimited time—but it looks like we very much do not.