What would we do if alignment were futile?
This piece, which predates ChatGPT, is no longer endorsed by its author.
Eliezer’s recent discussion on AGI alignment is not optimistic.
I consider the present gameboard to look incredibly grim… We can hope there’s a miracle that violates some aspect of my background model, and we can try to prepare for that unknown miracle
For this post, instead of debating Eliezer’s model, I want to pretend it’s true. Let’s imagine we’ve all seen satisfactory evidence for the following:
AGI is likely to be developed soon*
Alignment is a Hard Problem. Current research is nowhere close to solving it, and this is unlikely to change by the time AGI is developed
Therefore, when AGI is first developed, it will only be possible to build misaligned AGI. We are heading for catastrophe
How we might respond
I don’t think this is an unsolvable problem. In this scenario, there are two ways to avoid catastrophe: massively increase the pace of alignment research, and delay the deployment of AGI.
Massively increase the pace of alignment research via 20x more money
I wouldn’t rely solely on this option. Lots of brilliant and well-funded people are already trying really hard! But I bet we can make up some time here. Let me pull some numbers out of my arse:
$100M per year is spent on alignment research worldwide (this is a guess, I don’t know the actual number)
Our rate of research progress is proportional to the square root of our spending. That is, to double progress, you need to spend 4x as much**
Suppose we spent $2B a year. This would let us accomplish in 5 years what would otherwise have taken 22 years (arithmetic sketched below)
$2B a year isn’t realistic today, but it’s realistic in this scenario, where we’ve seen persuasive evidence Eliezer’s model is true. If AI safety is the critical path for humanity’s survival, I bet a skilled fundraiser can make it happen
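For concreteness, here’s a quick sketch of the arithmetic behind that 22-year figure, taking the square-root scaling assumption and the guessed $100M baseline at face value (so 20x the money buys roughly a 4.5x speedup):

$$\frac{\text{rate at }\$2\text{B/yr}}{\text{rate at }\$100\text{M/yr}} = \sqrt{\frac{2000}{100}} = \sqrt{20} \approx 4.5, \qquad 5\ \text{years} \times 4.5 \approx 22\ \text{years of baseline-pace progress}$$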
Of course, skillfully administering the funds is its own issue...
Slow down AGI development
The problem, as I understand it:
Lots of groups, like DeepMind, OpenAI, Huawei, and the People’s Liberation Army, are trying to build powerful AI systems
No one is very far ahead. For a number of reasons, it’s likely to stay that way
We all have access to roughly the same computing power, within an OOM
We’re all seeing the same events unfold in the real world, leading us to similar insights
Knowledge tends to proliferate among researchers. This is in part a natural tendency of academic work, and in part a deliberate effort by OpenAI
When one group achieves the capability to deploy AGI, the others will not be far behind
When one group achieves the capability to deploy AGI, they will have powerful incentives to deploy it. AGI is really cool, will make a lot of money, and the first to deploy it successfully might be able to impose their values on the entire world
Even if they don’t deploy it, the next group still might. If even one chooses to deploy, a permanent catastrophe strikes
What can we do about this?
1. Persuade OpenAI
First, let’s try the low hanging fruit. OpenAI seems to be full of smart people who want to do the right thing. If Eliezer’s position is true, then I bet some high status rationalist-adjacent figures could be persuaded. In turn, I bet these folks could get a fair listen from Sam Altman/Elon Musk/Ilya Sutskever.
Maybe they’ll change their mind. Or maybe Eliezer will change his own mind.
2. Persuade US Government to impose stronger Export Controls
Second, US export controls can buy time by slowing down the whole field. They’d also make it harder to share your research, so the leading team accumulates a bigger lead. They’re easy to impose: it’s a regulatory move, so an act of Congress isn’t required. There are already export controls on narrow areas of AI, like automated imagery analysis. We could impose export controls on areas likely to contribute to AGI and encourage other countries to follow suit.
3. Persuade leading researchers not to deploy misaligned AI
Third, if the groups deploying AGI genuinely believed it would destroy the world, they wouldn’t deploy it. I bet a lot of them are persuadable in the next 2 to 50 years.
4. Use public opinion to slow down AGI research
Fourth, public opinion is a dangerous instrument. Giving AGI the same political prominence (and epistemic habits) as climate change research would make a lot of folks miserable. But I bet it could delay AGI by quite a lot.
5. US commits to using the full range of diplomatic, economic, and military action against those who violate AGI research norms
Fifth, the US has a massive array of policy options for nuclear nonproliferation. These range from sanctions (like the ones crippling Iran’s economy) to war. Right now, these aren’t an option for AGI, because the foreign policy community doesn’t understand the threat of misaligned AGI. If we communicate clearly and in their language, we could help them understand.
What now?
I don’t know whether the grim model in Eliezer’s interview is true or not. I think it’s really important to find out.
If it’s false (alignment efforts are likely to work), then we need to know that. Crying wolf does a lot of harm, and most of the interventions I can think of are costly and/or destructive.
But if it’s true (current alignment efforts are doomed), we need to know that in a legible way. That is, it needs to be as easy as possible for smart people outside the community to verify the reasoning.
*Eliezer says his timeline is “short,” but I can’t find specific figures. Nate Soares gives a very substantial chance of AGI arriving within 2 to 20 years, and is 85% confident we’ll see it by 2070
**Wild guess, loosely based on Price’s Law. I think this works as long as we’re nowhere close to exhausting the pool of smart/motivated/creative people who can contribute
After seeing a number of rather gloomy posts on the site in the last few days, I feel a need to point out that problems that we don’t currently know how to solve always look impossible. A smart guy once pointed out how silly it was that Lord Kelvin claimed “The influence of animal or vegetable life on matter is infinitely beyond the range of any scientific inquiry hitherto entered on.” Kelvin just didn’t know how to do it. That’s fine. Deciding it’s a Hard Problem just sort of throws up mental blocks to finding potentially obvious solutions.
Maybe alignment will seem really easy in retrospect. Maybe it’s the sort of thing that requires only two small insights that we don’t currently have. Maybe we already have all the insights we need and somebody just needs to connect them together in a non-obvious way. Maybe somebody has already had the key idea, and just thought to themselves, no, it can’t be that simple! (I actually sort of viscerally suspect that the lynchpin of alignment will turn out to be something really dumb and easy that we’ve simply overlooked, and not something like Special Relativity.) Everything seems hard in advance, and we’ve spent far more effort as a civilization studying asphalt than we have alignment. We’ve tried almost nothing so far.
In the same way that we have an existence-proof of AGI (humans existing) we also have a highly suggestive example of something that looks a lot like alignment (humans existing and often choosing not to do heroin), except probably not robust to infinite capability increase, blah blah.
The “probabilistic mainline path” always looks really grim when success depends on innovations and inventions you don’t currently know how to do. Nobody knows what probability to put on obtaining such innovations in advance. If you asked me ten years ago I would have put the odds of SpaceX Starship existing at like 2%, probably even after thinking really hard about it.
That’s not an example of alignment, that’s an example of sub-agent stability, which is assumed to be true due to instrumental convergence in any sufficiently powerful AI system, aligned or unaligned.
If anything, humanity is an excellent example of alignment failure considering we have discovered the true utility function of our creator and decided to ignore it anyway and side with proxy values such as love/empathy/curiosity etc.
Our creator doesn’t have a utility function in any meaningful sense of the term. Genes that adapt best for survival and reproduction propagate through the population, but it’s competitive. Evolution doesn’t have goals, and in fact from the standpoint of individual genes (the level at which selection actually operates) it is entirely a zero-sum game.
Or we are waiting to be outbred by those who didn’t. A few centuries ago, the vast majority of people were herders or farmers who had as many kids as they could feed. Their actions were aligned with maximization of their inclusive genetic fitness. We are the exception, not the rule.
When I look at the world today, it really doesn’t seem like a ship steered by evolution. (Instead it is a ship steered by no one, chaotically drifting.) Maybe if there is economic and technological stagnation for ten thousand years, evolution will get back in the driver’s seat and continue the long slow process of aligning humans… but I think that’s very much not the most probable outcome.
I agree with the first two paragraphs here. :)
Indeed, these are items on a ‘high-level reasons not to be maximally pessimistic about AGI’ list I made for some friends three years ago. Maybe I’ll post that on LW in the next week or two.
I share Eliezer’s pessimism, but I worry that some people only have negative factors bouncing around in their minds, and not positive factors, and that this is making them overshoot Eliezer’s ‘seems very dire’ and go straight to ‘seems totally hopeless’. (Either with regard to alignment research, or with regard to the whole problem. Maybe also related to the tendency IME for people to either assume a problem is easy or impossible, without much room in between.)
I agree. This wasn’t meant as an object-level discussion of whether the “alignment is doomed” claim is true. What I’d hoped to convey is that, even if the research is on the wrong track, we can still massively increase the chances of a good outcome, using some of the options I described
That said, I don’t think Starship is a good analogy. We already knew that such a rocket can work in theory, so it was a matter of engineering, experimentation, and making a big organization work. What if a closer analogy to seeing alignment solved was seeing a proof of P=NP this year?
It doesn’t seem credible for AIs to be more aligned with researchers than researchers are aligned with each other, or with the general population.
Maybe that’s ‘gloomy’ but that’s no different from how human affairs have progressed since the first tribes were established. From the viewpoint of broader society, it’s more of a positive development to understand there’s an upper limit on how much alignment efforts can expect to yield, so that resources are allocated properly to their most beneficial use.
Thank you for articulating this. This matches closely with my own thoughts re Eliezer’s recently published discussion. I strongly agree that if Eliezer is in fact correct then the single most effective thing we could do is to persuasively show that to be true. Right now it’s not even persuasive to many / most alignment researchers, let alone anybody else.
Conditional on Eliezer being wrong though, I’m not sure how valuable showing him to be wrong would be. Presumably it would depend on why exactly he’s wrong, because if we knew that then we might be able to direct our resources more effectively.
I think that for those who agree with Eliezer, this is a very strong argument in favor of pouring money and resources into forecasting research or the like—as Open Philanthropy is in fact doing, I think. And even for people who disagree, if they put any non trivial probability mass on Eliezer’s views, that would still make this high priority.
Why isn’t there a persuasive write-up of the “current alignment research efforts are doomed” theory?
EY wrote hundreds of thousands of words to show that alignment is a hard and important problem. And it worked! Lots of people listened and started researching this
But that discussion now claims these efforts are no good. And I can’t find good evidence, other than folks talking past each other
I agree with everything in your comment except the value of showing EY’s claim to be wrong:
Believing a problem is harder than it is can stop you from finding creative solutions
False belief in your impending doom leads to all sorts of bad decisions (like misallocating resources, or making innocent researchers’ lives worse)
Belief in your impending doom is terrible for your mental health (tbh I sensed a bit of this in the EY discussion)
Insulting groups like OpenAI destroys a lot of value, especially if EY is actually wrong
If alignment were solved, then developing AGI would be the best event in human history. It’d be a shame to prevent that
In other words, if EY is right, we really need to know that. And know it in a way that makes it easy to persuade others. If EY is wrong, we need to know that too, and stop this gloom and doom
I think by impending doom you mean AI doom after a few years or decades, so “impending” from a civilizational perspective, not from an individual human perspective. If I misinterpret you, please disregard this post.
I disagree on your mental health point. Main lines of argument: people who lose belief in heaven seem to be fine, cultures that believe in oblivion seem to be fine, old people seem to be fine, etc. Also, we evolved to be mortal, so we should be surprised if evolution has left us mentally ill-prepared for our mortality.
However, I discovered/remembered that depression is a common side-effect of terminal illness. See Living with a Terminal Illness. Perhaps that is where you are coming from? There is also Death row phenomenon, but that seems to be more about extended solitary confinement than impending doom.
I don’t think this is closely analogous to AI doom. A terminal illness might mean a life expectancy measured in months, whereas we probably have a few years or decades. Also, our lives will probably continue to improve in the lead-up to AI doom, whereas terminal illnesses come with a side order of pain and disability. On the other hand, a terminal illness doesn’t include the destruction of everything we value.
Overall, I think that belief in AI doom is a closer match to belief in oblivion than belief in cancer and don’t expect it to cause mental health issues until it is much closer. On a personal note, I’ve placed > 50% probability on AI doom for a few years now, and my mental health has been fine as far as I can tell.
However, belief in your impending doom, when combined with belief that “Belief in your impending doom is terrible for your mental health”, is probably terrible for your mental health. Also, belief that “Belief in your impending doom is terrible for your mental health” could cause motivated reasoning that makes it harder to salvage value in the face of impending doom.
Zvi just posted EY’s model
If I was convinced that any AI built would destroy the world, I would advocate for a preemptive Butlerian Jihad. I would want coordinated dismantling of all supercomputing facilities (à la nuclear disarmament), which I would define as any system currently capable of training GPT-3. I would want global (NATO/China) coordination to shut down all major semiconductor fabrication plants. Despite my distaste for tyrannical policy, I would support government efforts to confiscate and destroy every Xbox, Playstation, and every GPU over 2 TFlops.
Since we are one or more key insights away from AGI, a small guerilla group with a Beowulf cluster would not be too much of a risk. Even a smallish rogue nation would not be too bad, as AI talent is highly concentrated in Europe, the USA, and China.
There’s no way world governments would coordinate around this, especially since it is a) a problem that most people barely understand and b) would completely cut off all human technological progress. No one would support this policy. Hell, even if ridiculously powerful aliens à la God came and told us that we weren’t allowed to build AGI on the threat of eternal suffering, I’m not sure world governments would coordinate around this.
If alignment was impossible, we might just be doomed.
Yeah. On the off chance that the CIA actually does run the government from the shadows, I really hope some of them lurk on LessWrong.
I think this is a false dichotomy. Eliezer’s position is that AI alignment requires a “miracle” aka “positive model violation” aka “surprising positive development of unknown shape”. If that is false, that does not mean that alignment efforts are “likely to work”. They could still fail just for ordinary non-miraculous reasons. Civilization has failed to prevent many disasters that didn’t require miracles to prevent.
Governments are not known to change their policies based on carefully reasoned arguments, nor do they impose pro-active restrictions on a technology without an extensive track record of the technology having large negative side-effects. A big news-worthy event would need to happen in order for governments to take the sort of actions that could have a meaningful impact on AI timelines, something basically on the scale of 9/11 or larger.
I think steering capabilities research in directions that are likely to yield “survivable first strikes” would be very good and could create common knowledge about the necessity of alignment research. I think GPT-3 derivatives have potential here: they are sort of capped in terms of capability by being trained to mimic human output, yet they’re strong enough that a version unleashed on the internet could cause enough survivable harm to be obvious. Basically we need to maximise the distance between “model strong enough to cause survivable harm” and “model strong enough to wipe out humanity” in order to give humanity time to respond after the coordination-inducing event.
It would need to be the sort of harm that is highly visible and concentrated in time and space, like 9/11, and not like “increasing the incidence of cancer worldwide by 10%” or “driving people insane with polarizing news feed”.
A thought: could we already have a case study ready for us?
Governments around the world are talking about regulating tech platforms. Arguably Facebook’s News Feed is an AI system and the current narrative is that it’s causing mass societal harm due to it optimizing for clicks/likes/time on Facebook/whatever rather than human values.
See also:
This story about how Facebook engineers tried to make tweaks to the News Feed algorithm’s utility function and it backfired.
This story about how Reddit’s recommendation algorithms may have influenced some of the recent stock market craziness.
All we’d have to do is to convince people that this is actually an AI alignment problem.
That’s gonna be really hard; people like Yann LeCun (head of Facebook AI) see these problems as evidence that alignment is actually easy. “See, there was a problem with the algorithm, we noticed it and we fixed it, what are you so worried about? This is just a normal engineering problem to be solved with normal engineering means.” Convincing them that this is actually an early manifestation of a fundamental difficulty that becomes deadly at high capability levels will be really hard.
Do we have to convince Yann LeCun? Or do we have to convince governments and the public?
(Though I agree that the word “All” is doing a lot of work in that sentence, and that convincing people of this may be hard. But possibly easier than actually solving the alignment problem?)
That’s how you turn a technical field into a cesspit of social commentary and political virtue signaling.
Think less AGI-Overwatch committee or GPU-export ban and more “Big business bad!”, “AI racist!”, “Human greed the real problem!”
There seems to be an implicit premise lurking in the background (please correct me if I’m wrong): that determining the degree of alignment to a mutually satisfactory level will be possible at all, that it will not be NP-hard.
i.e. Even if that ‘unknown miracle’ does happen, can we be certain that everyone in a room could closely agree on the ‘how much alignment’ questions?
Try to come up with the best possible “unaligned” AGI. It’s better to be eaten by something that then goes out and explores a broad range of action in an interesting way, especially if you can arrange that it enjoy it, than it is to be eaten by Clippy.
“It’s better to be eaten by something … if you can arrange that it enjoy it”
Given my current state of knowledge, I’d advise against trying to do this because it might introduce s-risks. If you don’t know how to do alignment, then I’d guess it’s best to steer clear of conscious states if possible.
(Separately, I disagree because I don’t think our success odds are anywhere near low enough that we should give up on alignment!)
Point taken. Although, to be honest, I don’t think I can tell what would or would not be conscious anyway, and I haven’t heard anything to convince me that anybody else can either.
… and probably I shouldn’t have answered the headline question while ignoring the text’s points about delay and prevention...
I think we may have different terminal values. I would much rather live out my life in a technologically stagnant world than be eaten by a machine that comes up with interesting mathematical proofs all day.
They may be persuadable that, in a non-emergency situation, they should slow down when their AI seems like it’s teetering on the edge of recursive self-improvement. It’s much harder to persuade them to
1. not publish their research that isn’t clearly “here’s how to make an AGI”, and/or
2. not try to get AGI without a good theory of alignment, when “the other guys” seem only a few years away from AGI.
So ~everyone will keep adding to the big pool of ~public information and ideas about AI, until it’s not that hard to get the rest of the way to AGI, at which point some people showing restraint doesn’t help by that much.
I think this will have the opposite effect. Restricting supply of hardware will only further incentivize efficiency-focused research, which I think is much more critical on the path to AGI than “stack moar layers”.
Even worse, that kind of move would just convince the competitors that AGI is far more feasible, and incentivize them to speed up their efforts while sacrificing safety.
If blocking Huawei failed to work a couple of years ago with an unusually pugnacious American presidency, I doubt this kind of move would work in a future where the Chinese technological base will probably be stronger.
I should’ve been clearer: export controls don’t just apply to physical items. Depending on the specific controls, it can be illegal to publicly share technical data, including source code, drawings, and sometimes even technical concepts
This makes it really hard to publish papers, and it stops you from putting source code or instructions online
Roman Yampolsky has said recently (at a Foresight Salon event, the recording should be posted on YouTube soon) that it would be highly valuable if someone could prove that alignment is impossible. Given the high value for informing AI existential safety investment, I agree with Yampolsky we should have more people working on this (trying to prove theorems (or creating very rigorous arguments) as to whether alignment is possible or impossible).
If we knew with very high certainty that alignment is impossible, then that would compel us to invest more resources into 1. bans/regulation on self-improving AI and other forms of dangerous AI (to buy us time) and 2. figuring out how to survive a world where unaligned AI is likely to be running rampant soon (for instance, maybe we could buy ourselves some time by having humans try to survive in a Mars base or underground bunkers, or we could try merging with the AI in hopes of preserving some of what we value that way).
Yampolsky and collaborators have a paper on this here (disclaimer: I haven’t read it and can’t vouch for its value).
Also… alignment is obviously a continuum, and of course 100% alignment with all human values is impossible.
A different thing you could prove is whether it’s possible to guarantee human control over an AI system as it becomes more intelligent.
There’s also a concern that a slightly unaligned system may become more and more misaligned as its intelligence is scaled up (either by humans re-building/re-training it with more parameters/hardware or via recursive self-improvement). It would be useful if someone could prove whether that is impossible to prevent.
I need to think about this more and read Yampolsky’s paper to really understand what would be the most useful to prove is possible or impossible.
There may be a nanotech critical point. Getting to full advanced nanotech probably involves many stages of bootstrapping. If lots of nanobots have been designed on a computer, then an early stage of the bootstrapping process might be last to be designed. (Building a great nanobot with a mediocre nanobot might be easier than building the mediocre nanobot from something even worse.) This would mean a sudden transition where one group potentially suddenly had usable nanotech.
So, can a team of 100 very smart humans, working together, with hand-coded nanotech, stop an ASI from being created?
I would be unsurprised if blindly scanning and duplicating a human, to the resolution where memories and personality were preserved, was not that hard with hand-coded nanotech. (Like a few researcher-months of effort.)
Making nanomachines that destroy GPUs seems not that hard either.
Nor does making enough money to just buy all the GPUs and top AI talent available.
Actually, finding everyone capable of doing AI research and paying them 2x as much to do whatever they like (as long as they don’t publish and don’t run their code on non-toy problems) sounds like a good plan in general.
“For only $10,000 a year in fine whisky, we can keep this researcher too drunk to do any dangerous AI research.” But thousands more researchers like him still spend their nights sober adjusting hyperparameters. That’s why we’re asking you to help in this charity appeal.
(Idea meant more as interesting wild speculation. Unintended incentives exist. That isn’t to say it would be totally useless.)
If destroying GPUs is the goal, there seem to be a lot simpler, less speculative ways than nanomachines. The semiconductor industry is among the most vulnerable, as the pandemic has shown, with an incredibly long supply chain that mostly consists of a single or a handful of suppliers, defended against sabotage largely by “no one would actually do such a thing”.
Of course that is assuming we don’t have a huge hardware overhang in which case current stockpiles might already be sufficient for doom, or that ASI will be based heavily on GPU computing at all.
If you are really convinced that
1) AGI is coming really fast.
2) Work on alignment has basically no chances to break through in time.
3) Unaligned AGI results in quick and complete annihilation for humankind
4) You firmly believe in utilitarianism/consequentialism.
which seems to me to be Eliezer’s model,
then you should focus your efforts on launching an all-out nuclear war between the USA and China, which would be very unlikely to destroy humanity.
You could even move MIRI to New Zealand or something so that work on alignment can continue after the nuclear blast.
See the similar comment here.
Personally, I think that we can do better than starting a nuclear war (which, after all, just delays the problem, and probably leaves civilization in an even WORSE place to solve alignment when the problem eventually rears its head again—although your idea about disaster-proofing MIRI and other AI safety orgs is interesting), as I said in a reply to that comment. Trying to reduce Earth’s supply of compute (including through military means), and do other things to slow down the field of AI (up to and including the kind of stuff that we’d need to do to stop the proliferation of Nick Bostrom’s “easy nukes”) seems promising. Then with the extra time that buys, we can make differential progress in other areas:
Alignment research, including searching for whole new AGI paradigms that are easier to align.
Human enhancement via genetic engineering, BCIs, brain emulation, cloning John Von Neumann, or whatever.
Better governance tech (prediction markets, voting systems, etc), so that the world can be governed more wisely on issues of AI risk and everything else.
But just as I said in that comment thread, “I’m not sure if MIRI / LessWrong / etc want to encourage lots of public speculation about potentially divisive AGI ‘nonpharmaceutical interventions’ like fomenting nuclear war. I think it’s an understandably sensitive area, which people would prefer to discuss privately.”
Trying to reduce the amount of compute risks increasing hardware overhang once that compute is rebuilt. I think trying to slow down capabilities research (e.g. by getting a job at an AI lab and being obstructive) is probably better.
edit: meh idk. Whether or not this improves things depends on how much compute you can destroy & for how long, ml scaling, politics, etc etc. But the current world of “only big labs with lots of compute budget can achieve SOTA” (arguable, but possibly more true in the future) and less easy stuff to do to get better performance (scaling) both seem good.
So, start making the diplomatic situation around Taiwan as bad as possible? ;)