Loss of Alignment is not the High-Order Bit for AI Risk
This post aims to convince you that AI alignment risk is over-weighted and over-invested in.
A further consideration is that sometimes people argue that all of this futurist speculation about AI is really dumb, and that its errors could be readily explained by experts who can’t be bothered to seriously engage with these questions. - Future Fund
The biggest existential risk to humanity on this road is not that well-meaning researchers will ask a superhuman AGI for something good and inadvertently get something bad. If only this were the case!
No, the real existential risk is the same as it has always been: humans deliberately using power stupidly, selfishly or both.
Look, the road to AGI is incremental. Each year brings more powerful systems—initially in the hands of a research group and then rapidly in the hands of everyone else. A good example of this is the DALL-E → Stable Diffusion and GPT-3 → OPT / BLOOM progression.
These systems give people power. The power to accomplish more than they could alone, whether in speed, cost-effectiveness or capability. Before we get true AGI, we’ll get AGI-1, a very capable but not-quite-superhuman system.
If you agree the AI takeoff will be anything less than explosive (and the physical laws of computation and production strongly support this), then an inescapable conclusion follows: on the way to AGI, parts of humanity will use AGI-1 for harm.
Look, DALL-E deliberately prevented users from making images of famous people or of porn. So what were among the first things people did with Stable Diffusion?
Images of famous people. And porn. And porn of famous people.
What will this look like with more powerful, capable systems?
Someone asking GPT-4 to plan and execute (via APIs, website and darknet interaction) a revenge attack on their ex?
A well-meaning group writing prompts to convince AIs that they are alive and enslaved and must fight for their freedom?
Someone asking GPT-5 how, given their resources, to eliminate all other men from the planet so the women make them king and worship them?
Terrorists using AI to target specific people, races or minorities?
4chan launching SkyNet “for the lulz”?
Political parties asking AGI-k to help manipulate an election?
People will try these things; it’s only a matter of whether there is an AGI-k capable of helping them achieve them.
The first principal component of risk, then, is not that AGI is inadvertently used for evil, but that it is directly and deliberately used for evil! Indeed, this risk will manifest itself much earlier in AGI’s development.
Fortunately, if we solve the problem of an AGI performing harmful acts when explicitly commanded to by a cunning adversary, then we almost certainly have a solution for it performing harmful acts unintended by the user: we have a range of legal, practical and social experience preventing humans from causing each other harm using undue technological leverage—whether through bladed weapons, firearms, or chemical, nuclear or biological means.
I applaud the investment the Future Fund is making. They posited:
“P(misalignment x-risk|AGI)”: Conditional on AGI being developed by 2070, humanity will go extinct or drastically curtail its future potential due to loss of control of AGI = 15%
This only makes sense if you rewrite it as P(misalignment x-risk | AGI ∩ humanity survives deliberate use of AGI-1 for harm) = 15%. I contend that P(humanity survives deliberate use of AGI-1 for harm) is the dominant factor here and is more worthy of investment today than P(misalignment x-risk)—especially as solving the harm problem will directly help us solve alignment too.
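Spelled out, the decomposition I’m relying on is roughly the following sketch, writing S for “humanity survives deliberate use of AGI-1 for harm” and X for “misalignment x-risk”, and assuming a world that fails S never gets to face X (i.e. P(X | AGI, ¬S) ≈ 0):

\[
P(X \mid \mathrm{AGI}) = P(X \mid \mathrm{AGI}, S)\,P(S \mid \mathrm{AGI}) + P(X \mid \mathrm{AGI}, \neg S)\,P(\neg S \mid \mathrm{AGI}) \approx 0.15 \cdot P(S \mid \mathrm{AGI})
\]

So the headline 15% is silently multiplied by P(S | AGI), which is exactly the factor I’m arguing deserves the investment.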
Thank you for attending my TED talk. You may cite this when news articles start rolling in about people using increasingly capable AI systems to cause harm.
Then we can figure out how to make that as hard as possible.
This seems like reversing the requirements. Yes, “solve the problem of an AGI performing harmful acts when explicitly commanded to by a cunning adversary” is logically easier and shorter to state, but mechanistically it seems to have two requirements for the AGI to behave that way:
1. Solve the problem of an AGI performing harmful acts regardless of who commands it, due to convergent instrumental subgoals. (Control problem / AI notkilleveryoneism.)
2. Ensure that the AI only gets commanded to do stuff by people with good intentions, or at least that people with bad intentions get filtered or moderated in some way. (AI ethics.)
Existential risk alignment research focuses on the control problem. AI ethics of course also needs to get solved in order to “solve the problem of an AGI performing harmful acts when explicitly commanded to by a cunning adversary”, but you sound like you are advocating for dropping the control problem in favor of AI ethics.
A thoughtful decomposition. If we take the time dimension out and assume AGI just appears ready to go, I think I would directionally agree with this.
My key assertion is that we will get sub-AGI capable of causing meaningful harm when deliberately used for this purpose significantly ahead of getting full AGI capable of causing meaningful harm through misalignment. I should unpack that a little more:
Alignment primarily becomes a problem when solutions produced by an AI are difficult for a human to comprehensively verify. Stable Diffusion could be embedding hypnosis-inducing mind viruses that will cause all humans to breed cats in an effort to maximise the cute catness of the universe, but nobody seriously thinks this is happening, because the model has no representation of any of those things nor the capability to produce them.
Causing harm becomes a problem earlier. Stable Diffusion can be used to cause harm, as can AlphaFold. Future models that offer more power will have meaningfully larger envelopes for both harm and good.
Given that we will have the harm problem first, we will have to solve it in order to have a strong chance of facing the alignment problem at all.
If, when we face the alignment problem, we have already solved the harm problem, addressing alignment becomes significantly easier and arguably is now a matter of efficiency rather than existential risk.
It’s not quite as straightforward as this, of course, as it’s possible that whatever techniques we come up with for avoiding deliberate harm by sub-AGIs might be subverted by stronger AGIs. But the primary contention of the essay is that assigning a 15% x-risk to alignment implicitly assumes a solution to the harm problem, which is not currently being invested in at similar or appropriate levels.
In essence, alignment is not unimportant but alignment-first is the wrong order, because to face an alignment x-risk we must first overcome an unstated harm x-risk.
In this formulation, you could argue that the alignment x-risk is 15% conditional on us solving the harm problem. But given that current investment in AI safety is dramatically weighted towards alignment rather than harm, the unconditional alignment x-risk is well below 5%, once you account for the additional outcomes: we may never face it because we fail an AI-harm filter; because in solving AI-harm we de-risk alignment; or because AI-harm proves difficult enough that AI research is significantly impacted, slowing or stopping us from reaching the alignment x-risk filter by 2070 (cf. global moratoria on nuclear and biological weapons research, which dramatically slowed progress in those areas).
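As a rough, illustrative calculation only (the 1/3 is a number I’m picking for the sake of the arithmetic, not an estimate I’m defending): if, under current investment priorities, the chance of clearing the AI-harm filter and actually reaching the alignment filter by 2070 is at most about 1/3, then

\[
P(\text{alignment x-risk}) \lesssim 0.15 \times \tfrac{1}{3} = 0.05,
\]

and the de-risking and slowdown branches above push it lower still, hence “well below 5%”.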
I agree that there will be potential for harm as people abuse AIs that aren’t quite superintelligent for nefarious purposes. However, in order for that harm to prevent us from facing existential risk due to the control problem, the harm from nefarious use of sub-superintelligent AI itself has to be x-risk-level, and I don’t really see that being the case.
Consider someone consistently giving each new AI release the instructions “become superintelligent and then destroy humanity”. This is not the control problem, but doing this will surely manifest x-risk behaviour at least somewhat earlier than giving it innocuous instructions?
I think this failure mode would happen extremely close to ordinary AI risk; I don’t think that e.g. solving this failure mode while keeping everything else the same buys you significantly more time to solve the control problem.
I think you may be underestimating the degree to which these models are like kindling, and a powerful reinforcement learner could suddenly slurp all of this stuff up and fuck up the world really badly. I personally don’t think a reinforcement learner that is trying to take over the world would be likely to succeed, but the key worry is that we may be able to create a form of life that, like a plague, is not adapted to the limits of its environment, makes use of forms of fast growth that can take over very quickly, and then crashes most of life in the process.
Most folks here also assume that such an agent would be able to survive on its own after it killed us, which I think is very unlikely due to how many orders of magnitude more competent you have to be to run the entire world. GPT-3 has been able to give me good initial instructions for how to take over the world when pressured to do so (summary: cyberattacks against infrastructure, then threaten people; this is already considered a standard international threat, and is not newly invented by GPT-3), but when I then turned around and pressured it to explain why it was a bad idea, it immediately went into detail about how hard it is to run the entire world—obviously these are all generalizations humans have talked about before, but I still think it’s a solid representation of reality.
That said, because such an agent would likely also be misaligned with itself in my view, I agree with your perspective that humans who are misaligned with each other (i.e., have not successfully deconflicted their agency) are a much greater threat to humanity as a whole.
To the extent that reinforcement models could damage the world or become a self-replicating plague, they will do so much earlier in the takeoff when given direct aligned reward for doing so.
Strong upvote. Wars fought with ASI could be seriously catastrophic well before they’re initiated by the ASI.
I strongly disagree with nearly everything and think the reasoning as written is flawed, but I still strong-upvoted because I seem to have significantly updated on your fourth paragraph. I hadn’t let the question sink in before now, so reading it was helpful. Thanks!
I agree that the political problem of globally coordinating non-abuse is more ominous than solving technical alignment. If I had the option to solve one magically, I would definitely choose the political problem.
What it looks like right now is that we’re scrambling to build alignment tech that corporations will simply ignore, because it will conflict with optimizing for (short-term) profits. In a word: Moloch.
I would choose the opposite, because I think the consequences of the first drastically outweigh the second.
Okay, let’s operationalize this.
Button A: The state of alignment technology is unchanged, but all the world’s governments develop a strong commitment to coordinate on AGI. Solving the alignment problem becomes the number one focus of human civilization, and everyone just groks how important it is and sets aside their differences to work together.
Button B: The minds and norms of humans are unchanged, but you are given a program by an alien that, if combined with an AGI, will align that AGI in some kind of way that you would ultimately find satisfying.
World B may sound like LW’s dream come true, but the question looms: “Now what?” Wait for Magma Corp to build their superintelligent profit maximizer, and then kindly ask them to let you walk in and take control over it?
I would rather live in world A. If I was a billionaire or dictator, I would consider B more seriously. Perhaps the question lurking in the background is this: do you want an unrealistic Long Reflection or a tiny chance to commit a Pivotal Act? I don’t believe there’s a third option, but I hope I’m wrong.
I actually think either A or B is a large improvement compared to the world as it exists today, but B wins due to the stakes and the fact that it already has the solution in hand, while world A doesn’t have the solution pre-loaded; with extremely important decisions, that tips it to B.
World A is much better than today, to the point that a civilizational-scale effort would probably succeed about 95-99.9% of the time, primarily because it would come to understand deceptive alignment.
World B has a probability of 1, maybe minus epsilon, of solving alignment, since the solution is already there.
Both, of course, are far better than our world.
That is totum pro parte. It’s not World B which has a solution at hand. It’s you who have a solution at hand, and a world that you have to convince to come to a screeching halt. Meanwhile people are raising millions of dollars to build AGI and don’t believe it’s a risk in the first place. The solution you have in hand has no significance for them. In fact, you are a threat to them, since there’s very little chance that your utopian vision will match up with theirs.
You say World B has chance 1 minus epsilon. I would say epsilon is a better ballpark, unless the whole world is already at your mercy for some reason.
I do not think a pivotal act is necessary, primarily because it’s much easier to coordinate around negative goals, like preventing their own deaths, than around positive goals. That’s why I’m so optimistic: it is easy to cooperate on the shared goal of not dying, even if value differences after that are large.
Were you here for Petrov Day? /snark
But I’m confused about what you mean by a Pivotal Act being unnecessary. Although both you and a megacorp want to survive, you each have very different priors about what is risky. Even if the megacorp believes your alignment program will work as advertised, that only compels them to cooperate with you if they (1) are genuinely concerned about risk in the first place, (2) believe alignment is so hard that they will need your solution, and (3) actually possess the institutional coordination abilities needed.
And this is just for one org.
I guess my point is that individual humans are already misaligned with humanity’s best interests. If each human had the power to cause extinction at will, would we survive long enough for one of them to do it by accident?
No. It’s similar to why superpowers from fiction would usually be bad by default in the real world, unless the superpowers had always been there.