The linked post is work done by Tom Adamczewski while at FHI. I think this sort of expository and analytic work is very valuable, so I’m cross-posting it here (with his permission). Below is an extended summary; for the full document, see his linked blog post.
Many people now work on ensuring that advanced AI has beneficial consequences. But members of this community have made several quite different arguments for prioritising AI.
Early arguments, and in particular Superintelligence, identified the “alignment problem” as the key source of AI risk. In addition, the book relies on the assumption that superintelligent AI is likely to emerge through a discontinuous jump in the capabilities of an AI system, rather than through gradual progress. This assumption is crucial to the argument that a single AI system could gain a “decisive strategic advantage”, that the alignment problem cannot be solved through trial and error, and that there is likely to be a “treacherous turn”. Hence, the discontinuity assumption underlies the book’s conclusion that existential catastrophe is a likely outcome.
The argument in Superintelligence combines three features: (i) a focus on the alignment problem, (ii) the discontinuity assumption, and (iii) the resulting conclusion that an existential catastrophe is likely.
Arguments that abandon some of these features have recently become prominent. They have also generally been made in less detail than the early arguments.
One line of argument, promoted by Paul Christiano and Katja Grace, drops the discontinuity assumption, but continues to view the alignment problem as the source of AI risk. They argue that, even under more gradual scenarios, unless we solve the alignment problem before advanced AIs are widely deployed in the economy, these AIs will cause human values to eventually fade from prominence. They appear to be agnostic about whether these harms would warrant the label “existential risk”.
Moreover, others have proposed AI risks that are unrelated to the alignment problem. I discuss three of these: (i) the risk that AI might be misused, (ii) that it could make war between great powers more likely, and (iii) that it might lead to value erosion from competition. These arguments don’t crucially rely on a discontinuity, and the risks are rarely existential in scale.
It’s not always clear which of the arguments actually motivates members of the beneficial AI community. It would be useful to clarify which of these arguments (or yet other arguments) are crucial for which people. This could help with evaluating the strength of the case for prioritising AI, deciding which strategies to pursue within AI, and avoiding costly misunderstanding with sympathetic outsiders or sceptics.
I guess this may have been one of those Google docs that people had a lot of private discussions in. This makes me rather discouraged from commenting, knowing that anything I write may have been extensively discussed already and the author just didn’t have time to or didn’t feel like incorporating those comments/viewpoints into the published document. (Some of the open questions listed seem to have fairly obvious answers. Did no one suggest such answers to the author? Or were they found wanting in some way?) Also it seems like the author is not here to participate in a public discussion, or may not have the time to do so (given his new job situation).
However, I did write a bunch of comments under one of your posts (you = ricraz = Richard Ngo, if I remember correctly), which appears to cover roughly the same topic (shifts in AI risk arguments over time), and those comments may also be somewhat relevant here. Beyond that, I wonder if you could summarize what is new or different in this document compared to yours, and whether you think there’s anything in it that would be especially valuable to have a public discussion about (even absent participation of the author).
Planned summary:
Early arguments for AI safety focus on existential risk caused by a failure of alignment combined with a sharp, discontinuous jump in AI capabilities. The discontinuity assumption is needed in order to argue for a treacherous turn, for example: without a discontinuity, we would presumably see less capable AI systems fail to hide their misaligned goals from us, or attempt to deceive us without success. Similarly, in order for an AI system to obtain a decisive strategic advantage, it would need to be significantly more powerful than all the other AI systems already in existence, which requires some sort of discontinuity.
Now, there are several other arguments for AI risk, though none of them have been made in great detail, and they are spread out over a few blog posts. This post analyzes several of them and points out some open questions.
First, even without a discontinuity, a failure of alignment could lead to a bad future: since the AIs have more power and intelligence, their values will determine what happens in the future, rather than ours. (Here **it is the difference between AIs and humans that matters**, whereas for a decisive strategic advantage it is the difference between the most intelligent agent and the next-most intelligent agents that matters.) See also More realistic tales of doom and Three impacts of machine intelligence. However, it isn’t clear why we wouldn’t be able to fix the misalignment at the early stages when the AI systems are not too powerful.
Even if we ignore alignment failures, there are other AI risk arguments. In particular, since AI will be a powerful technology, it could be used by malicious actors; it could help ensure robust totalitarian regimes; it could increase the likelihood of great-power war, and it could lead to stronger competitive pressures that erode value. With all of these arguments, it’s not clear why they are specific to AI in particular, as opposed to any important technology, and the arguments for risk have not been sketched out in detail.
The post ends with an exhortation to AI safety researchers to clarify which sources of risk motivate them, because it will influence what safety work is most important, it will help cause-prioritization efforts that need to determine how much money to allocate to AI risk, and it can help avoid misunderstandings with people who are skeptical of AI risk.
Planned opinion:
I’m glad to see more work of this form; it seems particularly important to gain more clarity on what risks we actually care about, because it strongly influences what work we should do. In the particular scenario of an alignment failure without a discontinuity, I’m not satisfied with the solution “we can fix the misalignment early on”, because early on, even if the misalignment is apparent to us, it likely will not be easy to fix, and the misaligned AI system could still be useful because it is “aligned enough”, at least at this low level of capability.
Personally, the argument that motivates me most is “AI will be very impactful, and it’s worth putting effort into making sure that impact is positive”. I think the scenarios involving alignment failures without a discontinuity are a particularly important subcategory of this argument: while I do expect we will be able to handle this issue if it arises, this is mostly because of meta-level faith in humanity to deal with the problem. We don’t currently have a good object-level story for why the issue _won’t_ happen, or why it will be fixed when it does happen, and it would be good to have such a story in order to be confident that AI will in fact be beneficial for humanity.
I know less about the non-alignment risks, and my work doesn’t really address any of them. They seem worth more investigation; currently my feeling towards them is “yeah, those could be risks, but I have no idea how likely the risks are”.
I agree that slower makes the problem easier, but disagree about how slow is slow enough. I have pretty high confidence that a 200-year takeoff is slow enough; faster than that, I become increasingly unsure.
For example: one scenario would be that there are years, even decades, in which worse and worse AGI accidents occur, but the alignment problem is very hard and no one can get it right (or: aligned AGIs are much less powerful and people can’t resist tinkering with the more powerful unsafe designs). As each accident occurs, there’s bitter disagreement around the world about what to do about this problem and how to do it, and everything becomes politicized. Maybe AGI research will be banned in some countries, but maybe it will be accelerated in other countries, on the theory that (for example) smarter systems and better understanding will help with alignment. And thus there would be more accidents and bigger accidents, until sooner or later there’s an existential catastrophe.
I haven’t thought about the issue super-carefully … just a thought …
An alternate framing could be about changing group boundaries rather than changing demographics in an isolated group.
There were surely people in 2010 who thought that the main risk from AI was it being used by bad people. The difference might not be that these people have popped into existence or only recently started talking—it’s that they’re inside the fence more than before.
And of course, reality is always complicated. One of the concerns in the “early LW” genre is value stability and self-trust under self-modification, which has nothing to do with sudden growth. And one of the “recent” genre concerns is arms races, which are predicated on people expecting sudden capability growth to give them a first mover advantage.
I welcome more discussion of different takeoff trajectories, competitive risks (both among AIs and among human+AI coalitions), and value-drift risks.
I worry a fair bit (and don’t really know whether the concern is even coherent) that I value individual experiences and diversity-of-values in a way that can erode if encoded and enforced in a formal way, which is one of the primary mechanisms being pursued by current research.
Thanks for making the cross-post. Do you know if the author is likely to see comments posted here, or if he prefers to receive comments another way?
I found the linked post very interesting, and seemingly useful. Thanks for cross-posting it! And it’s a shame that the author didn’t get the time to pursue the project further.
One quibble:
The author does provide hedges, such as that “these are three stylised attitudes. It’s likely that many people have an intermediate view that attaches some credence to each of these stories.” But one thing that struck me as notably missing was the following variant of the first attitude:
Indeed, my impression is that a large portion of people motivated by the discontinuity-based arguments actually see a discontinuity as less than 50% likely, perhaps even very unlikely, but not extremely unlikely. And they thus see it as a risk worth preparing for. (I don’t have enough knowledge of the community to say how large that “large portion” is.)
And this isn’t the same as the third attitude, really, because it may be that these people would shift their priorities to something else if they came to see a discontinuity as even less likely. AI risk might not be the only lever they see as potentially worth pulling to affect the long-term future, and they might not be in properly Pascalian territory, just ordinary expected-value territory.
That said, this is sort of like a blend between the first and third attitudes shown. And perhaps by “probable” the author actually meant something like “plausible”, rather than “more likely than not”. But this point still seemed to me worth mentioning, particularly as I think it’s related to the general pattern of people outside of the existential risk community assuming that those within it see x-risks as likely, whereas most seem to see them as unlikely but still a really, really big deal.