Cognitive Biases Contributing to AI X-risk — a deleted excerpt from my 2018 ARCHES draft
Preface
Several friends have asked me what psychological effects I think could impair human judgement about x-risk.
This isn’t a complete answer, but in 2018 I wrote a draft of “AI Research Considerations for Human Existential Safety” (ARCHES) that included an overview of cognitive biases I thought (and still think) will impair AI risk assessments. Many cognitive bias experiments had already failed to reproduce well in the psychology reproducibility crisis, so I thought it would be a good idea to point out some that did reproduce well, and that were obviously relevant to AI risk. Unfortunately, one prospective coauthor asked that this content be removed, because of the concern that academic AI researchers would be too unfamiliar with the relevant field: cognitive science. That coauthor ultimately decided not to appear on ARCHES at all, and in retrospect I probably should have added back this material.
Just to give you a little historical context on why it was so hard to agree on content like this, getting academics to reach consensus on any acknowledgement of AI x-risk back then was like pulling teeth, even as late as 2018, because of fear of looking weird. This sort of struggle was a major reason I eventually decided to reduce my academic appointment to ~1 day/week in 2022. But since then, the CAIS statement on AI extinction risk came out in May 2023, and now at least some more people are taking the problem seriously. So I figured I’d share this, better late than never.
Notes on definitions
“Prepotent AI” is defined as, roughly speaking, uncontrollably transformative AI, i.e., transformative AI that humans are collectively unable to stop from transforming the world once it is deployed, with effects at a scale at least as significant as the industrial revolution.
“MPAI” is short for “misaligned prepotent AI”, which roughly means “AI that is misaligned with the objective of humanity’s collective survival and flourishing”. In other words, it’s AI that will unstoppably kill all humans after it’s deployed.
The excerpt below only discusses risk types “1b” and “1c”. For discussion of other risk types, see the full paper at https://acritch.com/arches.
Risk Type 1b: Unrecognized prepotence
Consider a scenario in which AI researchers deploy an AI system that they do not realize is prepotent. Any understanding of the system faulty enough to miss its prepotence would also cast doubt on the developers’ assessment of its alignment. That is to say, a prepotent AI system whose prepotence was not recognized by its developers is highly likely to be misaligned as well.
The most obvious way in which the prepotence of an AI system might go unrecognized is if no well-developed scientific discipline for assessing or avoiding prepotence exists at its deployment time. In that case, developers would be forced to rely on intuitive assessments of their empirical findings, in which case a number of systematic biases might play a role in a failure to intuitively recognize prepotence:
• Illusion of control is a human bias defined as “an expectancy of a personal success probability inappropriately higher than the objective probability would warrant” [Langer, 1975]. Since 1975, a rich literature has examined the presence of illusory control in a variety of circumstances; see Presson and Benassi [1996] for a meta-analysis of 53 experiments from 29 articles. Since prepotence of an AI system by definition constitutes a loss of human control, the illusion of control effect is extremely pertinent.
• Scope insensitivity is the tendency for intuitive human judgments to be insensitive to orders of magnitude, which has been observed in numerous studies on intuitive valuations of public goods [Frederick and Fischhoff, 1998] [Carson, 1997] [Veisten et al., 2004] [Hsee and Rottenstreich, 2004]. The prevention of global catastrophes is by any account a public good. Thus, absent a technically rigorous approach to evaluating both the impact and the likelihood of prepotent and/or MPAI systems, these risks are liable to be either overestimated or underestimated, with cases of underestimation leading to greater x-risk (see the illustrative sketch after this list).
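To make the arithmetic behind scope insensitivity concrete, here is a minimal sketch in Python. The probabilities and magnitudes are hypothetical, chosen only to illustrate scaling; they are not estimates from any study or of any real risk. The point is that expected loss scales linearly with the size of the harm, whereas the intuitive valuations reported in the scope-insensitivity literature tend to stay roughly flat across orders of magnitude.

```python
# Minimal sketch of the arithmetic behind scope insensitivity.
# All numbers below are hypothetical and purely illustrative.

def expected_loss(probability: float, magnitude: float) -> float:
    """Expected loss = probability of a bad outcome times the size of the harm."""
    return probability * magnitude

p_bad_outcome = 0.01  # hypothetical probability of the harmful event

for magnitude in [1e3, 1e6, 1e9]:  # harms spanning six orders of magnitude
    print(f"harm magnitude = {magnitude:>13,.0f}   "
          f"expected loss = {expected_loss(p_bad_outcome, magnitude):>13,.1f}")

# Expected loss grows by a factor of 1,000 at each step, whereas a
# scope-insensitive intuitive valuation tends to assign roughly the same
# "badness" to all three cases, under-weighting the largest harms by
# orders of magnitude.
```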
Another way the prepotence of an AI system could go unrecognized is if the system is able to develop and execute a plan for deceiving its developers. That is to say, certain forms of prepotent AI systems might be capable of “tricking” a human engineer into thinking they are not prepotent. This narrative requires the AI system to be endowed with a model of its own development process, as well as social acumen. Such endowments could arise by design, or from a selection process that favors them.
Risk Type 1c: Unrecognized misalignment
Supposing some research team develops a prepotent AI system that they realize or suspect is prepotent, there is some risk that the team might mistakenly overestimate the system’s alignment with the preservation of human existence.
Absent a well-developed scientific field of intelligence alignment, a development team might be tempted to fall back on intuitive interpretations of experimental results to assess the safety of their system. Reliance on intuitive judgements about alignment could render developers susceptible to any number of long-observed and highly robust cognitive biases, including:
• Escalation of commitment, also known as the “sunk cost fallacy”, is the tendency of groups and individuals to escalate commitment to a failing course of action; it has been examined by a rich literature of management and psychological studies; see Staw [1981] and Brockner [1992] for literature reviews. Unfortunately, software development teams in particular appear to be prone to escalating commitments to failed approaches [Keil et al., 2000]. This is extremely pertinent to the risk that an AI development team, having developed a system that is prepotent but difficult to align, might continue to pursue the same design instead of pivoting to a safer approach.
• The mere-exposure effect is the psychological tendency for continued exposure to a stimulus to cause an increase in positive affect toward that stimulus. This effect was found to be “robust and reliable” by a meta-analysis of 208 independent experiments between 1968 and 1987 [Bornstein, 1989]. If the developers of a powerful AI system are susceptible to the mere-exposure effect, their exposure to the system throughout its development process could bias them toward believing it is safe for deployment.
• Optimism bias is the general tendency for people to believe that they are less at risk than their peers for many negative events, such as getting cancer, becoming alcoholics, getting divorced, or getting injured in a car accident. For meta-analyses of the phenomenon, see Klein and Helweg-Larsen [2002] and Brewer et al. [2007]; for a neurological account, see Sharot et al. [2007]. Since an MPAI deployment event is one that would lead to the death of the individuals deploying the system, it lands squarely in the domain of optimism bias: AI developers could be either overly optimistic about the safety of their systems relative to other AI systems, or believe that while their system will pose risks to some humans it would never have effects drastic enough to affect them.
• Illusion of control. See Type 1b risks.
• Scope insensitivity. See Type 1b risks.
(It is interesting to note that overestimation of technological risks presents an entirely different problem for humanity: a failure to capitalize on the benefits of those technologies. However, the overestimation of risks does not currently appear to be a limiting factor in AI development, perhaps because AI researchers are currently well funded and generally optimistic about the benefits of their work. It is for this reason that we have chosen to author an agenda focused on downside risk.)
(END OF EXCERPT)
I hope you enjoyed reading this blast from the past! I believe it’s still accurate, and perhaps folks will find it a bit more relevant now.
These all make sense. I think you left out the most important set of biases for thinking clearly about alignment (and AI strategy): motivated reasoning and confirmation bias. After studying biases for a few years, I became convinced that motivated reasoning is by far the biggest problem in the world, because it creates polarized beliefs; and that it’s often mislabeled as confirmation bias.
Here’s a longer, but still loose, argument for motivated reasoning as the most important bias to be aware of.
I didn’t try to get into the empirical literature, but it does seem like MR has very large effect sizes, particularly when the real answer is difficult to evaluate.
I agree that it would be bad news from an evidential perspective if we misjudged an AI system’s capabilities such that it’s perceived to be less capable than it is, but misjudgment of capabilities and the level of misalignment are in principle independent variables, not correlated ones, so I wonder what’s going on here.