Cognitive Biases Contributing to AI X-risk — a deleted excerpt from my 2018 ARCHES draft
Preface
Several friends have asked me what psychological effects I think could bias human judgement about x-risk.
This isn’t a complete answer, but in 2018 I wrote a draft of “AI Research Considerations for Human Existential Safety” (ARCHES) that included an overview of cognitive biases I thought (and still think) will impair AI risk assessments. Many cognitive bias experiments had already failed to reproduce well in the psychology reproducibility crisis, so I thought it would be a good idea to point out some that did reproduce well, and that were obviously relevant to AI risk. Unfortunately, one prospective coauthor asked that this content be removed, because of the concern that academic AI researchers would be too unfamiliar with the relevant field: cognitive science. That coauthor ultimately decided not to appear on ARCHES at all, and in retrospect I probably should have added back this material.
Just to give you a little historical context on why it was so hard to agree on content like this, getting academics to reach consensus on any acknowledgement of AI x-risk back then was like pulling teeth, even as late as 2018, because of fear of looking weird. This sort of struggle was a major reason I eventually decided to reduce my academic appointment to ~1 day/week in 2022. But since then, the CAIS statement on AI extinction risk came out in May 2023, and now at least some more people are taking the problem seriously. So I figured I’d share this, better late than never.
Notes on definitions
“Prepotent AI” is defined as, roughly speaking, uncontrollably transformative AI, i.e., transformative AI that humans are collectively unable to stop from transforming the world once it is deployed, with effects at a scale at least as significant as the industrial revolution.
“MPAI” is short for “misaligned prepotent AI”, which roughly means “AI that is misaligned with the objective of humanity’s collective survival and flourishing”. In other words, it’s AI that will unstoppably kill all humans after it’s deployed.
The excerpt below only discusses risk types “1b” and “1c”. For discussion of other risk types, see the full paper at https://acritch.com/arches.
Risk Type 1b: Unrecognized prepotence
Consider a scenario in which AI researchers deploy an AI system that they do not realize is prepotent. Because their understanding of the system would be faulty in this way, their assessment of the system’s alignment would also be in question. That is to say, a prepotent AI system whose prepotence was not recognized by its developers is highly likely to be misaligned as well.
The most obvious way in which the prepotence of an AI system might go unrecognized is if no well-developed scientific discipline for assessing or avoiding prepotence exists at its deployment time. In that case, developers would be forced to rely on intuitive assessments of their empirical findings, and a number of systematic biases could then play a role in a failure to intuitively recognize prepotence:
• Illusion of control is a human bias defined as “an expectancy of a personal success probability inappropriately higher than the objective probability would warrant” [Langer, 1975]. Since 1975, a rich literature has examined the presence of illusory control in a variety of circumstances; see Presson and Benassi [1996] for a meta-analysis of 53 experiments from 29 articles. Since prepotence of an AI system by definition constitutes a loss of human control, the illusion of control effect is extremely pertinent.
CITATIONS:
Ellen J Langer. The illusion of control. Journal of Personality and Social Psychology, 32(2):311, 1975.
Paul K Presson and Victor A Benassi. Illusion of control: A meta-analytic review. Journal of Social Behavior and Personality, 11(3):493, 1996.
• Scope insensitivity is the tendency for intuitive human judgments to be insensitive to orders of magnitude, which has been observed in numerous studies on intuitive valuations of public goods [Frederick and Fischhoff, 1998] [Carson, 1997] [Veisten et al., 2004] [Hsee and Rottenstreich, 2004]. The prevention of global catastrophes is by any account a public good. Thus, absent a technically rigorous approach to evaluating both the impact and the likelihood of prepotent and/or MPAI systems, these risks are liable to be either overestimated or underestimated, with cases of underestimation leading to greater x-risk (a toy illustration of this scope effect follows the citations below).
Another way the prepotence of an AI system could go unrecognized is if the system is able to develop and execute a plan for deceiving its developers. That is to say, certain forms of prepotent AI systems might be capable of “tricking” a human engineer into thinking they are not prepotent. This narrative requires the AI system to be endowed with a model of its own development process, as well as social acumen. Such endowments could arise by design, or from a selection process that favors them.
CITATIONS:
Shane Frederick and Baruch Fischhoff. Scope (in)sensitivity in elicited valuations. Risk Decision and Policy, 3(2):109–123, 1998.
Richard T Carson. Contingent valuation surveys and tests of insensitivity to scope. Studies in Risk and Uncertainty, 10:127–164, 1997.
Knut Veisten, Hans Fredrik Hoen, Ståle Navrud, and Jon Strand. Scope insensitivity in contingent valuation of complex environmental amenities. Journal of Environmental Management, 73(4):317–331, 2004.
Christopher K Hsee and Yuval Rottenstreich. Music, pandas, and muggers: on the affective psychology of value. Journal of Experimental Psychology: General, 133(1):23, 2004.
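As a toy illustration of the scope-insensitivity point above (the functional forms and constants here are assumptions chosen purely for illustration, not estimates from the cited studies), suppose the normative disvalue of a catastrophe grows linearly with the number of people affected, while intuitive concern tracks only its order of magnitude:

V_norm(n) = c · n
V_int(n) ≈ a + b · log10(n)

Scaling a harm from n = 10^6 up to n = 10^9 then multiplies V_norm by 1,000 but raises V_int by only 3b, the same increment as going from 10^3 to 10^6; intuition alone barely distinguishes large harms from civilization-scale harms.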
Risk Type 1c: Unrecognized misalignment
Supposing some research team develops an AI system that they realize or suspect is prepotent, there is some risk that the team might mistakenly overestimate the system’s alignment with the preservation of human existence.
Absent a well-developed scientific field of intelligence alignment, a development team might be tempted to fall back on intuitive interpretations of experimental results to assess the safety of their system. Reliance on intuitive judgements about alignment could render developers susceptible to any number of long-observed and highly robust cognitive biases, including:
• Escalation of commitment, also known as the “sunk cost fallacy”, is the tendency of groups and individuals to escalate commitment to a failing course of action, and it has been examined by a rich literature of management and psychological studies; see Staw [1981] and Brockner [1992] for literature reviews. Unfortunately, software development teams in particular appear to be prone to escalating commitments to failed approaches [Keil et al., 2000]. This is extremely pertinent to the risk that an AI development team, having developed a system that is prepotent but difficult to align, might continue to pursue the same design instead of pivoting to a safer approach.
CITATIONS:
Barry M Staw. The escalation of commitment to a course of action. Academy of Management Review, 6(4):577–587, 1981.
Joel Brockner. The escalation of commitment to a failing course of action: Toward theoretical progress. Academy of Management Review, 17(1):39–61, 1992.
Mark Keil, Bernard CY Tan, Kwok-Kee Wei, Timo Saarinen, Virpi Tuunainen, and Arjen Wassenaar. A cross-cultural study on escalation of commitment behavior in software projects. MIS Quarterly, pages 299–325, 2000.
• The mere-exposure effect is the psychological tendency for continued exposure to a stimulus to cause an increase in positive affect toward that stimulus. This effect was found to be “robust and reliable” by a meta-analysis of 208 independent experiments between 1968 and 1987 [Bornstein, 1989]. If the developers of a powerful AI system are susceptible to the mere-exposure effect, their exposure to the system throughout its development process could bias them toward believing it is safe for deployment.
CITATIONS:
Robert F Bornstein. Exposure and affect: Overview and meta-analysis of research, 1968–1987. Psychological Bulletin, 106(2):265, 1989.
• Optimism bias is the general tendency for people to believe that they are less at risk than their peers for many negative events, such as getting cancer, becoming alcoholics, getting divorced, or getting injured in a car accident. For meta-analyses of the phenomenon, see Klein and Helweg-Larsen [2002] and Brewer et al. [2007]; for a neurological account, see Sharot et al. [2007]. Since an MPAI deployment event is one that would lead to the death of the individuals deploying the system, it lands squarely in the domain of optimism bias: AI developers could be either overly optimistic about the safety of their systems relative to other AI systems, or believe that, while their system will pose risks to some humans, it would never have effects drastic enough to affect them.
• Illusion of control. See Type 1b risks.
• Scope insensitivity. See Type 1b risks.
(It is interesting to note that overestimation of technological risks presents an entirely different problem for humanity: a failure to capitalize on the benefits of those technologies. However, the overestimation of risks does not currently appear to be a limiting factor in AI development, perhaps because AI researchers are currently well funded and generally optimistic about the benefits of their work. It is for this reason that we have chosen to author an agenda focused on downside risk.)
CITATIONS:
Cynthia TF Klein and Marie Helweg-Larsen. Perceived control and the optimistic bias: A meta-analytic review. Psychology and Health, 17(4):437–446, 2002.
Noel T Brewer, Gretchen B Chapman, Frederick X Gibbons, Meg Gerrard, Kevin D McCaul, and Neil D Weinstein. Meta-analysis of the relationship between risk perception and health behavior: The example of vaccination. Health Psychology, 26(2):136–145, 2007.
Tali Sharot, Alison M Riccardi, Candace M Raio, and Elizabeth A Phelps. Neural mechanisms mediating optimism bias. Nature, 450(7166):102–105, 2007.
(END OF EXCERPT)
I hope you enjoyed reading this blast from the past! I believe it’s still accurate, and perhaps folks will find it a bit more relevant now.