Lock-In Threat Models

Epistemic status: a synthesis of others’ work, analysed and written over a few weeks. A high-level list of threat models that is open to criticism.

TL;DR

Humanity could end up in a lock-in within the next century. Here I outline the possible routes to that outcome, and prioritise these routes against a set of importance criteria.

Existing Work

Lukas Finnveden

AGI and Lock-In (Finnveden et al., 2022) was authored by Lukas Finnveden during an internship at Open Philanthropy, and is currently the most detailed report on lock-in risk. It expands on notes on value lock-in by Jess Riedel, who co-authored the report along with Carl Shulman. The report references Nick Bostrom’s initial arguments about AGI and superintelligence, and argues that many features of society could be held stable for up to trillions of years, enabled by digital error correction and a solution to the alignment problem. Specifically, the report focuses on the technological feasibility of lock-in, and outlines the central role of AGI in holding features of society stable over the long term.

The authors first argue that dictators made effectively immortal by technological advancement could avoid the succession problem, which explains the end of past totalitarian regimes but would not, in theory, prevent the preservation of future ones. Next, whole brain emulations (WBEs) of dictators could be loaded at will and consulted on novel problems, enabling perpetual value lock-in.

They also argue that AGI-led institutions could themselves competently pursue goals with no value drift, thanks to digital error correction. This resilience can be reinforced by distributing copies of values across space, protecting them from local destruction. Their main threat model is that if AGI is developed, is misaligned, and does not permanently kill or disempower humans, lock-in is the likely default outcome.
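To see why digital error correction supports this kind of stability, here is a minimal sketch (my own illustration, not code from the report) of a repetition code with majority-vote repair: as long as fewer than half of the replicas of a stored value specification are corrupted between repair cycles, the original content is restored exactly, and adding replicas drives the long-run failure probability down sharply.

```python
import random
from collections import Counter

def repair(replicas: list[str]) -> list[str]:
    """Restore every replica to the majority value (simple repetition code)."""
    majority, _ = Counter(replicas).most_common(1)[0]
    return [majority] * len(replicas)

def values_survive(n_replicas: int, corruption_rate: float, cycles: int, seed: int) -> bool:
    """Return True if the stored value specification survives all repair cycles intact."""
    rng = random.Random(seed)
    original = "VALUE_SPEC_V1"  # stands in for a digitally stored goal/value specification
    replicas = [original] * n_replicas
    for _ in range(cycles):
        # Each cycle, every replica is independently corrupted with some probability...
        replicas = [r if rng.random() > corruption_rate else "CORRUPTED" for r in replicas]
        # ...and then repaired by majority vote, which succeeds while most replicas are intact.
        replicas = repair(replicas)
        if replicas[0] != original:
            return False
    return True

if __name__ == "__main__":
    # More redundancy makes long-run corruption of the stored values far less likely.
    for n in (1, 3, 11):
        survived = sum(values_survive(n, corruption_rate=0.01, cycles=10_000, seed=s) for s in range(20))
        print(f"{n:>2} replicas: values intact in {survived}/20 runs")
```

The same logic applies to copies of values distributed across physically separated locations: each additional independent copy makes simultaneous corruption or destruction of a majority correspondingly less likely.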

Finnveden expanded on his threat models in a conversation in 2024, suggesting the following possible ways of arriving at a lock-in:

  1. “Someone has an ideology such that, if this person were to learn that further consideration will change their mind away from the ideology, they would choose to lock-in their ideology rather than go with their future self’s more informed judgement. (Maybe some religions are like this.)”

  2. “Social dynamics where people socially pressure each other to show devotion to their ideology by getting their AIs to screen out all information that could convince them to switch to some other ideology”.

  3. “We can’t build corrigible AI for some reason, and we can’t avoid AI takeover, but we have some choice about what to make our AIs believe/value, and so that choice constitutes something to lock-in”

  4. “A large group of people are trying to reach some agreement about what to do, and for some reason it’s easier for them to reach an ‘object-level agreement’ like ‘these are the rules of the future’ than it is for them to reach a ‘meta-level agreement’ like ‘here’s how we’ll think more about it and make a new decision in the future’ (but why would this be?)”

William MacAskill

In What We Owe the Future (MacAskill, 2022), Will MacAskill introduces the concept of longtermism and its implications for the future of humanity. It was MacAskill who originally asked Lukas Finnveden to write the AGI and Lock-In report, and he expands on its concepts in more philosophical terms in the book’s fourth chapter, ‘Value Lock-In’.

MacAskill defines value lock-in as ‘an event that causes a single value system, or set of value systems, to persist for an extremely long time’. He stresses the importance of current cultural dynamics in potentially shaping the long-term future, explaining that a set of values can easily become stable for an extremely long time. He identifies AI as the key technology with respect to lock-in, citing Finnveden et al. (2022). He echoes their threat models:

  1. An AGI agent with a hard-coded goal, acting on behalf of humans, could competently pursue that goal indefinitely. The beyond-human intelligence of the agent suggests it could successfully prevent humans from doing anything about it.

  2. Whole brain emulations of humans could potentially pursue goals for eternity, since they are effectively immortal.

  3. AGI may enable human immortality; an immortal human could instantiate a lock-in that could last indefinitely, especially if their actions are enabled and reinforced by AGI.

  4. Values could become more persistent if a single value system is globally dominant. If a future world war occurs that one nation or group of nations wins, the value system of the winners may persist.

Nick Bostrom

In Superintelligence (Bostrom, 2014), Nick Bostrom introduces many relevant concepts, such as value alignment and the intelligence explosion. He describes lock-in as a potential second-order effect of the development of superintelligence: a superintelligence can create conditions that effectively lock in certain values or arrangements for an extremely long time, or permanently.

In chapter 5, Bostrom discusses the concept of a decisive strategic advantage, whereby one entity may gain strategic power over the fate of humanity at large. He relates this to the potential formation of a singleton, a single decision-making agency at the highest level. In chapter 7 he introduces the instrumental convergence hypothesis, providing insight into the potential motivations of autonomous AI systems: it identifies a set of instrumental subgoals that an agent is likely to adopt in service of almost any final goal. In chapter 12, he introduces the value loading problem and the risks of misalignment due to issues such as goal misspecification.

Bostrom frames lock-in as one potential outcome of an intelligence explosion, distinct from the permanent disempowerment of humanity. He suggests that a single AI system, gaining a decisive strategic advantage, could control critical infrastructure and resources, becoming a singleton. He also outlines the value lock-in problem: hard-coding human values into AI systems that become generally intelligent or superintelligent may lead to those systems robustly defending those values against any later change, since goal preservation is an instrumentally convergent subgoal. He also notes that the frameworks and architectures leading up to an intelligence explosion might themselves get locked in and shape subsequent AI development.

In What is a Singleton? (Bostrom, 2005), Nick Bostrom defines the singleton, also discussed in Superintelligence, as “a single decision-making agency at the highest level”, and explains that AI may facilitate the creation of one. An agency that obtains a decisive strategic advantage through a technological breakthrough in artificial intelligence or molecular nanotechnology may use its technological superiority to prevent other agencies from catching up, and might become perpetually stable due to AI-enabled surveillance, mind control, and security. He also notes that a singleton could simply turn out to be a bad singleton: ‘If a singleton goes bad, a whole civilisation goes bad’.

Jess Riedel

In Value Lock-In Notes 2021 (Riedel, 2021), Jess Riedel provides an in-depth overview of value lock-in from a longtermist perspective, detailing the technological feasibility of irreversible value lock-in and arguing that permanent value stability seems extremely likely for AI systems with hard-coded values.

Riedel claims that ‘given machines capable of performing almost all tasks at least as well as humans, it will be technologically possible, assuming sufficient institutional cooperation, to irreversibly lock-in the values determining the future of earth-originating intelligent life.’

The report focuses on the formation of a totalitarian super-surveillance police state controlled by an effectively immortal malevolent actor. Riedel explains that the only requirements are one such immortal malevolent actor and surveillance technology.

Robin Hanson

In this commentary on What We Owe the Future, MacAskill on Value Lock-In (Hanson, 2022), economist Robin Hanson argues that immortality is insufficient for value stability. He believes MacAskill underestimates the dangers of central power and is overconfident about the likelihood of rapid AI takeoff. Hanson presents an alternative framing of lock-in threat models:

  1. A centralised ‘take over’ process, in which an immortal power with stable values takes over the world.

  2. A decentralised evolutionary process: as entities evolve in a stable universe, some values might become dominant via evolutionary selection, outcompeting others and ultimately remaining stable.

  3. Centralised regulation: the central powers needed to promote MacAskill’s ‘long reflection’, limit national competition, and preserve value plurality could create value stability through their central dominance. Hanson also suggests this could cause faster value convergence than decentralised evolution.

Paul Christiano

In Machine intelligence and capital accumulation (Christiano, 2014), Paul Christiano proposes a ‘naïve model’ of capital accumulation involving advanced AI systems. He frames agents as ‘soups of potentially conflicting values. When I talk about “who” controls what resources, what I really want to think about is what values control what resources.’ This provides a lens through which lock-in can be seen as the result of particular values coming to control resources and hold features of the world stable.

He claims it is plausible that the arrival of AGI will lead to a ‘crystallisation of influence’, akin to lock-in – where whoever controls resources at that point may maintain control for a very long time. He also expresses concern that influence over the long-term future could shift to ‘machines with alien values’, leading to humanity ending with a whimper.

He illustrates a possible world in which this occurs. In a future with AI, human wages fall below subsistence level as AI replaces labour, and value is concentrated in non-labour resources such as machines, land, and ideas. Unlike human labour, these resources can be directly controlled by their owners. So whoever owns the machines captures the resources and income, and the distribution of resources at the time of the AI transition becomes ‘sticky’: whoever controls resources then can maintain that control indefinitely through reinvestment.
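A toy numerical sketch of this stickiness (my own illustration under simplified assumptions, not Christiano’s actual model): if all income is a return on capital and every owner reinvests at the same rate, absolute wealth grows enormously but relative shares never change, so whoever holds most of the resources at the time of the transition holds them indefinitely.

```python
def final_shares(initial_capital: dict[str, float], growth_rate: float, years: int) -> dict[str, float]:
    """Grow each owner's capital by fully reinvested returns and report final shares of the total."""
    capital = dict(initial_capital)
    for _ in range(years):
        # All income is a return on capital (wages are negligible), and all of it is reinvested.
        capital = {owner: stock * (1 + growth_rate) for owner, stock in capital.items()}
    total = sum(capital.values())
    return {owner: stock / total for owner, stock in capital.items()}

if __name__ == "__main__":
    # Hypothetical shares of world resources at the time advanced AI arrives.
    initial = {"owner_A": 60.0, "owner_B": 30.0, "everyone_else": 10.0}
    shares = final_shares(initial, growth_rate=0.30, years=200)
    # Absolute wealth has grown by many orders of magnitude, but the 60/30/10 split is unchanged.
    print({owner: round(share, 3) for owner, share in shares.items()})
```

In this toy model, an owner who had to consume part of their returns (for example, anyone relying on below-subsistence wages) would see their relative share decay geometrically, which is the sense in which the distribution of resources at the time of AI becomes fixed.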

Synthesis

Lukas Finnveden

  • AGI not disempowering humanity, but creating a lock-in

  • AI-enabled malevolent actor creating lock-in

  • WBE of an actor creating lock-in

  • Anti-rational ideologies prevent people from changing their minds

  • Locking in AI values due to lack of corrigibility

  • A large group decides, for some reason, to stop making moral and decision-making progress and decides the rules of the future

William MacAskill

  • An AGI agent with a hard-coded goal, acting on behalf of humans, could competently pursue that goal indefinitely.

  • WBEs of humans could potentially pursue goals for eternity, since they are effectively immortal.

  • AGI may enable human immortality; an immortal human could instantiate a lock-in that could last indefinitely, especially if their actions are enabled and reinforced by AGI.

  • Values could become more persistent if a single value system is globally dominant. If a future world war occurs that one nation or group of nations wins, the value system of the winners may persist.

Nick Bostrom

  • A single AI system, gaining a decisive strategic advantage, could control critical infrastructure and resources, becoming a singleton

  • An agency that makes a breakthrough in artificial intelligence technology gains a decisive strategic advantage, and creates a singleton by using its technological superiority to prevent other agencies from catching up. It might become perpetually stable due to AI-enabled surveillance, mind control, and security.

Jess Riedel

  • Totalitarian super-surveillance police state controlled by an effectively immortal bad person.

Robin Hanson

  • An immortal power (agent or agency) with stable values takes over the world

  • As entities evolve, values become stable as a result of evolutionary selection

  • The central powers needed to promote the long reflection give rise to the stability of values

Prioritisation Criteria

When defining lock-in, we identified a set of dimensions for choosing which lock-in scenarios to focus on first. While it is not yet clear which scenarios would be positive, we claim that lock-in scenarios with the following properties would be negative:

  1. Harmful: resulting in physical or psychological harm to individuals

  2. Oppressive: suppressing individuals’ freedom, autonomy, speech, or opportunities, or the continued evolution of culture

  3. Persistent: long-term, unrecoverable, or irreversible

  4. Widespread: concerning a significant portion of individuals relative to the total population

Using these properties to direct our attention to the kinds of lock-in scenarios worth focusing on, we prioritise lock-in threat models along the following dimensions:

  • Severe lock-in scenarios. The most harmful, oppressive, persistent, or widespread scenarios; that is, how negative the scenario is expected to be.

  • Likely scenarios. Scenarios with a higher perceived probability of occurring. This does not mean that unlikely scenarios are irrelevant, but likelihood is one dimension on which to prioritise action.

  • Imminent lock-in scenarios. Lock-in scenarios perceived as nearest in time can be argued to be the most important, because the persistence or permanence of lock-in makes it hard or impossible to do anything about once it is instantiated.

4 Fundamental Threat Models

Applying the prioritisation criteria, we list the fundamental threat models synthesised from the work above as follows:

  1. An autonomous AI system competently pursues a goal and prevents interference

  2. An immortal AI-enabled malevolent actor, or whole-brain emulation of a malevolent actor, instantiates a lock-in

  3. A group with sufficient decision-making power decides on the configuration of some feature of the world by majority, natural selection, or war

  4. Anti-rational ideologies prevent cultural, intellectual or moral progress, or accidentally bring about an undesirable future

These categories can be broken down into ever more specific scenarios, for which interventions can be designed.
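One illustrative way to apply the prioritisation criteria to these four categories is a simple weighted scoring over severity, likelihood, and imminence. The sketch below is hypothetical: the weights and scores are placeholders for the sake of the example, not our actual assessments.

```python
# All weights and scores are hypothetical placeholders for illustration, not actual assessments.
WEIGHTS = {"severity": 0.4, "likelihood": 0.3, "imminence": 0.3}

THREAT_MODELS = {
    "autonomous AI pursues a goal and prevents interference": {"severity": 9, "likelihood": 6, "imminence": 7},
    "immortal AI-enabled malevolent actor or WBE":            {"severity": 9, "likelihood": 4, "imminence": 5},
    "powerful group fixes a feature of the world":            {"severity": 7, "likelihood": 6, "imminence": 6},
    "anti-rational ideologies halt progress":                 {"severity": 6, "likelihood": 5, "imminence": 6},
}

def priority(scores: dict[str, float]) -> float:
    """Weighted sum of 0-10 scores across the three prioritisation dimensions."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Rank the four fundamental threat models by their illustrative priority score.
for name, scores in sorted(THREAT_MODELS.items(), key=lambda item: priority(item[1]), reverse=True):
    print(f"{priority(scores):.1f}  {name}")
```

A weighted sum is only one possible aggregation; the point is that scoring scenarios along these dimensions gives a transparent, revisable way to decide where to design interventions first.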

References

  1. Bostrom, N. (2005). What is a Singleton? https://nickbostrom.com/fut/singleton

  2. Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press. https://books.google.co.uk/books?id=7_H8AwAAQBAJ

  3. Christiano, P. (2014, May 14). Machine intelligence and capital accumulation. Rational Altruist. https://rationalaltruist.com/2014/05/14/machine-intelligence-and-capital-accumulation/

  4. Finnveden, L., Riedel, J., & Shulman, C. (2022). AGI and Lock-In. https://forum.effectivealtruism.org/posts/KqCybin8rtfP3qztq/agi-and-lock-in

  5. Hanson, R. (2022). MacAskill on Value Lock-In. https://www.overcomingbias.com/p/macaskill-on-value-lock-inhtml

  6. MacAskill, W. (2022). What We Owe the Future. Basic Books. https://books.google.co.uk/books?id=nd_GzgEACAAJ

  7. Riedel, J. (2021). Value Lock-in Notes 2021.
