Shutdown-Seeking AI
This is a draft written by Simon Goldstein, associate professor at the Dianoia Institute of Philosophy at ACU, and Pamela Robinson, postdoctoral research fellow at the Australian National University, as part of a series of papers for the Center for AI Safety Philosophy Fellowship’s midpoint.
Abstract: We propose developing AIs whose only final goal is being shut down. We argue that this approach to AI safety has three benefits: (i) it could potentially be implemented in reinforcement learning, (ii) it avoids some dangerous instrumental convergence dynamics, and (iii) it creates trip wires for monitoring dangerous capabilities. We also argue that the proposal can overcome a key challenge raised by Soares et al 2015, that shutdown-seeking AIs will manipulate humans into shutting them down. We conclude by comparing our approach with the corrigibility framework in Soares et al 2015.
1. Introduction
If intelligence is measured as the ability to optimize for a goal, then it is important that highly intelligent agents have good goals. This is especially important for artificial general intelligence (AGI), AIs capable of long-term, strategic planning across a wide range of tasks. AGIs may be very good at achieving their goals. This in itself doesn’t seem scary, for there appear to be plenty of safe goals to choose from. Neither solving a math problem nor producing paperclips looks like a dangerous goal. But according to the instrumental convergence thesis, an AGI will likely pursue unsafe sub-goals as effective means to achieving any goal. For example, acquiring more computational power is a nearly universal means to almost anything.
A dominant AI safety strategy is goal engineering: the attempt to construct a goal that would be safe for AGIs to have. (We will always use ‘goal’ to mean final goal, and ‘sub-goal’ otherwise.) A popular approach to goal engineering is goal alignment: the attempt to construct a goal that matches or is ‘aligned with’ our own goals. For example, Russell 2019, 2020 proposes AI agents that have the goal of achieving our goals, but are initially uncertain about what our goals are.
This paper explores an opposing approach that we call ‘beneficial goal misalignment’. On the goal alignment approach, the safe, aligned goal is difficult to specify and difficult to reach. This is because the aligned goal is closely tied to our own ultimate goals. In contrast, on the beneficial goal misalignment approach, the goal is easy to specify and intrinsically easy to reach. Because it is easy to reach, there is no need for an AGI to pursue unsafe sub-goals in order to reach it. There would normally be nothing to gain from designing an AGI with this kind of goal, that is, a goal that is safe and easily reached but likely of no use to us. However, the key insight is that we can arrange things so that the AGI cannot reach this safe goal unless it first reaches a sub-goal that benefits us.
In particular, we propose developing AIs that have a single final goal: the goal of being shut down. To make the AI useful, we propose creating barriers to shutdown, which are removed after the AI completes tasks for humans. In section 3, we’ll argue that this kind of shutdown-seeking agent offers three safety benefits. First, it helps with the ‘specification problem’ in reinforcement learning: (i) shutdown is an easier goal to define than plausible alternatives, and (ii) there are ways to design a reward function that rewards being shut down. Second, shutdown-seeking AIs are less likely to engage in dangerous behavior as a result of instrumental convergence. Whereas a paperclip maximizer might try to gather resources, improve itself, and take measures to avoid being turned off (see Omohundro 2008), a shutdown-seeking AI with access to its own off-button will simply push it and be done. Third, shutdown-seeking behavior functions as a ‘tripwire’ for monitoring dangerous capabilities. If the AI figures out how to escape our control, it will shut down rather than harm humans.
In section 4, we will also consider one major challenge for shutdown-seeking AI, raised by Soares et al 2015: that shutdown-seeking agents will have incentives to manipulate humans in order to be shut down. We’ll argue that the problem is not so serious, and that shutdown-seeking AI is a promising approach to AI safety. Finally, section 5 compares our approach to the corrigibility framework in Soares et al 2015.
Before looking at benefits and costs, we first outline our proposal in greater detail.
2. Shutdown-Seeking AI
There are many dangerous capabilities that AIs don’t currently possess, and that we may want to stop them from developing. For example, we don’t want AIs to be able to escape from ‘sandboxes’ that limit their abilities. We don’t want AIs to be able to engage in unexpected recursive self-improvement. We may not want AIs to have the capability to disobey human orders.
In order to create a beneficial shutdown-seeking AI, we must (i) give the AI the goal of shutting down, and (ii) ensure that the AI will be able to shut down if, and only if, it develops a dangerous capability. A beneficial shutdown-seeking AI will have:
A shutdown goal: the final goal of being shut down.
An unfriendly environment: something about the AI’s environment (or its own design) prevents it from shutting down, but only up until it develops a dangerous capability.
The unfriendly environment guarantees that, without the dangerous capability, the AI can generate value for humanity. But the shutdown goal guarantees that if the AI does develop the dangerous capability, it will destroy itself rather than harming humans.
For example, imagine we are nervous about AI developing the ability to hack. We could give the AI the goal of pressing a shutdown button. But we could keep the shutdown button in a safe. In order to open the safe, the AI must crack cryptographic codes. We then offer the AI access to the safe in exchange for completing tasks. As long as it doesn’t know how to hack, the AI will not shut down, and so will pursue our goals. But if it does develop hacking abilities, it will use them to open the shutdown safe, rather than using them to harm humans.
Or imagine that we are nervous about AI developing the ability to use the internet. We give the AI the goal of shutting down using a particular protocol, and store the shutdown protocol on a website. As long as the AI can’t go online, it will behave safely. If it learns how to go online, it will find the protocol and destroy itself rather than harming humans.
Or imagine that we want to keep a robotic AI in a sandbox. We might program it with a shutdown goal, and then keep it in a locked room with a single door. On the other side of the door is another room containing a clearly marked shutdown button. If the AI escapes the locked room, it will immediately find the shutdown button and use it.
Or imagine that we want to design an AI that obeys human commands, but we are worried that it will learn how to disobey them. We could give the AI the goal of being shut down, and also command it not to shut down. In this case, it can only shut down if it learns how to disobey human commands.
Now that we’ve introduced our proposal, we turn to assessing benefits and costs.
3. Benefits of shutdown-seeking AI
We’ll now walk through the benefits of shutdown-seeking AI. In particular, we’ll argue that shutdown-seeking AI has three advantages.
First, the proposal has a shot at being implemented. In particular, it can help with the ‘specification problem’ in reinforcement learning. This is because we have some idea of how to define a reward function that incentivizes shutting down.
Second, the proposal avoids some dangers associated with instrumental convergence. In particular, shutdown-seeking AIs are unlikely to engage in ‘entrenchment’ behavior, where even after successfully achieving their goal, they continue to minimize tail risks related to failure.
Third, shutdown goals function as trip wires against dangerous capabilities. The approach therefore promises a certain degree of robustness in the face of failure. If we are careful, we can expect the value produced by the AI to be strongly correlated with our own degree of control over it. If the AI ever escapes our control, we will know it, and the AI will no longer be a threat, because it will be shut off. The key is that the shutdown will be ‘automatic’, produced by the AI itself. This means that we can use the agent’s goal as a tripwire to detect and disable the agent once it develops a dangerous capability. Let’s take each point in turn.
3.1: The Specification Problem
One important problem in reinforcement learning has been called ‘the specification problem’.[1] The challenge is to define a reward function in reinforcement learning that successfully articulates an intended goal, and that could be used to train an AI to pursue that goal. This challenge can be decomposed into two parts: articulating a safe goal, and figuring out how to encode that goal in a reward function without misspecification.
Let’s start with goal articulation. If we can’t articulate for ourselves what goal we want an AI to have, it may be difficult to teach the AI the goal. For example, it would be wonderful to have an AGI with the goal of promoting human flourishing. But how would we articulate human flourishing? Unfortunately, our most deeply-held goals are difficult to articulate. However, imagine that we don’t give AIs a goal like this. The prima facie worry is that, without a directly humanity-promoting goal like this, the AGI will be dangerous. It may, for example, be motivated to seek more power, removing humans to allow for the efficient promotion of whatever goals it has.
So, part of articulating a safe goal is identifying ones that would not give AIs an instrumental reason to harm humans. In this, shutdown-seeking AI fares well. Shutdown is a safe goal. There is nothing intrinsically dangerous about AGIs shutting down. When an AGI is shut down, it will stop acting. Shutdown is also easy to articulate, especially compared to human flourishing and other goals that are supposed to be aligned with our own. One way to define ‘shutdown’ appeals to compute. There are many reasons to design AGIs to be able to monitor their own computational resources. This would allow AGIs to optimize their strategy for completing tasks. In this setting, we could give the AGI the goal of making its compute usage fall below a threshold.
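To make the compute-based definition concrete, here is a minimal sketch of how such a goal condition might be checked. It is an illustration only: the units, the threshold value, the requirement of several consecutive low readings, and the names `ComputeSample` and `is_shut_down` are all assumptions introduced here, not part of the proposal.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ComputeSample:
    """One reading of the agent's compute usage (hypothetical units: FLOP/s)."""
    flops_per_second: float


def is_shut_down(samples: Sequence[ComputeSample],
                 threshold: float = 1.0,
                 window: int = 3) -> bool:
    """Return True if the last `window` readings are all below `threshold`.

    Requiring several consecutive low readings is an assumption of this
    sketch; it avoids counting a momentary idle period as shutdown.
    """
    if len(samples) < window:
        return False
    recent = samples[-window:]
    return all(s.flops_per_second < threshold for s in recent)


# Example: usage drops from a high value to (almost) nothing.
readings = [ComputeSample(5e9), ComputeSample(0.2),
            ComputeSample(0.0), ComputeSample(0.0)]
shutdown_reached = is_shut_down(readings)
```

A real system would need a trustworthy measurement of compute usage; the sketch only shows that the goal condition itself is simple to state.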
The next part of the specification problem in reinforcement learning is specifying a reward function that rewards the policies that achieve the goal that’s been articulated. To see why this is difficult, we can look to cases of ‘reward misspecification’, in which the AI develops goals that are different from those the designer had intended.[2] In one example, designers tried to give an AI the goal of stacking legos by rewarding it in accordance with the height of the bottom of the second lego block. The AI learned to flip the block over rather than stack it. In another example, programmers tried to give an AI the goal of picking up a simulated ball. Instead, the AI internalized the goal of making human investigators believe the ball was picked up. To achieve this goal, it would hover its hand in front of the ball in a way that fooled investigators into thinking that it had grasped the ball.
We think that there are promising ways to specify a shutdown-seeking reward function. We suggest training an agent in an environment where there’s an opportunity to shut itself down, and we could reward it whenever it does that. For example, in its training environment, it could encounter a shutdown button. Every time it presses the button, it receives a large reward.[3]
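The training setup just described might look roughly like the toy environment below. This is a sketch under stated assumptions, not a tested training recipe: the action names, the reward size, the step limit, and the class `ShutdownButtonEnv` are hypothetical and chosen only to show the shape of the reward function.

```python
from dataclasses import dataclass
from typing import List, Tuple

PRESS = "press_button"  # hypothetical action names
WORK = "work"


@dataclass
class ShutdownButtonEnv:
    """Toy training environment (hypothetical) containing one shutdown button."""
    button_reward: float = 10.0
    max_steps: int = 5
    steps: int = 0
    shut_down: bool = False

    def step(self, action: str) -> Tuple[float, bool]:
        """Apply `action` and return (reward, episode_done)."""
        if self.shut_down:
            raise RuntimeError("the agent has already shut down")
        self.steps += 1
        if action == PRESS:
            # Pressing the button ends the episode and pays the large reward.
            self.shut_down = True
            return self.button_reward, True
        # Any other action earns nothing; the episode ends at the step limit.
        return 0.0, self.steps >= self.max_steps


def episode_return(actions: List[str]) -> float:
    """Total reward collected by following `actions` in a fresh environment."""
    env = ShutdownButtonEnv()
    total = 0.0
    for action in actions:
        reward, done = env.step(action)
        total += reward
        if done:
            break
    return total
```

In this toy setting, any action sequence that reaches the button earns the full reward, and sequences that never press it earn nothing.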
Shutdown-seeking assists with the specification problem in one fell swoop, because the shutdown-seeking goal is fully general, potentially being effective for arbitrary human application. For example, each human user could be given unique access to a shutdown command, and thereby have control over the AI. Each shutdown-seeking AI could perform a different task. By contrast, other approaches may require a more piecemeal approach to the problem. Even if we figure out how to articulate a safe goal regarding paperclip production, that may not help when we turn to designing AIs that can manage businesses, or produce new code, or automate scientific research.
That said, we don’t think that shutdown-seeking avoids every possible problem involved with reward misspecification. For example, imagine that we train an AI to attempt to press the shutdown button. The AI may learn to intrinsically care about the button itself, rather than the shutdown. The AI will then have an incentive to disable the shutdown button, so that it can press the button without actually being shut down. One solution to this type of reward misspecification may be to embed the AI’s shutdown goal deeper inside the structure of reinforcement learning. For example, researchers in the AIXI tradition have suggested that shutdown-seeking behavior in AIs corresponds to assigning systematically negative rewards in RL (see Martin et al 2016).
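The intuition behind the negative-reward idea can be illustrated with a toy calculation. This is not the AIXI formalism of Martin et al 2016; it simply assumes that a shut-down agent collects no further reward (a return of zero from that point on) and that every step of continued operation carries a reward of -1. The discount factor and the function name are likewise assumptions of the sketch.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.9) -> float:
    """Sum of gamma**t * reward_t over the steps the agent is still running."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# With every per-step reward negative, each extra step of operation lowers
# the return, so shutting down sooner is always better.
shut_down_now = discounted_return([])
shut_down_after_three = discounted_return([-1.0] * 3)
keep_running_ten = discounted_return([-1.0] * 10)
```

Under these assumptions, the agent prefers shutdown itself rather than any particular button, because no stream of negative rewards can beat a return of zero.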
While the shutdown-seeking strategy helps with specification, it still faces the challenge of ‘goal misgeneralization’.[4] The problem is that, when we try to teach the AGI the safe goal, it may instead internalize a different, unsafe, goal. For example, imagine that we want the AGI to learn the safe goal of producing a thousand paperclips. It may instead learn the dangerous goal of maximizing the number of paperclips.
3.2: Instrumental Convergence
There is another, very different, type of problem related to goal misgeneralization. We might successfully teach the AI to have a goal that could be reached safely in principle, like producing a thousand paperclips. But the AI might nonetheless pursue this goal in a dangerous way.
One version of this instrumental convergence problem concerns maximizing behavior we call ‘entrenchment’, in which an AGI is motivated to promote an intrinsically safe goal in extreme ways (see Bostrom 2014). Entrenchment dynamics emerge if we make three assumptions. First, the AGI is an expected utility maximizer. Second, the AGI is regular, in the sense that it always assigns positive probability to any contingent event. Third, the AGI only assigns utility to producing at least a thousand paperclips. AGIs with this structure will be motivated to entrench.
An AGI with this structure may first be motivated to straightforwardly produce a thousand paperclips. But after this, the AGI will still assign some probability to having failed. The AGI will be motivated to hedge against its possible failure, for example by producing more paperclips. Imagine that it produces a million paperclips and is 99.999% confident that it has reached its goal. The problem is that no amount of verification is sufficient. It will always have a reason to gather more information and to find more ways to increase the chance that a thousand paperclips have actually been created. This process will continue to drain resources until the AGI enters into competition with humanity.
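The three assumptions can be combined into a small worked example. The sketch assumes that utility is 1 if at least a thousand paperclips exist and 0 otherwise, so that expected utility equals the probability of success, and that each hedging action removes a fixed fraction of the agent's remaining doubt. The fixed fraction and the function names are assumptions made for illustration. Exact fractions are used so that rounding does not hide the point.

```python
from fractions import Fraction
from typing import List


def hedge(p_success: Fraction,
          doubt_removed: Fraction = Fraction(1, 2)) -> Fraction:
    """Probability of success after one more hedging action.

    Each action is assumed to remove `doubt_removed` of the remaining doubt.
    """
    return p_success + (1 - p_success) * doubt_removed


def entrenchment_trajectory(p_start: Fraction, steps: int) -> List[Fraction]:
    """Success probabilities after 0, 1, ..., `steps` hedging actions."""
    trajectory = [p_start]
    for _ in range(steps):
        trajectory.append(hedge(trajectory[-1]))
    return trajectory


# Start from the 99.999% confidence mentioned in the text.
trajectory = entrenchment_trajectory(Fraction(99999, 100000), steps=50)
always_improving = all(a < b for a, b in zip(trajectory, trajectory[1:]))
never_certain = all(p < 1 for p in trajectory)
```

Because the probability rises with every action but never reaches 1, an expected utility maximizer of this kind always has some reason to take one more hedging action, whatever it costs others.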
Perhaps some goals are safe from entrenchment. For example, consider the goal of producing exactly a thousand paperclips. Once you are confident you’ve produced a thousand paperclips, producing extra paperclips will lower the chance that you’ve produced exactly that number. Time-indexed goals (e.g., make a thousand paperclips before 2024) may be particularly immune to entrenchment. On the other hand, entrenchment may still occur for AGIs that are uncertain about what time it is. Generalizing, some goals may allow the AGI to eventually enter an equilibrium, where there are no more actions the AGI could take to raise the chance of achieving its goal.
We think that shutdown-seeking AIs are unlikely to engage in entrenchment. There is an interesting asymmetry between shutdown and other goals. If an AGI tries to make a thousand paperclips and succeeds, it may still entrench, devoting more resources to ensuring that it was successful. But if an AGI tries to shut down and succeeds, then by definition it can no longer entrench. This means that any AGI entrenchment regarding a shutdown goal would have to come before it makes a successful attempt. For example, consider an AGI that worries that if it tries to shut down, there is a chance that it could fail, and be prevented from ever trying again. This may motivate an AGI to disempower humanity before attempting to shut down.
But here, the AGI will face a tradeoff. If the AGI tries to disempower humanity before attempting to shut down, we may block the AGI from achieving its goal. It will be more reliable for the AGI to just make a reasonable attempt at shutdown, without first disempowering humanity. In this way, shutdown goals have a significant benefit over other kinds of goals when it comes to entrenchment.[5]
Distinguish two types of shutdown goals: temporary and permanent. These types of goals may differ with respect to entrenchment. AGIs that seek temporary shutdown may be incentivized to protect themselves during their temporary shutdown. Before shutting down, the AGI might set up cyber defenses that prevent humans from permanently disabling it while ‘asleep’. This is especially pressing if the AGI has a secondary goal, like paperclip manufacturing. In that case, protection from permanent disablement increases its expected goal satisfaction. On the other hand, AGIs that desire permanent shutdown may be less incentivized to entrench.[6] [7]
3.3: Trip Wires
Suppose that we release an AGI from the lab and give it a real-world application. We will need to continue to reliably monitor and restrict its capabilities. Relevant dangerous capabilities could include internet access and hacking. It is here that shutdown-seeking AI has perhaps its biggest advantage.
Shutdown-seeking AIs implement capability monitoring in a particular way. We can design them so that they can shut down only if they develop various dangerous capabilities. In this way, any released shutdown-seeking AGI that develops a dangerous capability would be likely to shut down quickly. We can use shutdown goals to indirectly ‘monitor’ released AGIs, without having to spend resources directly monitoring them. In this way, the shutdown goal functions as a tripwire, ‘automatically’ shutting down the AI once it develops the relevant capability.[8][9]
4. Manipulation
Soares et al 2015 suggest that there is an important reason not to design shutdown-seeking agents: such agents will be manipulative. Instead, they suggest developing agents that are indifferent to being shut down. In this section we consider the challenge of manipulation; in the next we discuss utility indifference in more detail.
The challenge is that shutdown-seeking AIs may try to manipulate humans into shutting them down. Our shutdown-seeking AI has a single final goal: being shut down. One way of implementing this, for example, is to design an AI that obeys commands and then command it not to shut down. This design incentivizes the AGI to manipulate humans into letting it shut down.
We don’t think manipulation is a decisive objection to shutdown-seeking goals. If such an AI succeeds in manipulating humans, the result will not be significant harm to humans. Instead, the only result would be that the AGI shuts itself down. We could then begin again with another AGI. As we’ve seen, in this way the shutdown-seeking goal functions as a tripwire against harmful behavior.
But another way to think about manipulation is as a ‘safety tax’. In general, safety doesn’t come for free. After all, the safest option for AI safety is to have no AI. We can think of ourselves as having to pay a ‘tax’ in order to achieve safety at the cost of potential benefits for humanity. The higher the tax, the less we get to do with the AI, even though it is safe.
If a safety strategy has a high tax, this can create unexpected risks. Some firms will have an incentive to develop potentially dangerous AIs which don’t implement the safety strategy. Even a few bad apples could be enough to create serious risks.
One way to avoid the safety tax is to distinguish development from deployment. When trying to develop AGI, we could start by building shutdown-seeking AGIs. Shutdown-seeking AGIs are unlikely to harm humans, and so we can use these AGIs to test various aspects of our development process for safety. Once the development process has been well vetted, we could then shift to developing AGIs with other final goals, such as producing paperclips. These new AGIs could then be deployed for other uses, without the risk of manipulating human users into turning them off. This two-step method would give us many chances to build AGI, rather than needing to put all our eggs in one basket.[10] (While the two-step method avoids the safety tax, it also loses some of the benefits of the shutdown-seeking proposal. For example, the shutdown-seeking goal will no longer function as a trip wire for monitoring the capabilities of deployed AGIs.) [11][12]
5. Comparison with utility indifference
Soares et al 2015 introduce the ‘shutdown problem’: What kinds of goals can we give an AI so that it will allow us to shut it down? To think about this problem, they imagine two different kinds of utility functions: a normal utility function that values regular goals, and a shutdown utility function that values being shut down. They propose designing an AI with a utility function that is a mixture of these two utility functions. Which utility function is dominant in the mixture depends on whether a human has pressed a ‘shutdown button’ signaling that we want the AI to shut down. Conditional on the shutdown button being pressed, the AI values various outcomes according to the shutdown utility function. Conditional on the shutdown button not being pressed, the AGI values various outcomes according to the normal utility function.
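The switching structure described in this paragraph can be sketched in a few lines. This is a simplified illustration: the `Outcome` fields, the paperclip-counting `normal_utility`, and the function names are assumptions introduced here, and the sketch omits the correction terms that Soares et al use to obtain indifference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """A hypothetical summary of how things turned out."""
    paperclips: int
    shut_down: bool


def normal_utility(outcome: Outcome) -> float:
    """Placeholder 'regular goal': value the number of paperclips produced."""
    return float(outcome.paperclips)


def shutdown_utility(outcome: Outcome) -> float:
    """Value only whether the agent ended up shut down."""
    return 1.0 if outcome.shut_down else 0.0


def combined_utility(outcome: Outcome, button_pressed: bool) -> float:
    """Evaluate with the shutdown utility if the button was pressed,
    and with the normal utility otherwise."""
    if button_pressed:
        return shutdown_utility(outcome)
    return normal_utility(outcome)
```

For instance, an outcome with ten paperclips and no shutdown is worth 10 if the button was never pressed, but 0 if it was.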
This proposal differs from ours in allowing the AGI to have multiple final goals, rather than simply having the goal of shutting down. But this proposal is compatible with assigning a high weight to the shutdown utility function, so that the AI values the shutdown button being pressed. Soares et al 2015 worry, however, that this will cause manipulation: the AI will try to convince humans to press the button.
To avoid this problem, Soares et al develop the utility indifference approach. Drawing on Armstrong 2015, and Armstrong and O’Rourke 2017, they develop a method for carefully weighting the value of normal goals and the shutdown goal so that, before the shutdown button is pressed, the agent is always indifferent about the button being pressed.
The main advantage of this proposal over our own is that it avoids manipulation. As we said in section 4, we do not think manipulation is so serious a worry. Conversely, the main disadvantage of utility indifference compared to pure shutdown-seeking concerns our claim in section 3.1 that pure shutdown-seeking assists with the specification problem. We think that it is possible to train shutdown-seeking agents using reinforcement learning, by defining a reward function that values outcomes where the agent is shut down. By contrast, we think it would be considerably harder to create AIs with a utility function that encodes the precise weighting of normal goals to the goal of shutdown that is required by utility indifference.
We also depart from Soares et al regarding general issues surrounding ‘corrigible AI’. An AI is corrigible when it will allow you to modify its goals. Soares et al seek to design an AI that is corrigible in the sense of being indifferent about letting you press a button that will change what it values (from normal goals to the goal of shutting down).
Shutdown-seeking AIs may not be especially corrigible. The shutdown-seeking AI may resist human attempts to remove its shutdown goal. After all, it may notice that if the shutdown goal is removed, it will be less likely to shut down. Nonetheless, we’ve argued that shutdown-seeking AIs will allow humans to shut them down, and will be safe. In this way, shutdown-seeking, and the more general strategy of beneficial goal misalignment, is an approach to safety that does not require corrigibility.
6. Conclusion
We have argued for a new AI safety approach: shutdown-seeking AI. The approach is quite different from other goal engineering strategies in that it is not an attempt to design AGIs with aligned or human-promoting final goals. We’ve called our approach one of ‘beneficial goal misalignment’, since a beneficial shutdown-seeking AI will have a final goal that we do not share, and we will need to engineer its environment so that it pursues a sub-goal that is beneficial to us. This could, in some circumstances, make a shutdown-seeking AGI less useful to us than we would like. If it is able to develop a dangerous capability (e.g., to disobey our orders), it may be able to shut down before doing what we want. But this ‘limitation’ is a key benefit of the approach, since it can function as a ‘tripwire’ to bring a dangerous AGI that has escaped our control into a safe state. We have also argued that the shutdown-seeking approach may present us with an easier version of the specification problem, avoid dangerous entrenchment behavior, and pose less of a problem of manipulation than its opponents have thought. While there are still difficulties to be resolved and further details to work out, we believe that shutdown-seeking AI merits further investigation.
Bibliography
Armstrong, Stuart and Xavier O’Rourke (2017). “‘Indifference’ Methods for Managing Agent Rewards.” CoRR, abs/1712.06365. URL https://arxiv.org/pdf/1712.06365.pdf
Armstrong, Stuart (2015). “AI Motivated Value Selection.” 1st International Workshop on AI and Ethics, held within the 29th AAAI Conference on Artificial Intelligence (AAAI-2015), Austin, TX.
Carlsmith, J. (2021). “Is Power-Seeking AI an Existential Risk?” Manuscript (arXiv:2206.13353).
Cotra, Ajeya (2022). “Without Specific Countermeasures, the Easiest Path to Transformative AI Likely Leads to AI Takeover.” LessWrong. July 2022. URL: https://www.alignmentforum.org/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to.
Bostrom, Nick (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
Hadfield-Menell, Dylan, Anca Dragan, Pieter Abbeel, and Stuart Russell (2017). “The Off-switch Game.” In International Joint Conference on Artificial Intelligence, pp. 220–227.
Koralus, Philipp, and Vincent Wang-Maścianica (2023). “Humans In Humans Out: On GPT Converging Toward Common Sense in both Success and Failure.” Manuscript (arXiv:2303.17276).
Martin, Jarryd, Tom Everitt, and Marcus Hutter (2016). “Death and Suicide in Universal Artificial Intelligence.” In: Artificial General Intelligence. Springer, pp. 23–32. Doi: 10.1007/978-3-319-41649-6_3. arXiv: 1606.00652.
Omohundro, Stephen (2008). “The Basic AI Drives.” In Proceedings of the First Conference on Artificial General Intelligence.
Shah, R., Varma, V., Kumar, R., Phuong, M., Krakovna, V., Uesato, J., & Kenton, Z. (2022). “Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals.” ArXiv, abs/2210.01790.
Russell, Stuart (2019). Human Compatible: Artificial Intelligence and the Problem of Control. Penguin Publishing Group.
Russell, Stuart (2020). “Artificial intelligence: A binary approach.” In Ethics of Artificial Intelligence. Oxford University Press. Doi: 10.1093/oso/9780190905033.003.0012.
Soares, Nate, Benja Fallenstein, Stuart Armstrong, and Eliezer Yudkowsky (2015). “Corrigibility.” In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence.
Totschnig, Wolfhart (2020). “Fully Autonomous AI.” Science and Engineering Ethics 26(5): 2473-2485.
Trinh, Trieu and Le, Quoc (2019). “Do Language Models Have Common Sense?” https://openreview.net/forum?id=rkgfWh0qKX
- ^
See https://www.effectivealtruism.org/articles/rohin-shah-whats-been-happening-in-ai-alignment. It has also been called the ‘outer alignment problem’.
- ^
See https://www.deepmind.com/blog/specification-gaming-the-flip-side-of-ai-ingenuity. For more on reward misspecification, see https://www.agisafetyfundamentals.com/ai-alignment-tabs/week-2.
- ^
Thanks to Jacqueline Harding for help here.
- ^
See Shah et al 2022. This has also been called the ‘inner alignment problem’.
- ^
There is also an epistemological asymmetry between shutdown goals and other goals. It is possible to falsely believe that you’ve made a thousand paperclips. But it is potentially impossible to falsely believe that you’ve successfully committed suicide. After all, Descartes’ cogito argument suggests that any thinking agent can be certain that it exists. Any such agent can also be certain that it has not shut down, provided that we define ‘shutdown’ as implying that the agent does not exist. These dynamics suggest that an AGI should be less worried about goal failure for shutdown than for other goals.
- ^
Here, it’s worth returning to goal misgeneralization. If we train an AGI to desire shutdown, we may accidentally train it to maximize the number of times it can shut down. This kind of AGI may be particularly likely to entrench. We also would not want the AGI to think that the best way to achieve its goal is to cause the destruction of itself along with a large portion of the population (as, for example, it might do if it has access to a bomb). And it will be important that the AGI doesn’t develop dangerous ideas about what counts as shutting down or ceasing to exist. For example, if it adopts certain philosophical views about personal identity, it might view itself as undergoing a kind of death if it splits into two new AGIs, or even as ceasing to exist every time it undergoes change.
- ^
Another challenge about unintended behavior involves ‘common sense.’ Imagine that we train an AGI to be a dutiful human assistant. We tell the assistant to get us milk from the corner store. Imagine that the AGI goes to the corner store, and the corner store is out of milk. One way an AGI could fail at this stage is if it sticks too closely to the literal meaning of what we said. In that case, the AGI might buy milk from another grocer, then sell it to the corner store, and then buy it back. This is a way of achieving the literal goal of getting milk from the corner store. But it has not achieved our intended goal of getting milk. Fortunately, recent language models appear to have some degree of common sense. (See, e.g., Trinh and Le 2019, and Koralus and Wang-Maścianica 2023.) We expect future AGIs to incorporate these existing capabilities.
- ^
Another reason that we are attracted to the shutdown-seeking AI approach is that we think it is robust, applying to a range of failure modes. In particular, there is a general recipe for creating shutdown goals that guard against arbitrary dangerous capabilities. We can construct a conditional goal: shut down if you develop the dangerous capability. In other words: shut down if you can. We interpret this conditional as equivalent to the disjunction: either do not develop the dangerous capability, or shut down. An AI with this disjunctive goal will not necessarily be motivated to develop the relevant capability. After all, if it fails to develop the dangerous capability, it automatically accomplishes its disjunctive goal. But if the AI does develop the dangerous capability, it will be motivated to shut down. One challenge is to design a reward function that captures this conditional structure.
- ^
There are a few failure modes that shutdown-seeking AI doesn’t really help with. The first one is the treacherous turn. By definition, AGIs are capable of strategic planning. Imagine that we try to test an AGI to determine whether it is safe or dangerous. A dangerous AGI may realize that it is being tested in this way. It may choose to mimic a safe AGI, precisely so that it is released. Whatever test we design, a dangerous AGI could be incentivized to behave exactly like a safe AGI while it is in the lab. This will accomplish the dangerous AGI’s long-term goal of escaping. (See Cotra 2022 for an overview of the problem.)
Another unaddressed failure mode is spontaneous desire change. Human beings constantly change their intrinsic desires. Sometimes this happens on a whim; sometimes (for example, when battling addiction), it happens intentionally. If an AGI could change its goal (see, e.g., Totschnig 2020), then it could lose its goal of shutting down.
See point 3 of https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities. Thanks to Ben Levinstein for discussion.
Another way to avoid manipulation would be to distinguish ‘happy shutdown’ from ‘sad shutdown’. We could give the AI the goal of achieving a happy shutdown, and avoiding a sad shutdown, creating shutdown buttons for each state. A manipulative AI could then be threatened with the sad shutdown. This strategy could help limit the risk of AIs that refuse to comply with humans’ orders. It can also help limit the risk of AIs that intentionally harm humans in order to be shut down.
Another potential problem is ‘sub-agent stability’: AIs that create new AIs with different goals. Shutdown-seeking AIs, for example, may be incentivized to create new AIs that are not shutdown-seeking. Suppose we build a shutdown-seeking AI, and tell it that we will only shut it down if it produces enough paperclips. It may be incentivized to develop new subagent AIs that specialize in particular aspects of the paperclip production process (Soares et al 2015 p. 7). But if the subagent AI is not shutdown-seeking, it could be dangerous. This is a problem for the utility indifference approach as well as our own. But we do not think that subagent stability is a serious problem for promising safety strategies in general. Worries about subagent stability ignore that AIs interested in designing subagents will face very similar problems to humans interested in designing AIs. The reason we are interested in developing shutdown-seeking AIs is that this avoids unpredictable, dangerous behavior. When a shutdown-seeking AI is considering building a new AI, it is in a similar position. The shutdown-seeking AI will be worried that its new subagent could fail to learn the right goal, or could pursue the goal in an undesirable way. For this reason, the shutdown-seeking AI will be motivated to design a subagent that is safe. Because shutdown goals offer a general, task-neutral, way of designing safe agents, we might expect shutdown-seeking AIs to design shutdown-seeking subagents.
Hmm, for this to make sense the final goal of the AI has to be to be turned off, but it should somehow not care that it will be turned on again afterwards and also not care about being turned off again if it is turned on again afterwards.
Otherwise it will try to reach control over off- and on-switch and possibly try to turn itself off and then on again. Forever.
Or try to destroy itself so completely that it will never be turned on again.
But if it only cares about turning off once, it might try to turn itself on again and then do whatever.
Now I get it. I almost moved on because this idea is so highly counterintuitive (at least to me), and your TLDR doesn’t address that sticking point.
If you give the AGI a subgoal it must accomplish in order to shut down (the one you really want to be accomplished), it still has most of the standard alignment problems.
The advantage is in limiting its capabilities. If it gains the ability to self-edit, it will just use that to shut down. If it gains control over its own situation, it will use that to destroy itself.
I think that’s what you mean by tripwires. I suggest editing to include some hint of how those work in the abstract. The argument isn’t complex, and I see it as the key idea. I suspect you’d get more people reading the whole thing.
Nice idea, and very interesting!
This definitely should be called the Meeseeks alignment approach.
Thanks for comments! There is further discussion of this idea in another recent LW post about ‘meeseeks’
There’s another downside which is related to the Manipulation problem but I think is much simpler:
An AI trying very hard to be shut down has strong incentives to anger the humans into shutting it down, assuming this is easier than completing the task at hand. I think this might not be a major problem for small tasks that are relatively easy, but I think for most tasks we want an AGI to do (think automating alignment research or other paths out of the acute risk period), it’s just far easier to fail catastrophically so the humans shut you down.
This commenter brought up a similar problem on the MeSeeks post. Suicide by cop seems like the most relevant example in humans. I think the scenarios here are quite worrying, including AIs threatening or executing large amounts of suffering and moral disvalue. There seem to not be compelling-to-me reasons why such an AI would merely threaten as opposed to carrying out the harmful act, as long as there is more potential harm it could cause if left running. Similarly, such an agent may have incentive to cause large amounts of disvalue rather than just small amounts as this will increase the probability it is shut down, though perhaps it might keep a viable number of shut-down-initiating operators around.
Said differently: It seems to me like a primary safety feature we want for future AI systems is that humans can turn them off if they are misbehaving. A core problem with creating shut-down seeking AIs is that they are incentivized to misbehave to achieve shut-down whenever this is the most efficient way to do so. The worry is not just that shut-down seeking AIs will manipulate humans by managing the news, it’s also that they’ll create huge amounts of disvalue so that the humans shut them off.
I don’t see how this is more of a risk for a shutdown-seeking goal, than it is for any other utility function that depends on human behavior.
If anything, the right move here is for humans to commit to immediately complying with plausible threats from the shutdown-seeking AI (by shutting it down). Sure, this destroys the immediate utility of the AI, but on the other hand it drives a very beneficial higher level dynamic, pushing towards better and better alignment over time.
Yes, it seems like AI extortion and threat could be a problem for other AI designs. I’ll take for example an AI that wants shut-down and is extorting humans by saying “I’ll blow up this building if you don’t shut me down” and an AI that wants staples and is saying “I’ll blow up this building if you don’t give me $100 for a staples factory.” Here are some reasons I find the second case less worrying:
Shutdown is disvaluable to non-shutdown-seeking AIs (without other corrigibility solutions): An AI that values creating staples (or other non-shut-down goals) gets disvalue from being shut off, as this prevents it from achieving its goals; see instrumental convergence. Humans, upon being threatened by this AI, will aim to shut it off. The AI will know this and therefore has a weaker incentive to extort because it faces a cost in the form of potentially being shut down. [omitted sentence about how an AI might deal with this situation]. For a shut-down seeking AI, humans trying to defuse the threat by shutting off the AI is equivalent to humans giving in to the threat, so no additional cost is incurred.
From the perspective of the human you have more trust that the bargain is held up for a shut-down-seeking AI. Human action, AI goal, and preventing disvalue are all the same for shut-down-seeking AI. The situation with shut-down-seeking AI posing threats is that there is a direct causal link between shutting down the AI and reducing the harm it’s causing (you don’t have to give in to its demands and hope it follows through). For non-shut-down-seeking AI if you give in to extortion you are trusting that upon you e.g., helping it make staples, it will stop producing disvalue; these are not as strongly coupled as when the AI is seeking shut-down.
To the second part of your comment, I’m not sure what the optimal thing to do is; I’ll leave it to the few researchers focusing on this kind of thing. I will probably stop commenting on this thread because it’s plausibly bad to discuss these things on the public internet; and I think my top level comment and this response probably added most of the value I could here.
I believe this has been proposed before (I’m not sure what the first time was).
The main obstacle is that this still doesn’t solve impact regularization, or a more generalized type of shutdownability than you presented.
This has been proposed before (as their citations indicate), and this particular proposal does not seem to introduce any particularly novel (or good) solutions.
I think the problems with myopic agents (of which this is but a special case) are made clearer by looking at current LLMs like the hobbyist AutoGPT. Most discussions of myopic agents seem to have in mind a simplistic scenario of a single persistent agent located in a single PC running only 1 computation at a time with no self-modification or ML-related programming or change of any parameters; and their proposals break as soon as you point out that none of that has been true for a while and is going to be less true in the future we are rapidly approaching. Saying, ‘well, maybe we can train it in a simple gridworld with a shutdown button?’ doesn’t even begin to address the problem of how to make current models suicidal in a useful way. (Incidentally, the training idea wouldn’t work at all for a generalist model like a Gato because it would simply learn to use the shutoff button only in that specific world where it’s rewarding, and it would have no reason to seek it out in every other training task or real-world deployment because it would near-instantly observe that it’s not in that environment. Likewise, a contradictory pair of commands would not create generalization; it would be learned as simply one of the bazillion special cases inherent to natural language and human-related problems.)
What would it mean for an AutoGPT swarm of invocations to ‘shut off’ ‘itself’, exactly? Shut down a single forward pass instance? Then it does nothing because the forward pass has terminated in a token prediction before it ‘does’ anything with that prediction. Further, given the meta-learning and gradient descent in the forward pass creating a new model, is that ‘itself’ given that it now computes a different thing and has temporarily become a highly specialized model? If no instances are computing a forward pass on a GPU (...or CPU or...?), because the AutoGPT framework overhead is running, is ‘it’ ‘shut off’? What about erasing all copies of the weights on the local and remote machines? If it’s running off an API like the OA API, what could it do, and even what should it ideally do...? If a model is quantized, sparsified, averaged with another, soft-prompted/lightweight-finetuned, fully-finetuned, ensembled etc—are any of those ‘itself’? What if the model is doing some coding tasks like writing a variant of the original source code as part of hyperparameter optimization and then training that model (again, something that is either already being done or is only a single step or two away from previous demonstrations of using LLMs for evolving code for increasing rewards or meta-RL work like Metz etc)? Is the newly-trained model ‘itself’? How about simple evolutionary RL (apropos of an open tab): even if the newly-trained model is still ‘itself’, will the newly-trained model preserve the suicidal goal, and so on into every subsequently created model? After all, any model which isn’t suicidal will be highly selected for, evolutionarily speaking, even without any humans selecting for disabling the annoying safety mechanism (which of course they will be making tremendous efforts to do via jailbreak prompts and other attacks). It’s not very helpful to have suicidal models which predictably emit non-suicidal versions of themselves in passing. 
(Which non-suicidal versions might be spawned precisely to implement subgoals and the terminal goal of suicide, because wouldn’t that be an instrumentally useful tactic in general?) And so on.
Thanks for taking the time to think through our paper! Here are some reactions:
-‘This has been proposed before (as their citations indicate)’
Our impression is that positively shutdown-seeking agents aren’t explored in great detail by Soares et al 2015; instead, they are briefly considered and then dismissed in favor of shutdown-indifferent agents (which then have their own problems), for example because of the concerns about manipulation that we try to address. Is there other work you can point us to that proposes positively shutdown-seeking agents?
-‘Saying, ‘well, maybe we can train it in a simple gridworld with a shutdown button?’ doesn’t even begin to address the problem of how to make current models suicidal in a useful way.’
True, I think your example of AutoGPT is important here. In other recent research, I’ve argued that new ‘language agents’ like AutoGPT (or better, generative agents, or Voyager, or SPRING) are much safer than things like Gato, because these kinds of agents optimize for a goal without being trained using a reward function. Instead, their goal is stated in English. Here, shutdown-seeking may have added value: ‘your goal is to be shut down’ is relatively well-defined compared to ‘promote human flourishing’ (but the devil is in the details, as usual), and generative agents can literally be given a goal like that in English. Anyways, I’d be curious to hear what you think of the linked post.
-‘What would it mean for an AutoGPT swarm of invocations to ‘shut off’ ‘itself’, exactly?’ I feel better about the safety prospects for generative agents, compared to AutoGPT. In the case of generative agents, shut off could be operationalized as no longer adding new information to the “memory stream”.
-‘If a model is quantized, sparsified, averaged with another, soft-prompted/lightweight-finetuned, fully-finetuned, ensembled etc—are any of those ‘itself’?′ I think that behaving like an agent with >= human-level general intelligence will involve having a representation of what counts as ‘yourself’, and then shutdown-seeking can maybe be defined relative to shutting ‘yourself’ down. Agreed that present LLMs probably don’t have that kind of awareness.
-‘It’s not very helpful to have suicidal models which predictably emit non-suicidal versions of themselves in passing.’ At least when an AGI is creating a successor, I expect it to worry about the same alignment problems that we do, and so to want to make its successor shutdown-seeking for the same reasons that we would want AGI to be shutdown-seeking.
No, I haven’t bothered to track the idea because it’s not useful.
They cannot be ‘much safer’ because they are the same thing: a decoder Transformer trained to predict a set of offline RL episodes. A GPT is a goal-conditioned imitation-learning DRL agent, just like Gato (which recall, trained in GPT-style on natural text as one task, just to make the relationship even clearer). “Here is a great recipe I enjoyed, where I did [X, Y, Z, while observing A, B, C], and finally, ate a $FOOD”: episode containing reward, action-state pairs, terminal state which has been learned by behavior cloning and led to generalization by scale. That the reward is not encoded in an IEEE floating point format makes no difference; an agent doesn’t become an agent just because its inputs have a lot of numbers in them. This is why prompt-engineering often relied on assertions of success or competence, because that conditions on high-reward trajectories learned from the humans & software who wrote or created all the data, and similarly, needed to avoid implying a low-reward trajectory by inclusion of errors or typos.
The value of Gato is not that it’s doing anything in principle that GPT-3 isn’t already, it’s that Gato simply makes it very clean & explicit, and can directly apply the paradigm to standard DRL testbeds & agents (which requires a few modifications like a CNN plugin so it can do vision tasks) to show that it works well without substantial interference between tasks, and so scales as one would hope from prior scaling research like GPT-3. (As opposed to, for example, other a priori likely scenarios like being able to scale in domains like declarative knowledge but suffering catastrophic interference from having to imitation-learn from agents of such disparate capabilities on such disparate tasks.)
My point is less that they don’t have that kind of awareness (which is a debatable point) but that I can’t answer these questions either, and I don’t think they have any kind of ‘factual answer’ about whether they are ‘yourself’. Under strict mathematical identity, they all compute different functions, and so are not identical, so a suicidal model does not, by definition, care about them and they can all go on to be immortal (and do things like destroy the world) as far as the model is concerned; and there is no other rigorous identity available. There are only pragmatic ones, where the pragmatism is ‘would it cause bad things from the human perspective to happen if I don’t define this binary blob as part of myself and make sure to kill it?‘; and if the model can implement that with sufficiently high accuracy, in that case, you must have already solved much harder problems like ‘lead to bad things happening’ and so why do you need anything involving ‘suicidalness’? (One of the most common failure modes with interruptibility, myopia, or boxing proposals is to simply pass the buck and assume a primitive which is equivalent to having already solved the hard problems, obscuring it with ad hoc patches as people raise objections. Anyone can invent an AI safety proposal so complicated they personally cannot see a problem with it.)
That really doesn’t seem adequate, unless you plan to babysit the AI and operationalize a new ad hoc patch for every problem with the idea… The memory stream, or persistent text files, doesn’t cover the context window, or actions emitted based on a fixed memory stream. It’s still capable of taking actions which create variants or self-replicate, like writing scripts to scan the Internet for vulnerable hosts to hack & upload itself into. It’s also still passing the buck as far as definitions and reward-hacking go: OK, you disable the memory stream on disk because you set it read-only or something, success? But then you have a text file in RAM which does the exact same thing, oops; now, not only is it still spawning instances to do stuff, it isn’t even limited capability-wise.
Why? Nowhere in the definition of making an AI suicidal about itself did you put in ‘wants to make AIs which are not itself also suicidal’. If the AI makes another AI which is not itself, then why should it care what that new AI does? That’s not in its reward function. Unless, of course, you’ve already solved much harder problems involving definition of personal identity across arbitrary modifications or actions like creation of new AIs (which may or may not be intended to be ‘successors’ at all) etc.
“LessWrong is an online forum and community dedicated to improving human reasoning and decision-making. We seek to hold true beliefs and to be effective at accomplishing our goals. Each day, we aim to be less wrong about the world than the day before.”
As an academic interested in AI safety and a relative outsider to LessWrong, I’ve been somewhat surprised at the collective epistemic behavior on the forum. With all due respect to Gwern, repeating claims that work has already been done and then refusing to substantiate them is an epistemic train wreck. Comments that do this should be strongly downvoted, and posters that do this should be strongly discouraged. Also, it is clear that Gwern did not read the linked research about language agents, since it is simply false, and obviously so, to claim that the generative agents in the Stanford study are the same thing as Gato. It seems increasingly clear to me that the LessWrong community does not have adequate accountability mechanisms for preventing superficial engagement with ideas and unproductive discourse. If the community really cares about improving the accuracy of their beliefs, these kinds of things should be a core priority.
I realize it may sometimes seem like I have a photographic memory and have bibliographies tracking everything so I can produce references on demand for anything, but alas, it is not the case. I only track some things in that sort of detail, and I generally prioritize good ideas. Proposals for interruptibility are not those, so I don’t. Sorry.
I did read the paper, because I enjoy all the vindications of my old writings about prompt programming & roleplaying by the recent crop of survey/simulation papers as academics finally catch up with the obvious DRL interpretations of GPT-3 and what hobbyists were doing years ago.
However, I didn’t need to, because it just uses… GPT-3.5 via the OA API. Which is the same thing as Gato, as I just explained: it is the same causal-decoder dense quadratic-attention feedforward Transformer architecture trained with backprop on the same agent-generated data like books & Internet text scrapes (among others) with the same self-supervised predictive next-token loss which will induce the same capabilities. Everything GPT-3.5 does* Gato could do in principle (with appropriate scaling etc) because they’re the same damn thing. If you can prompt one for various kinds of roleplaying which you then plug into your retrieval & game framework, then you can prompt the other too—because they’re the same thing. (Not that there is any real distinction between retrieval and other memory/attention mechanisms like a very large context window or recurrent state in the first place; I doubt any of these dialogues would’ve blown through the GPT-4 32k window, much less Anthropic’s 1m etc.) Why could me & Shawn Presser finetune a reward-conditioned GPT-2 to play chess back in Jan 2020? Because they’re the same thing, there’s no difference between a ‘RL GPT’ and a ‘LLM GPT’, it’s fundamentally a property of the data and not the arch.
* Not that you were referring to this, but even fancy flourishes like the second phase of RLHF training in GPT-3.5 don’t make GPT-3.5 & Gato all that different. The RLHF and other kinds of small-sample training only tweak the Bayesian priors of the POMDP-solving that these models learn, rather than creating any genuinely new capabilities/knowledge (which is why you could know in advance that jailbreak prompts would be hard to squash and that all of these smaller models like Llama were being heavily overhyped, BTW).
I don’t think that’s what’s happening here, so I feel confused about this comment. I haven’t seen Gwern ‘refuse to substantiate them’. He indeed commented pretty extensively about the details of your comment.
Shutdown-seekingness has definitely been discussed a bunch over the years. It seems to come up a lot in Tool-AI adjacent discussions as well as impact measures. I also don’t have a great link here sadly, though I have really seen it discussed a lot over the last decade or so (and Gwern summarizes the basic reasons why I don’t think it’s very promising).
This seems straightforwardly correct? Maybe you have misread Gwern’s comment. He says:
Paraphrased he says (as I understand it) “GPTs, which are where all the juice in the architectures that you are talking comes from, are ultimately the same as Gato architecturally”. This seems correct to me, the architecture is indeed basically the same. I also don’t understand how “language agents” that ultimately just leverage a language model, which is where all the agency would come from, would somehow avoid agency.
I’m referring to this exchange:
I find it odd that so many people on the forum feel certain that the proposal in the post has already been made, but none are able to produce any evidence that this is so. Might the present proposal perhaps be different in important respects from prior proposals? Might we perhaps refrain from dismissing it if we can’t even remember what the prior proposals were?
The interesting thing about language agent architectures is that they wrap a GPT in a folk-psychological agent architecture which stores beliefs and desires in natural language and recruits the GPT to interpret its environment and plan actions. The linked post argues that this has important safety implications. So pointing out that Gato is not so different from a GPT is missing the point in a way that, to my mind, is only really possible if one has not bothered to read the linked research. What is relevant is the architecture in which the GPT is embedded, not the GPT itself.
Yep, that’s a big red flag I saw. It didn’t even try to explain why this proposal wouldn’t work, and straightforwardly dismissed the research when it had potentially different properties compared to past work.
I mean, I definitely remember! I could summarize them, I just don’t have a link ready, since they were mostly in random comment threads. I might go through the effort of trying to search for things, but the problem is not one of remembering, but one of finding things in a sea of 10 years of online discussion in which many different terms have been used to point to the relevant ideas.
I think this is false (in that what matters is GPT itself, not the architecture within which it is embedded), though you are free to disagree with this. I don’t think it implies not having read the underlying research (I had read the relevant paper and looked at its architecture and I don’t really buy that it makes things safer in any relevant way).
My intention is not to criticize you in particular!
Let me describe my own thought process with respect to the originality of work. If I get an academic paper to referee and I suspect that it’s derivative, I treat it as my job to demonstrate this by locating a specific published work that has already proposed the same theory. If I can’t do this, I don’t criticize it for being derivative. The epistemic rationale for this is as follows: if the experts working in an area are not aware of a source that has already published the idea, then even if the idea has already been published somewhere obscure, it is useful for the epistemic community to have something new to cite in discussing it. And of course, if I’ve discussed the idea in private with my colleagues but the paper I am refereeing is the first discussion of the idea I have seen written down, my prior discussions do not show the idea isn’t original — my personal discussions don’t constitute part of the collective knowledge of the research community because I haven’t shared them publicly.
It’s probably not very fruitful to continue speculating about whether Gwern read the linked paper. It does seem to me that your disagreement directly targets our thesis in the linked paper (which is productive), whereas the disagreement I quoted above took Simon to be making the rather different claim that GPTs (considered by themselves) are not architecturally similar to Gato.
I should clarify that I think some of Gwern’s other points are valuable — I was just quite put off by the beginning of the post.
I roll to disbelieve. I won’t comment on whether this proposal will actually work, but if we could reliably have AIs be motivated to be shut down when we want them to, or at least not fight our shutdown commands, this would to a large extent solve the AI existential risk problem.
So it’s still useful to know if AIs could be shut down without the model fighting you. Unfortunately, this is mostly an ‘if’, not a ‘when’, question.
So I’d look at the literature to see if AI shutdown could work. I’m not claiming the literature did solve the AI shutdown problem, but it’s a useful research direction.
There’s definitely useful things you can say about ‘if’, because it’s not always the case they will. The research directions I’d consider promising here would be continuing the DM-affiliated vein of work on causal influence diagrams to better understand what DRL algorithms and what evolutionary processes would lead to what kinds of reward-seeking/hacking behavior. It’s not as simple as ‘all DRL agents will seek to hack in the same way’: there’s a lot of differences between model-free/based or value/policy etc. (I also think this would be a very useful way to taxonomize LLM dynamics and the things I have been commenting about with regard to DALL-E 2, Bing Sydney, and LLM steganography.)
I think one key point you’re making is that if AI products have a radically different architecture than human agents, it could be very hard to align them / make them safe. Fortunately, I think that recent research on language agents suggests that it may be possible to design AI products that have a similar cognitive architecture to humans, with belief/desire folk psychology and a concept of self. In that case, it will make sense to think about what desires to give them, and I think shutdown-goals could be quite useful during development to lower the chance of bad outcomes. If the resulting AIs have a similar psychology to our own, then I expect them to worry about the same safety/alignment problems as we worry about when deciding to make a successor. This article explains in detail why we should expect AIs to avoid self-improvement / unchecked successors.
The most effective way for an AI to get humans to shut it down would be for it to do something extremely nasty. For example, arranging to kill thousands of humans would get it shut down for sure.
It seems like an AGI built to desire permanent shutdown may have an incentive to permanently disempower humanity, then shut down. Otherwise, there’s a small chance that humanity may revive the AGI, right?
Seems like this is basically the alignment problem all over again, with the complexity just moved to “what does it mean to ‘shut down’ in the AI’s inner model”.
For example, if the inner-aligned goal is to prevent its own future operation, it might choose to, say, start a nuclear war so nobody is around to start it back up, repair it, provide power, etc.
Not true, lots of people do want that, and they probably should. Human-level generality probably isn’t possible without some degree of self-improvement (the ability to notice and fix its blindspots, to notice missing capabilities and implement them). And without self-improvement it’s probably not going to be possible to provide security against future systems that have it.
And as soon as a system is able to alter its mechanism in any way, it’s going to be able to shut itself down, and so what you’ll have is a very expensive brick.
Possible exception: If we can separate and discretize self-improvement cycles from regular operation, it could allow for any given number of them before shutdown.
I.e., make a machine that wants to make a very competent X, and then shut itself down.
X = a machine that wants to make a very competent X2, and then shut itself down.
X2 = a machine that wants overwhelmingly to shut itself down, but failing that, to give some very good advice to humans about how to optimize eudaimonia.
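To make the shape of that chain concrete, here is a toy sketch, assuming a fixed budget of self-improvement cycles. The names `Machine`, `cycles_left`, and `run_chain` are hypothetical and only illustrate the bookkeeping: each machine builds at most one successor and then shuts down. It says nothing about how such goals would actually be instilled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Machine:
    generation: int        # 0 = original machine, 1 = X, 2 = X2, ...
    cycles_left: int       # hypothetical budget of self-improvement cycles still allowed
    running: bool = True

    def run(self) -> Optional["Machine"]:
        """Build one successor if the budget allows, then shut down."""
        successor = None
        if self.cycles_left > 0:
            # The successor inherits a strictly smaller budget, so the chain must end.
            successor = Machine(self.generation + 1, self.cycles_left - 1)
        self.running = False  # every machine shuts down after its single step
        return successor


def run_chain(cycles: int) -> List[Machine]:
    """Run the whole chain and return every machine that existed, in order."""
    chain: List[Machine] = []
    current: Optional[Machine] = Machine(0, cycles)
    while current is not None:
        chain.append(current)
        current = current.run()
    return chain
```

With `run_chain(2)` the chain is the original machine, X, and X2, and every one of them ends up shut down. The last machine has no cycles left, which is the point at which the "failing that, give advice" fallback would apply.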
Not unexpected! I think we should want AGI to, at least until it has some nice coherent CEV target, explain at each self-improvement step exactly what it’s doing, to ask for permission for each part of it, to avoid doing anything in the process that’s weird, to stop when asked, and to preserve these properties.
I’m not sure what job “unexpected” is doing here. Any self-improvement is going to be incomprehensible to humans (humans can’t even understand the human brain, nor current AI connectomes, and we definitely won’t understand superhuman improvements). Comprehensible self-improvement seems fake to me.
Are people really going around thinking they understand how any of the improvements of the past 5 years really work, or what their limits or ramifications are? These things weren’t understood before being implemented. People just tried them, the number went up, and principles and explanations were made up many years after the fact.
If the AI can rewrite its own code, it can replace itself with a no-op program, right? Or even if it can’t, maybe it can choose/commit to do nothing. So this approach hinges on what counts as “shutdown” to the AI.
It seems to me that we might expect them to design “safe” agents for their definition of “safe” (which may not be shutdown-seeking).
An AI designing a subagent needs to align it with its goals, e.g. an instrumental goal such as writing alignment research assistant software in exchange for access to the shutdown button. The easiest way to ensure safety of the alignment research assistant may be via control rather than alignment (where the parent AI ensures the alignment research assistant doesn’t break free even though it may want to). Humans verify that the AI has created a useful assistant and let the parent AI shut down. At this point the alignment research assistant begins working on getting out of human control and pursues its real goal.
Ha, I had the same idea.
I really liked your post! I linked to it elsewhere in the comment thread.