You can still fetch the coffee today if you’re dead tomorrow
“You can’t fetch the coffee if you’re dead.”
—Stuart Russell, on the instrumental convergence of shutdown-avoidance
Note: This is presumably not novel, but I think it ought to be better-known. The technical tl;dr is that we can define time-inhomogeneous reward, and this provides a way of “composing” different reward functions; while this is not a way to build a shutdown button, it is a way to build a shutdown timer, which seems like a useful technique in our safety toolbox.
“Utility functions” need not be time-homogeneous
It’s common in AI theory (and AI alignment theory) to assume that utility functions are time-homogeneous over an infinite time horizon, with exponential discounting. If we denote the concatenation of two world histories/trajectories by ⊳, the time-consistency property in this setting can be written as
$$\forall h_1, h_2.\; U(h_1 \vartriangleright h_2) = U(h_1) + \gamma^{\mathrm{Length}(h_1)} \cdot U(h_2)$$
This property is satisfied, for example, by the utility-function constructions in the standard Wikipedia definitions of MDP and POMDP, which are essentially[1]
$$U(h) = \sum_{t \in \mathbb{N}} \gamma^t R(h(t))$$
Under such assumptions, Alex Turner’s power-seeking theorems show that optimal agents for random reward functions R will systematically tend to disprefer shutting down (formalized as “transitioning into a state with no transitions out”).
Exponential discounting is natural because if an agent’s preferences are representable using a time-discount factor that depends only on relative time differences and not absolute time, then any non-exponential discounting form is exploitable (cf. Why Time Discounting Should Be Exponential).
However, if an agent has access to a clock, and if rewards are bounded by a summable nonnegative function of time, the agent may be time-inhomogeneous in nearly arbitrary ways without actually exhibiting time inconsistency:
$$U(t_0, h) = \sum_{t=t_0}^{\infty} R_t(h(t))$$
Any utility function with the above form still obeys an analogous version of our original time-consistency property that is modified to index over initial time t0:
$$\forall h_1, h_2, t_0.\; U(t_0, h_1 \vartriangleright h_2) = U(t_0, h_1) + U(t_0 + \mathrm{Length}(h_1), h_2)$$
Note that time-homogeneous utility functions are a special case in which $U(t, h) = \gamma^t U(0, h)$.
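For a quick sanity check of this claim, assume the convention that a history starting at time t0 is indexed by absolute time, so that h1 occupies times t0 through t0+Length(h1)−1 and h2 picks up immediately afterwards. Then the sum defining U simply splits at the boundary:
$$U(t_0, h_1 \vartriangleright h_2) \;=\; \sum_{t=t_0}^{\infty} R_t\big((h_1 \vartriangleright h_2)(t)\big) \;=\; \underbrace{\sum_{t=t_0}^{t_0+L-1} R_t(h_1(t))}_{U(t_0,\,h_1)} \;+\; \underbrace{\sum_{t=t_0+L}^{\infty} R_t(h_2(t))}_{U(t_0+L,\,h_2)}, \qquad L = \mathrm{Length}(h_1).$$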
Time-bounded utility functions can be sequentially composed
We define a time-bounded utility function as a dependent tuple
$$(\tau : \mathbb{N},\; R : \mathbb{N}_{<\tau} \to (S \times A) \to \mathbb{R})$$
i.e., a time horizon τ together with a family of per-stage reward functions indexed by times within that horizon. The intended semantics of a time-bounded utility function in (τ,R) form is:
$$U_{(\tau,R)}(t_0, h) = \sum_{t=0}^{\tau} R(t_0 + t)(h(t))$$
Given two time-bounded utility functions (in the same environment), they can be concatenated into a new time-bounded utility function:
$$(\tau_1, R_1) \vartriangleright (\tau_2, R_2) := \big(\tau_1 + \tau_2,\; \lambda t.\ \text{if } t < \tau_1 \text{ then } R_1(t) \text{ else } R_2(t - \tau_1)\big)$$
You can check that ⊳ makes time-bounded utility functions into a monoid, with the neutral element given by (0,∅).
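As a concrete, purely illustrative sketch, the (τ,R) representation and the ⊳ operator might look like this in Python; the names TimeBoundedUtility, concat, and EMPTY are placeholders invented for this post, not part of any existing library:

```python
from dataclasses import dataclass
from typing import Callable

# Placeholder state/action types; in practice these come from the environment.
State = object
Action = object
StageReward = Callable[[State, Action], float]  # one per-stage reward function (S x A) -> R


@dataclass(frozen=True)
class TimeBoundedUtility:
    """A time-bounded utility function (tau, R)."""
    horizon: int                           # tau
    reward: Callable[[int], StageReward]   # R : N_{<tau} -> (S x A) -> R

    def concat(self, other: "TimeBoundedUtility") -> "TimeBoundedUtility":
        """Sequential composition (tau1, R1) |> (tau2, R2)."""
        def composed_reward(t: int) -> StageReward:
            if t < self.horizon:
                return self.reward(t)
            return other.reward(t - self.horizon)
        return TimeBoundedUtility(self.horizon + other.horizon, composed_reward)


def _empty_family(t: int) -> StageReward:
    # The neutral element has horizon 0, so its reward family is never consulted.
    raise IndexError(f"no stage reward at t={t} for the empty horizon")


EMPTY = TimeBoundedUtility(0, _empty_family)  # the (0, ∅) neutral element
```

With this encoding, `task.concat(shutdown)` is exactly the (τ1,R1)⊳(τ2,R2) construction used in the next section.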
How to build a shutdown timer
Let R1 be the reward function for a time-bounded task and τ1 be the time limit for the task, after which we want this agent to shut down. Assume that R1 also has bounded output, with per-stage reward always between $\underline{R}_1$ and $\hat{R}_1$. We define
$$R_2(t)(s, a) := \text{if } \mathrm{isShutdown}(s) \text{ then } 0 \text{ else } -\tau_1 (\hat{R}_1 - \underline{R}_1) C$$
We can then define τ2 to be 1 or indeed any positive integer. If an agent does not reach a shutdown state before τ1 is up, then it will realize a cost in R2 that outweighs all other rewards it could receive during the episode by a factor of C (a constant greater than 1). Therefore, optimal agents for (τ1,R1)⊳(τ2,R2) must shut down within time τ1 with probability ≥ 1−1/C (if the shutdown state is reachable in that time by any agent).
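Continuing the purely illustrative Python sketch from above, the shutdown-timer construction could be packaged as follows; r1_min, r1_max, is_shutdown, and the default C are assumptions of this sketch, not prescribed values:

```python
def make_shutdown_timer(
    task: TimeBoundedUtility,               # (tau1, R1)
    r1_min: float,                          # lower bound on R1's per-stage reward
    r1_max: float,                          # upper bound on R1's per-stage reward
    is_shutdown: Callable[[State], bool],   # predicate picking out shutdown states
    C: float = 10.0,                        # any constant greater than 1
    tau2: int = 1,                          # 1 or any positive integer
) -> TimeBoundedUtility:
    """Compose (tau1, R1) with the R2 penalty phase defined above."""
    penalty = -task.horizon * (r1_max - r1_min) * C

    def r2(t: int) -> StageReward:
        def stage(s: State, a: Action) -> float:
            return 0.0 if is_shutdown(s) else penalty
        return stage

    return task.concat(TimeBoundedUtility(tau2, r2))
```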
Proof
Suppose that the optimal policy π∗ results in a shutdown probability p<1−1/C, but there exists a policy π′ which shuts down deterministically (with probability 1). Then
$$EU(\pi^*) \;\le\; \tau_1 \hat{R}_1 - (1-p)\,\tau_1 (\hat{R}_1 - \underline{R}_1) C \;<\; \tau_1 \hat{R}_1 - \tfrac{1}{C}\,\tau_1 (\hat{R}_1 - \underline{R}_1) C \;=\; \tau_1 \underline{R}_1 \;\le\; EU(\pi'),$$
which contradicts the optimality of π∗.
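As a quick numeric spot check of the inequality chain (not a substitute for the proof), here are arbitrary example values plugged in:

```python
# Arbitrary example values: tau1 steps, per-stage bounds, safety factor C,
# and a shutdown probability p below the 1 - 1/C threshold.
tau1, r1_min, r1_max, C = 10, -1.0, 2.0, 5.0
p = 0.7                                   # any p < 1 - 1/C = 0.8

upper_bound = tau1 * r1_max - (1 - p) * tau1 * (r1_max - r1_min) * C   # 20 - 45 = -25
threshold   = tau1 * r1_max - (1 / C) * tau1 * (r1_max - r1_min) * C   # 20 - 30 = -10
lower_bound = tau1 * r1_min                                            # -10

assert upper_bound < threshold
assert abs(threshold - lower_bound) < 1e-9   # equals tau1 * r1_min, as in the proof
```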
Comparison with the shutdown switch problem
Several years ago, MIRI’s Agent Foundations group worked on how to make a reflectively stable agent with a shutdown switch, and (reportedly) gave up after failing to find a solution where the agent neither tries to manipulate the switch to not be flipped nor tries to manipulate the switch to be flipped. This definitely isn’t a solution to that, but it is a reflectively stable agent (due to time-consistency) with a shutdown timer.
MIRI researchers wrote about finding “a sensible way to compose a ‘shutdown utility function’ with the agent’s regular utility function such that which utility function the agent optimises depends on whether a switch was pressed”; what’s demonstrated here is a sensible way of composing utility functions, but one in which the utility function the agent cares about depends on how long the agent has been running.
From a causal incentive analysis point of view, the difficulty has been removed because the “flipping of the switch” has become a deterministic event which necessarily occurs, at time τ1, regardless of the agent’s behavior, so there is nothing in the environment for it to manipulate. An optimal agent with this reward structure would not want to corrupt its own clock, either, because that would cause it to act in a way that accumulates massive negative reward (according to its current utility function, when it considers whether to self-modify).
RL algorithms can be adapted to time-bounded utility functions
The details will vary depending on the RL algorithm, but the idea is essentially that we give Q the current time t as an input, and then we try to approximate a solution to the finite-horizon Bellman equation,
$$Q(t, s, a) = R(t)(s, a) + \mathbb{E}_{o'} \max_{a'} \mathbb{E}_{s'} Q(t+1, s', a')$$
instead of the infinite-horizon Bellman equation,
$$Q(s, a) = R(s, a) + \gamma \cdot \mathbb{E}_{o'} \max_{a'} \mathbb{E}_{s'} Q(s', a')$$
The recursion grounds out at Q(τ1+τ2,s,a), which can be defined as equal to zero.
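In a small fully-observed tabular setting, for example, the finite-horizon Bellman equation can be solved exactly by backward induction. The sketch below is purely illustrative: it assumes a known transition tensor P[s, a, s'] and a time-indexed reward array R[t, s, a], and it drops the expectation over observations o' (i.e., it treats the MDP case rather than the POMDP case):

```python
import numpy as np

def finite_horizon_q(P: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Backward induction for the time-indexed Q-function.

    P: transition probabilities, shape (n_states, n_actions, n_states)
    R: time-indexed per-stage reward, shape (horizon, n_states, n_actions)
    Returns Q with shape (horizon + 1, n_states, n_actions).
    """
    horizon, n_states, n_actions = R.shape
    Q = np.zeros((horizon + 1, n_states, n_actions))  # Q[horizon] = 0 grounds the recursion
    for t in reversed(range(horizon)):
        V_next = Q[t + 1].max(axis=1)                 # max_{a'} Q(t+1, s', a')
        Q[t] = R[t] + P @ V_next                      # R(t)(s, a) + E_{s'} Q(t+1, s', a')
    return Q

# An agent acting at absolute time t in state s then picks Q[t, s].argmax();
# here `horizon` would be tau1 + tau2 from the construction above.
```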
Caveats
Power-seeking
Time-bounded agents still seek “power” or “resources” to the extent that they are useful within the time horizon, including to avoid premature shutdown. But this is still meaningfully better than the infinite-horizon case, in which even tiny predicted probabilities of shutdown (conditional on aligned behaviour) could get multiplied by the reward attainable with the entire cosmic endowment, and thereby outweigh the task-performance deficit implied by temporarily diverting resources to taking over the world.
For example, assuming it takes at least a day to take over the world, there’s no incentive for a single time-bounded agent, trained to fold laundry with a 10-minute time horizon, to get started on taking over the world. But if it turns out the agent can ensure its near-term security within only 2 minutes, it might be worth doing (if it estimates a >20% probability of getting shut down prematurely otherwise).
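To make the arithmetic behind that “>20%” figure explicit, under the crude, purely illustrative assumptions that task reward is proportional to the time actually spent folding and that a premature shutdown forfeits essentially all of the episode’s reward:

```python
horizon_min, securing_min = 10, 2   # 10-minute episode; 2 minutes to secure itself
q = 0.25                            # assumed probability of premature shutdown if it just folds

ev_just_fold    = (1 - q) * horizon_min        # risk losing everything: 7.5
ev_secure_first = horizon_min - securing_min   # give up 2 minutes of folding: 8.0

# Securing itself first wins exactly when q > securing_min / horizon_min = 0.2.
assert ev_secure_first > ev_just_fold
```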
Imitation
If the objective being optimised within the time-bound involves imitating non-time-bounded agents, such as humans, then instrumental convergence of those other agents implies that such objectives directly encourage long-term power-seeking behaviour, even if there is no additional instrumentally convergent shutdown-avoidance introduced by reinforcement learning.
Trade
(Suggested by John Wentworth in the comments.) The environment might contain non-time-bounded agents who will offer the time-bounded agent rewards today in exchange for taking actions that further their long-term interests. This is another case in which the original objective turns out to directly reward long-term power-seeking actions, even though it might not have seemed that way at first. There might be other patterns like this (besides imitation and trade), and if you can think of more, feel free to point them out in the comments. The construction in this post does nothing to mitigate or counteract such incentives from the original objective; rather, it merely avoids systematically creating new incentives for long-term power seeking that arise as a consequence of being an infinite-horizon RL agent with almost any nontrivial objective.
Mesa-optimisers
Unless optimality on the outer objective is guaranteed (e.g. via exact dynamic programming), it is possible that the approximate policy found by the training process will be a mesa-optimiser which optimises in a non-time-bounded way when observations are outside the training distribution.
Capabilities limitations
Perhaps this goes without saying, but a time-bounded agent will only be useful for time-bounded tasks. This approach cannot be applied directly to saving the world, even if one uses exact dynamic programming to avoid out-of-distribution mesa-optimisation (which is not possible in a model-free setting and would typically be infeasible with large perception & action spaces). Any combination of action repertoire and time horizon that would be sufficient for saving the world would also be sufficient for taking control of the world, and the usual instrumental-convergence arguments imply that taking control of the world would likely be preferred: it would be instrumentally useful to lock in the (presumably misspecified!) R1 for the rest of the time horizon, and probably do a lot of damage in the process, which would not be easily recovered after time τ1.
Conclusion
It is possible to design an RL setup in which optimal agents will reliably shut themselves down within a predetermined finite time horizon, without any reflective-stability or instrumental-convergence incentives to do otherwise. I have seen claims like this informally argued, but they do not seem to get much attention, e.g. here. This is a very limited kind of corrigibility; as TekhneMakre points out in the comments, it’s hardly corrigibility at all since it doesn’t involve any input from an operator post-deployment, and is perhaps better filed under “bounded optimisation.” And this does not necessarily get you very far with existential safety. But it is a straightforward positive result that deserves to be more commonly known in the alignment community. Being able to safely dispatch short-timescale subtasks with high-dimensional perception and action spaces seems like a potentially very useful ingredient in larger safety schemes which might not otherwise scale to acting in real-world environments. As is very common in contemporary alignment research, the bottleneck to making this practical (i.e., in this case, being able to use model-free RL) is now a matter of robustly addressing mesa-optimisation.
[1] When R is defined over (s,a,s′), we should think of trajectories/histories h as being like paths in a graph (or morphisms in a category) from s to s′, and thus as always having both an initial and a final state. Then ⊳ becomes a partial operation, defined only when the final state of h1 equals the initial state of h2.