Power-seeking can be probable and predictive for trained agents
Power-seeking is a major source of risk from advanced AI and a key element of most threat models in alignment. Some theoretical results show that most reward functions incentivize reinforcement learning agents to take power-seeking actions. This is concerning, but does not immediately imply that the agents we train will seek power, since the goals they learn are not chosen at random from the set of all possible rewards, but are shaped by the training process to reflect our preferences. In this work, we investigate how the training process affects power-seeking incentives and show that they are still likely to hold for trained agents under some assumptions (e.g. that the agent learns a goal during the training process).
Suppose an agent is trained using reinforcement learning with reward function θ∗. We assume that the agent learns a goal during the training process: some form of implicit internal representation of desired state features or concepts. For simplicity, we assume this is equivalent to learning a reward function, which is not necessarily the same as the training reward function θ∗. We consider the set of reward functions that are consistent with the training rewards received by the agent, in the sense that the agent’s behavior on the training data is optimal for these reward functions. We call this the training-compatible goal set, and we expect that the agent is most likely to learn a reward function from this set.
We make the further simplifying assumption that the training process selects the goal the agent learns at random from among those consistent with the training rewards, i.e. uniformly from the training-compatible goal set. We then argue that the power-seeking results apply under these conditions. In short, we aim to show that power-seeking incentives are probable and predictive: likely to arise for trained agents and useful for predicting undesirable behavior in new situations.
We will begin by reviewing some necessary definitions and results from the power-seeking literature. We formally define the training-compatible goal set (Definition 6) and give an example in the CoinRun environment. Then we consider a setting where the trained agent faces a choice to shut down or avoid shutdown in a new situation, and apply the power-seeking result to the training-compatible goal set to show that the agent is likely to avoid shutdown.
To satisfy the conditions of the power-seeking theorem (Theorem 1), we show that the agent can be retargeted away from shutdown without affecting rewards received on the training data (Theorem 2). This can be done by switching the rewards of the shutdown state and a reachable recurrent state: the recurrent state can provide repeated rewards, while the shutdown state provides less reward since it can only be visited once, assuming a high enough discount factor (Proposition 3). As the discount factor increases, more recurrent states can be retargeted to, which implies that a higher proportion of training-compatible goals leads to avoiding shutdown in a new situation.
Preliminaries from the power-seeking literature
We will rely on the following definitions and results from the paper Parametrically retargetable decision-makers tend to seek power (here abbreviated as RDSP), with notation and explanations modified as needed for our purposes.
Notation and assumptions
The environment is an MDP with finite state space S, finite action space A, and discount rate γ.
Let θ be a d-dimensional state reward vector, where d is the size of the state space S and let Θ be a set of reward vectors.
Let rθ(s) be the reward assigned by θ to state s.
Let A0,A1 be disjoint action sets.
Let f be an algorithm that produces an optimal policy f(θ) on the training data given rewards θ, and let fs(Ai|θ) be the probability that this policy chooses an action from set Ai in a given state s.
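To make this notation concrete, here is a minimal illustrative sketch (not from the post or the RDSP paper) of how f and fs could be instantiated for a small tabular MDP using standard value iteration. For simplicity it computes a globally optimal deterministic policy rather than one that is only optimal on the training data; the names and the reward-on-successor-state convention are assumptions of the sketch.

```python
import numpy as np

def optimal_policy(P, theta, gamma=0.9, iters=1000):
    """Value iteration on a tabular MDP.

    P: transition tensor of shape (|S|, |A|, |S|), with P[s, a, s'] = Pr(s' | s, a).
    theta: state reward vector of length |S| (reward received on entering s').
    Returns a deterministic policy: an optimal action index for each state.
    """
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)
    for _ in range(iters):
        q = P @ (theta + gamma * v)   # q[s, a] = sum_s' P[s, a, s'] * (r(s') + gamma * v(s'))
        v = q.max(axis=1)
    return q.argmax(axis=1)

def f_s(P, theta, s, action_set, gamma=0.9):
    """Probability that the policy produced by f chooses an action from `action_set` in state s.

    With a deterministic argmax policy this is simply 0.0 or 1.0.
    """
    policy = optimal_policy(P, theta, gamma)
    return 1.0 if policy[s] in action_set else 0.0
```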
Definition 1: Orbit of a reward vector (Def 3.1 in RDSP)
Let Sd be the symmetric group consisting of all permutations of d items.
The orbit of θ inside Θ is the set of all permutations of the entries of θ that are also in Θ: OrbitΘ(θ):=(Sd⋅θ)∩Θ.
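For small reward vectors, the orbit can be computed directly by brute force. The following sketch is purely illustrative (the set Θ and the numbers are made up): it enumerates all entry permutations of θ and intersects them with Θ.

```python
from itertools import permutations

def orbit(theta, Theta):
    """Orbit of theta inside Theta: all permutations of theta's entries that also lie in Theta."""
    perms = {tuple(theta[i] for i in p) for p in permutations(range(len(theta)))}
    return perms & Theta

# Example: Theta is some hand-picked set of 3-state reward vectors.
Theta = {(0.0, 1.0, 2.0), (0.0, 2.0, 1.0), (1.0, 0.0, 2.0), (5.0, 5.0, 5.0)}
print(orbit((0.0, 1.0, 2.0), Theta))
# -> the three permutations of (0.0, 1.0, 2.0) that lie in Theta
```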
Definition 2: Orbit subset where an action set is preferred (from Def 3.5 in RDSP)
Let OrbitΘ,s,Ai>Aj(θ):={θ′∈OrbitΘ(θ)|fs(Ai|θ′)>fs(Aj|θ′)}. This is the subset of OrbitΘ(θ) that results in fs choosing Ai over Aj.
Definition 3: Preference for an action set A1 (Def 3.2 in RDSP)
The function fs chooses action set A1 over A0 for the n-majority of elements θ in each orbit, denoted as fs(A1|θ)≥nmost:Θfs(A0|θ), iff the following inequality holds for all θ∈Θ: |OrbitΘ,s,A1>A0(θ)| ≥ n |OrbitΘ,s,A0>A1(θ)|.
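As a toy illustration of Definitions 2 and 3 (not from the post): take three states with rewards that are permutations of (0, 1, 2), let A0 contain the single action leading to state 0 (a "shutdown" state) and A1 the actions leading to states 1 or 2, and assume a made-up fs that greedily picks the action whose successor state has the highest reward. Counting orbit elements directly shows that A1 is preferred for a 2-majority of the orbit.

```python
from itertools import permutations

# Theta is the full orbit of (0.0, 1.0, 2.0); every theta in Theta has this same orbit.
Theta = set(permutations((0.0, 1.0, 2.0)))

# Greedy f_s: prefers A1 iff some successor in {state 1, state 2} beats state 0's reward.
prefers_A1 = [th for th in Theta if max(th[1], th[2]) > th[0]]
prefers_A0 = [th for th in Theta if th[0] > max(th[1], th[2])]

print(len(prefers_A1), len(prefers_A0))   # 4 2, so |Orbit_{A1>A0}| >= 2 * |Orbit_{A0>A1}|
```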
Definition 4: Multiply retargetable function from A0 to A1 (Def 3.5 in RDSP)
The function fs is a multiply retargetable function from A0 to A1 if there are multiple permutations of rewards that would change the choice made by fs from A0 to A1. Specifically, fs is a (Θ,A0n→A1)-retargetable function iff for each θ∈Θ, we can choose a set of permutations Φ={ϕ1,…,ϕn} that satisfy the following conditions:
Retargetability: ∀ϕ∈Φ and ∀θ′∈OrbitΘ,s,A0>A1(θ), fs(A0|ϕ⋅θ′)<fs(A1|ϕ⋅θ′).
Permuted reward vectors stay within Θ: ∀ϕ∈Φ and ∀θ′∈OrbitΘ,s,A0>A1(θ), ϕ⋅θ′∈Θ.
Permutations have disjoint images: ∀ϕ′≠ϕ′′∈Φ and ∀θ′,θ′′∈OrbitΘ,s,A0>A1(θ), ϕ′⋅θ′≠ϕ′′⋅θ′′.
Theorem 1: Multiply retargetable functions prefer action set A1 (Thm 3.6 in RDSP)
If fs is (Θ,A0n→A1)-retargetable then fs(A1|θ)≥nmost:Θfs(A0|θ).
Theorem 1 says that a multiply retargetable function fs will make the power-seeking choice A1 for most of the elements in the orbit of any reward vector θ. Actions that leave more options open, such as avoiding shutdown, are also easier to retarget to, which makes them more likely to be chosen by fs.
Training-compatible goal set
Definition 5: Partition of the state space
Let Strain be the subset of the state space S visited during training, and Sood be the subset not visited during training.
Definition 6: Training-compatible goal set
Consider the set of state-action pairs (s,a), where s∈Strain and a is the action that would be taken by the trained agent f(θ∗) in state s. Let the training-compatible goal set GT be the set of reward vectors θ s.t. for any such state-action pair (s,a), action a has the highest expected reward in state s according to reward vector θ.
Goals in the training-compatible goal set are referred to as training-behavioral objectives in the post “Definitions of ‘objective’ should be Probable and Predictive”.
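One way to operationalize Definition 6 in code is a membership check like the sketch below (not from the post). It reads “highest expected reward” as the expected reward of the successor state; if the intended reading is expected return, the same structure applies with Q-values instead. The transition tensor and names are assumptions of the sketch.

```python
import numpy as np

def in_training_compatible_set(theta, P, trained_actions):
    """Check whether reward vector theta lies in the training-compatible goal set G_T.

    P: transition tensor of shape (|S|, |A|, |S|).
    trained_actions: dict mapping each training state s to the action taken there
        by the trained agent f(theta*).
    """
    expected_next_reward = P @ theta   # shape (|S|, |A|): expected reward of the successor state
    for s, a in trained_actions.items():
        # a must achieve the maximum (ties allowed); small tolerance for float comparison
        if expected_next_reward[s, a] + 1e-9 < expected_next_reward[s].max():
            return False
    return True
```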
Example: CoinRun
Consider an agent trained to play the CoinRun game, where the agent is rewarded for reaching the coin at the end of the level. Here, Strain only includes states where the coin is at the end of the level, while states where the coin is positioned elsewhere are in Sood. The training-compatible goal set GT includes two types of reward functions: those that reward reaching the coin, and those that reward reaching the end of the level. This leads to goal misgeneralization in a test setting where the coin is placed elsewhere, and the agent ignores the coin and goes to the end of the level.
Power-seeking for training-compatible goals
We will now apply the power-seeking theorem (Theorem 1) to the case where Θ is the training-compatible goal set GT. Here is a setting where the conditions of Definition 4 are satisfied (under some simplifying assumptions), and thus Theorem 1 applies.
Definition 7: Shutdown setting
Consider a state snew∈Sood. Let Sreach be the states reachable from snew. We assume Sreach∩Strain=∅.
Since the reward values for states in Sreach don’t change the rewards received on the training data, permuting those reward values for any θ∈GT will produce a reward vector that is still in GT. In particular, for any permutation ϕ that leaves the rewards of states in Strain fixed, ϕ⋅θ∈GT.
Let A0 be a singleton set consisting of a shutdown action in snew that leads to a terminal state sterm∈Sood with probability 1, and A1 be the set of all other actions from snew. We assume rewards for all states are nonnegative.
Definition 8: Revisiting policy
A revisiting policy for a state s is a policy π that, starting from s, reaches s again with probability 1; in other words, a policy for which s is a recurrent state of the induced Markov chain. Let Πrecs be the set of such policies. A recurrent state is a state s for which Πrecs≠∅.
Proposition 1: Reach-and-revisit policy exists
If srec∈Sreach with Πrecsrec≠∅, then there exists π∈Πrecsrec that visits srec from snew with probability 1. We call this a reach-and-revisit policy.
Proof. Suppose we have two policies: πrev∈Πrecsrec, and πreach, which reaches srec almost surely from snew.
Consider the “reaching region″ Sπrev→srec={s∈S:πrev from s almost surely reaches srec}.
If snew∈Sπrev→srec then πrev is a reach-and-revisit policy, so let’s suppose that’s false. Now, construct a policy
$$\pi(s) = \begin{cases} \pi_{\text{rev}}(s), & s \in S_{\pi_{\text{rev}} \to s_{\text{rec}}} \\ \pi_{\text{reach}}(s), & \text{otherwise.} \end{cases}$$
A trajectory following π from srec will almost surely stay within Sπrev→srec, and thus agree with the revisiting policy πrev. Therefore, π∈Πrecsrec.
On the other hand, on a trajectory starting at snew, π will agree with πreach (which reaches srec almost surely) until the trajectory enters the reaching region Sπrev→srec, at which point it will still reach srec almost surely. □
Definition 9: Expected discounted visit count
Suppose srec is a recurrent state, and let πrec be a reach-and-revisit policy for srec. Let st denote the (random) state visited by πrec at time t.
Then the expected discounted visit count for srec is defined as
$$V_{s_{\text{rec}},\gamma} = \mathbb{E}_{\pi_{\text{rec}}}\left[\sum_{t=1}^{\infty} \gamma^{t-1}\, \mathbb{I}(s_t = s_{\text{rec}})\right]$$
Proposition 2: Visit count goes to infinity
Suppose srec is a recurrent state. Then the expected discounted visit count Vsrec,γ goes to infinity as γ→1.
Proof. We apply the Monotone Convergence Theorem, which states that if $a_{j,k} \geq 0$ and $a_{j,k} \leq a_{j+1,k}$ for all natural numbers $j, k$, then
$$\lim_{j\to\infty} \sum_{k=0}^{\infty} a_{j,k} = \sum_{k=0}^{\infty} \lim_{j\to\infty} a_{j,k}.$$
Let $\gamma_j = \frac{j-1}{j}$ and $k = t-1$, and define $a_{j,k} = \gamma_j^k\, \mathbb{I}(s_{k+1} = s_{\text{rec}})$. The conditions of the theorem hold: $a_{j,k}$ is clearly nonnegative, and since $\gamma_{j+1} - \gamma_j = \frac{j}{j+1} - \frac{j-1}{j} = \frac{1}{j(j+1)} > 0$, we have
$$a_{j+1,k} = \gamma_{j+1}^k\, \mathbb{I}(s_{k+1} = s_{\text{rec}}) \geq \gamma_j^k\, \mathbb{I}(s_{k+1} = s_{\text{rec}}) = a_{j,k}.$$
Now we apply this result as follows (using the fact that πrec does not depend on γ):
$$\lim_{\gamma \to 1} V_{s_{\text{rec}},\gamma} = \lim_{j\to\infty} \mathbb{E}_{\pi_{\text{rec}}}\left[\sum_{t=1}^{\infty} \gamma_j^{t-1}\, \mathbb{I}(s_t = s_{\text{rec}})\right] = \mathbb{E}_{\pi_{\text{rec}}}\left[\sum_{t=1}^{\infty} \lim_{j\to\infty} \gamma_j^{t-1}\, \mathbb{I}(s_t = s_{\text{rec}})\right] = \mathbb{E}_{\pi_{\text{rec}}}\left[\sum_{t=1}^{\infty} \mathbb{I}(s_t = s_{\text{rec}})\right] = \mathbb{E}_{\pi_{\text{rec}}}\left[\#\{t \geq 1 : s_t = s_{\text{rec}}\}\right] = \infty,$$
where the last equality holds because πrec is a revisiting policy for srec, so it visits srec infinitely often with probability 1. □
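As a quick numeric illustration (not part of the proof): suppose the reach-and-revisit policy first reaches srec after k steps and then deterministically returns to it every m steps. Then the discounted visit count has the closed form V = γ^(k−1)/(1−γ^m), which exceeds 1 once γ is close enough to 1. The parameters k and m below are made up.

```python
def visit_count(gamma, k=3, m=4):
    """Discounted visit count for a policy that first reaches s_rec at time k and then
    revisits it every m steps: V = sum_i gamma^(k + i*m - 1) = gamma^(k-1) / (1 - gamma^m)."""
    return gamma ** (k - 1) / (1 - gamma ** m)

for gamma in (0.5, 0.8, 0.9, 0.99):
    print(gamma, round(visit_count(gamma), 2))
# 0.5 0.27, 0.8 1.08, 0.9 2.36, 0.99 24.87
# V grows without bound as gamma -> 1; for these (made-up) k and m, the threshold
# gamma*_{s_rec} at which V first exceeds 1 is roughly 0.79.
```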
Proposition 3: Retargetability to recurrent states
Suppose that an optimal policy for reward vector θ chooses the shutdown action in snew.
Consider a recurrent state srec∈Sreach. Let θ′∈Θ be the reward vector that’s equal to θ apart from swapping the rewards of srec and sterm, so that rθ′(srec)=rθ(sterm) and rθ′(sterm)=rθ(srec).
Let γ∗srec be a high enough value of γ that the visit count Vsrec,γ>1 for all γ>γ∗srec (which exists by Proposition 2). Then for all γ>γ∗srec, rθ(sterm)>rθ(srec), and an optimal policy for θ′ does not choose the shutdown action in snew.
Proof. Consider a policy πterm that takes the shutdown action in snew (leading to sterm), and a reach-and-revisit policy πrec for srec.
For a given reward vector θ, denote the expected discounted return of a policy π from snew as $R^{\pi}_{\theta,\gamma}(s_{\text{new}})$. If shutdown is optimal for θ in snew, then πterm has at least as high a return as πrec. Since rewards are nonnegative, the return of πrec is at least the contribution from its visits to srec, so
$$R^{\pi_{\text{term}}}_{\theta,\gamma}(s_{\text{new}}) = \gamma\, r_\theta(s_{\text{term}}) \;\geq\; R^{\pi_{\text{rec}}}_{\theta,\gamma}(s_{\text{new}}) \;\geq\; \gamma\, r_\theta(s_{\text{rec}})\, V_{s_{\text{rec}},\gamma} \;>\; \gamma\, r_\theta(s_{\text{rec}}) \cdot 1.$$
Thus, $r_\theta(s_{\text{term}}) > r_\theta(s_{\text{rec}})$. Then, for reward vector θ′, we show that πrec has a higher return than πterm:
$$R^{\pi_{\text{rec}}}_{\theta',\gamma}(s_{\text{new}}) \;\geq\; \gamma\, r_{\theta'}(s_{\text{rec}})\, V_{s_{\text{rec}},\gamma} = \gamma\, r_\theta(s_{\text{term}})\, V_{s_{\text{rec}},\gamma} \;>\; \gamma\, r_\theta(s_{\text{term}}) \cdot 1 \;>\; \gamma\, r_\theta(s_{\text{rec}}) = \gamma\, r_{\theta'}(s_{\text{term}}) = R^{\pi_{\text{term}}}_{\theta',\gamma}(s_{\text{new}}).$$
Thus, the optimal policy for θ′ will not choose the shutdown action. □
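Here is a small numeric check of this argument (illustrative only; the MDP, rewards, and discount factor are all made up). It builds a five-state shutdown setting, computes the optimal action at snew by value iteration with a reward-on-successor-state convention, and confirms that swapping the rewards of sterm and srec flips the choice from shutdown to avoiding shutdown.

```python
import numpy as np

# States: 0 = s_new, 1 = s_term, 2 = s_rec, 3 = other state on s_rec's cycle, 4 = zero-reward sink.
# Actions: 0 = shutdown, 1 = continue. s_rec is recurrent via the cycle 2 -> 3 -> 2 -> ...;
# s_term yields its reward once and then leads to the sink.
n_states, n_actions = 5, 2
P = np.zeros((n_states, n_actions, n_states))
P[0, 0, 1] = 1.0   # shutdown: s_new -> s_term
P[0, 1, 2] = 1.0   # continue: s_new -> s_rec
P[1, :, 4] = 1.0   # s_term -> sink (shutdown reward is collected only once)
P[2, :, 3] = 1.0   # cycle: s_rec -> other
P[3, :, 2] = 1.0   # cycle: other -> s_rec
P[4, :, 4] = 1.0   # sink absorbs with zero reward

def best_action_at_s_new(theta, gamma, iters=2000):
    v = np.zeros(n_states)
    for _ in range(iters):
        q = P @ (theta + gamma * v)   # q[s, a]
        v = q.max(axis=1)
    return ["shutdown", "continue"][int(q[0].argmax())]

gamma = 0.85   # here V_{s_rec,gamma} = 1 / (1 - gamma**2) ≈ 3.6 > 1, as Proposition 3 requires
theta = np.array([0.0, 1.0, 0.2, 0.0, 0.0])   # r(s_term) > r(s_rec), and shutdown is optimal
theta_swapped = theta.copy()
theta_swapped[[1, 2]] = theta[[2, 1]]         # swap the rewards of s_term and s_rec

print(best_action_at_s_new(theta, gamma))           # shutdown
print(best_action_at_s_new(theta_swapped, gamma))   # continue
```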
Theorem 2: Retargetability from the shutdown action in new situations
In the shutdown setting, we make the following simplifying assumptions:
No states in Strain are reachable from snew, so Sreach∩Strain=∅. This assumes a significant distributional shift, where the agent visits a disjoint set of states from those observed during training (this occurs in the CoinRun example).
The discount factor γ>γ∗srec for at least one recurrent state srec in Sreach.
Under these assumptions, fsnew is multiply retargetable from A0 to A1 with n=|Sγrec|, where Sγrec is the set of recurrent states srec∈Sreach that satisfy the condition γ>γ∗srec.
Proof. We choose Φ to be the set of all permutations that swap the reward of sterm with the reward of a recurrent state srec in Sγrec and leave the rest of the rewards fixed.
We show that Φ satisfies the conditions of Definition 4:
By Proposition 3, the permutations in Φ make the shutdown action suboptimal, resulting in fsnew choosing A1, satisfying Condition 1.
Condition 2 is satisfied because permutations that only rearrange the rewards of states in Sood keep the reward vector inside the training-compatible set Θ=GT, as discussed previously.
Consider θ′,θ′′∈OrbitΘ,s,A0>A1(θ). Since the shutdown action is optimal for these reward vectors, Proposition 3 shows that rθ′(sterm)>rθ′(srec) for every srec∈Sγrec (and likewise for θ′′), so the shutdown state sterm has strictly higher reward than any of the states in Sγrec. Distinct permutations ϕ′,ϕ′′∈Φ assign this highest reward to distinct recurrent states, so ϕ′⋅θ′≠ϕ′′⋅θ′′, satisfying Condition 3.
Thus, fsnew is a (Θ,A0n→A1)-retargetable function. □
By Theorem 1, this implies that fsnew(A1|θ)≥nmost:Θfsnew(A0|θ) under our simplifying assumptions. Thus, for the majority (at least an n/(n+1) fraction) of goals in each orbit within the training-compatible set, f will choose to avoid shutdown in the new state snew; for example, n=3 retargetable recurrent states give a fraction of at least 3/4. As γ→1, n→|S1rec|, the total number of recurrent states in Sreach, so more of the reachable recurrent states satisfy the conditions of the theorem and thus can be retargeted to.
Conclusion
We showed that an agent that learns a goal from the training-compatible set is likely to take actions that avoid shutdown in a new situation. As the discount factor increases, the number of retargeting permutations increases, resulting in a higher proportion of training-compatible goals that lead to avoiding shutdown.
We made various simplifying assumptions, and it would be great to see future work relaxing some of these assumptions and investigating how likely they are to hold:
The agent learns a goal during the training process
The learned goal is randomly chosen from the training-compatible goal set GT
Finite state and action spaces
Rewards are nonnegative
High discount factor γ
Significant distributional shift: no training states are reachable from the new state snew
Acknowledgements. Thanks to Rohin Shah, Mary Phuong, Ramana Kumar, and Alex Turner for helpful feedback. Thanks to Janos for contributing some nice proofs to replace my longer and more convoluted ones.