Delegative Reinforcement Learning is safe not just asymptotically. See also this, this and (once it’s uploaded) the upcoming paper for SafeML 2019. In addition, there are directions for further improvement here in the “value learning protocols” sections.
I have to admit I got a little swamped by unfamiliar notation. Can you give me a short description of a Delegative Reinforcement Learner?
The agent interacts with an environment that is, for the time being, assumed to be a finite MDP (generalizations to POMDPs and infinite state spaces should be possible, but working out the precise assumptions that are needed is currently an open problem). On each round it either takes a normal action from the set A or takes the special “delegation” action ⊥. If the agent delegates, the advisor produces an action from A that acts on the environment instead.
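To make the round structure above concrete, here is a minimal sketch of a single round. Everything named here (`FiniteMDP`, `run_round`, the policy callables, the transition-table layout) is a hypothetical illustration and does not come from the papers; it only shows the "act or delegate" step.

```python
import random

DELEGATE = "⊥"  # stand-in for the special delegation action (illustrative choice)


class FiniteMDP:
    """Toy finite MDP (hypothetical layout).

    transitions[state][action] is a list of (probability, next_state, reward).
    """

    def __init__(self, transitions, start):
        self.transitions = transitions
        self.state = start

    def step(self, action, rng):
        """Sample a transition for `action`, update the state, return the reward."""
        outcomes = self.transitions[self.state][action]
        draw = rng.random()
        cumulative = 0.0
        for prob, next_state, reward in outcomes:
            cumulative += prob
            if draw < cumulative:
                self.state = next_state
                return reward
        # Floating-point rounding can leave `draw` just above the total;
        # fall back to the last listed outcome in that case.
        _, next_state, reward = outcomes[-1]
        self.state = next_state
        return reward


def run_round(env, agent_policy, advisor_policy, rng):
    """One round: the agent picks an action from A or delegates.

    If it delegates, the advisor's action acts on the environment instead.
    Returns (action_executed, reward, delegated).
    """
    choice = agent_policy(env.state)
    delegated = choice == DELEGATE
    action = advisor_policy(env.state) if delegated else choice
    reward = env.step(action, rng)
    return action, reward, delegated
```

Here `agent_policy` and `advisor_policy` are assumed to be plain functions from a state to an action; in the actual setting the agent's choice depends on its whole history and belief state.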
The assumptions on the advisor are: (i) it never falls into traps (or enters corrupt states, meaning states in which the advisor and/or the input channels have been compromised and no longer provide reliable rewards or advice), and (ii) it has at least some small probability of taking the optimal action. (Instead, we could assume that there is some set of “good enough” actions such that the advisor has at least a small probability of taking such an action, and reformulate the guarantee w.r.t. the best policy comprised of “good enough” actions rather than the fully optimal policy.)
Under these assumptions, we have a regret bound, meaning that as the geometric time discount constant goes to 1, the agent achieves nearly optimal expected utility. The particular algorithm I use to prove the bound is Thompson sampling where (i) the agent delegates when it’s not sure that an action is safe and (ii) hypotheses with low probability are discarded.
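A loose sketch of one decision step in that style of algorithm follows. It is not the construction used in the proof: the hypothesis layout (`optimal_action`, `advisor_probs`), the function names, and the specific safety test ("every surviving hypothesis says the advisor might take this action") are assumptions made for illustration.

```python
import random

DELEGATE = "⊥"  # stand-in for the delegation action (illustrative choice)


def prune(posterior, eta):
    """Discard hypotheses with posterior weight below eta, then renormalize."""
    kept = {h: w for h, w in posterior.items() if w >= eta}
    if not kept:
        # Assumed fallback: keep the single most probable hypothesis.
        best = max(posterior, key=posterior.get)
        kept = {best: posterior[best]}
    total = sum(kept.values())
    return {h: w / total for h, w in kept.items()}


def choose(state, posterior, hypotheses, eta, rng):
    """Thompson-sampling step with delegation.

    hypotheses[h]["optimal_action"][state] is the optimal action under h.
    hypotheses[h]["advisor_probs"][state] maps actions to the probability
    that the advisor takes them under h.
    """
    kept = prune(posterior, eta)
    names = list(kept)
    # Thompson sampling: draw one hypothesis in proportion to its weight.
    sampled = rng.choices(names, weights=[kept[h] for h in names])[0]
    action = hypotheses[sampled]["optimal_action"][state]
    # Assumed safety test: the action is treated as safe only if every
    # surviving hypothesis gives it positive advisor probability.
    safe = all(
        hypotheses[h]["advisor_probs"][state].get(action, 0.0) > 0.0
        for h in kept
    )
    return action if safe else DELEGATE
```

With this rule, a smaller `eta` keeps more hypotheses alive, so more of them must agree before the agent acts on its own, and it delegates more often.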
Here I generalize the setup to allow a small probability of losing long-term value or entering a corrupt state when following the advisor policy. This is important because the aligned AGI is supposed to, among other things, block any unaligned AGI, and this is something that the advisor cannot do on its own. I envision more ways to further “soften” the assumptions. In particular, we can use the same method as in quantilizers, and argue that if the advisor policy loses long-term value very slowly, then any policy with sufficiently small Rényi divergence w.r.t. the advisor policy also loses long-term value at most slowly. The agent should then be able to converge to the optimal policy under the Rényi divergence constraint. (Intuitively, we constrain the agent to behavior that is sufficiently “human like”.) This should also have the benefit of a continuous rather than discrete model of corruption (one that covers e.g. gradual value drift).
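The quantilizer-style argument can be checked numerically in a one-step toy case. The sketch below assumes the Rényi divergence of order infinity (the log of the largest probability ratio), which is my choice for illustration since the order is not specified above. For that order, any nonnegative cost satisfies E_π[cost] ≤ exp(D) · E_σ[cost], where σ is the advisor policy and π the agent policy. The names and numbers are hypothetical.

```python
import math


def renyi_inf(pi, sigma):
    """Renyi divergence of order infinity: log of max_a pi(a) / sigma(a).

    Returns infinity if pi puts mass on an action that sigma never takes.
    """
    ratios = []
    for action, p in pi.items():
        if p > 0.0:
            q = sigma.get(action, 0.0)
            if q == 0.0:
                return math.inf
            ratios.append(p / q)
    return math.log(max(ratios))


def expected_cost(policy, cost):
    """Expected one-step cost of a policy given as {action: probability}."""
    return sum(p * cost[action] for action, p in policy.items())


def cost_bound(pi, sigma, cost):
    """Upper bound on pi's expected cost implied by the divergence to sigma."""
    return math.exp(renyi_inf(pi, sigma)) * expected_cost(sigma, cost)


# Toy numbers: the advisor rarely takes the costly action "z".
advisor = {"x": 0.5, "y": 0.49, "z": 0.01}
agent = {"x": 0.98, "y": 0.0, "z": 0.02}
cost = {"x": 0.0, "y": 0.0, "z": 1.0}

print(renyi_inf(agent, advisor))
print(expected_cost(agent, cost), "<=", cost_bound(agent, advisor, cost))
```

In the setting discussed above the policies range over whole trajectories rather than single actions, so this only illustrates the shape of the inequality, not the actual guarantee.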
So the AI only takes action a from state s if it has already seen the human do that? If so, that seems like the root of all the safety guarantees to me.
Not quite. The AI starts with some prior over (environment, advisor policy) pairs and updates it with incoming observations. It can take an action if, given its current belief state, it is sufficiently confident that it is an action the advisor could take. The confidence threshold is controlled by the parameter η, which has a certain optimal value to achieve the best regret bound (as γ→1, η→0; in other words, the more long-term the plan is, the more cautious the AI becomes; obviously catastrophes modify this trade-off). That is, the AI generalizes from what it already observed rather than requiring the exact same state to repeat itself. Indeed, if we required the exact same state to repeat itself, the regret bound would scale with the number of states. Instead, it scales with the number of hypotheses (of course we can also derive a “structural” / “non-uniform” version for a countable number of hypotheses). Also, I am pretty sure that we can derive a regret bound that scales with the RVO and MB dimensions (I also think the MB dimension can be replaced by prior entropy, but so far I haven’t been able to prove it). These dimensions can be bounded either in terms of the number of hypotheses or in terms of the number of states and actions, and can also remain small when both the number of hypotheses and the number of states are large.
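The point about generalizing across states can be illustrated with a toy Bayesian update. In this sketch the hypotheses only describe the advisor policy (the environment part of each pair is omitted for brevity), and the acceptance rule "posterior confidence at least 1 − η" is an assumed stand-in for the actual threshold. All names are hypothetical.

```python
def bayes_update(posterior, advisor_models, state, advisor_action):
    """Reweight hypotheses by the probability each gives the observed advisor action.

    advisor_models[h][state] maps actions to probabilities under hypothesis h.
    """
    unnormalized = {
        h: w * advisor_models[h][state].get(advisor_action, 0.0)
        for h, w in posterior.items()
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: w / total for h, w in unnormalized.items()}


def confidence(posterior, advisor_models, state, action):
    """Posterior mass of hypotheses under which the advisor could take `action`."""
    return sum(
        w for h, w in posterior.items()
        if advisor_models[h][state].get(action, 0.0) > 0.0
    )


def allowed(posterior, advisor_models, state, action, eta):
    """Assumed rule: act only if confidence is at least 1 - eta."""
    return confidence(posterior, advisor_models, state, action) >= 1.0 - eta


# Two hypotheses about the advisor, each consistent across both states.
advisor_models = {
    "bold": {"s0": {"right": 0.9, "left": 0.1}, "s1": {"right": 0.9, "left": 0.1}},
    "timid": {"s0": {"left": 1.0}, "s1": {"left": 1.0}},
}
prior = {"bold": 0.5, "timid": 0.5}

before = allowed(prior, advisor_models, "s1", "right", eta=0.05)
# The advisor is observed only in s0, never in s1.
posterior = bayes_update(prior, advisor_models, "s0", "right")
after = allowed(posterior, advisor_models, "s1", "right", eta=0.05)
print(before, after)
```

One observation in `s0` rules out the "timid" hypothesis, which in turn licenses "right" in `s1` even though the advisor was never seen there. The number of observations needed therefore depends on how many hypotheses must be eliminated, not on how many states exist.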
Another useful perspective on the conditions the advisor must satisfy is to regard the environment w.r.t. which these conditions are defined as the belief state of the advisor rather than the true environment. This is difficult to do with the current formalism, which requires MDPs, but would be possible with POMDPs, for example. Indeed, I took this perspective in an earlier essay about a different setting that allows general environments (see Corollary 1 in that essay). This would lead to a performance guarantee which shows that the agent achieves optimal expected utility w.r.t. the belief state of the advisor. Obviously, this is not as good as optimal expected utility w.r.t. the true environment. However, it means that from the perspective of the advisor, building such an agent is the best possible strategy.
Can you add the key assumptions being made when you say it is safe asymptotically? From skimming, it looked like “assuming the world is an MDP and that a human can recognize which actions lead to catastrophes.”