One subject I like to harp on is reinforcement learning with traps (actions that cause irreversible long-term damage). Traps are important for two reasons. One is that the presence of traps is at the heart of the AI risk concept: attacks on the user, corruption of the input/reward channels, and harmful self-modification can all be conceptualized as traps. Another is that without understanding traps we can’t understand long-term planning, which is a key ingredient of goal-directed intelligence.
In general, a prior that contains traps will be unlearnable, meaning that no algorithm has Bayesian regret going to zero in the γ→1 limit. The only obvious natural requirement for RL agents in this case is approximating Bayes-optimality. However, Bayes-optimality is not even “weakly feasible”: it is NP-hard when using the number of states and the number of hypotheses as security parameters. IMO, the central question is: what kind of natural tractable approximations are there?
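For concreteness, here is one way to write down the learnability criterion referred to above; the normalization and notation are my own sketch rather than necessarily the exact definition intended.

```latex
% Bayesian regret of a policy \pi w.r.t. the prior \zeta at discount \gamma,
% with values normalized by (1-\gamma) so they remain bounded as \gamma \to 1:
\mathrm{BR}_\zeta(\pi,\gamma) \;:=\; \mathbb{E}_{\mu\sim\zeta}\!\left[(1-\gamma)\bigl(V^{*}_{\mu}(\gamma)-V^{\pi}_{\mu}(\gamma)\bigr)\right]
% \zeta is learnable iff there is a family of policies \pi_\gamma with
% \lim_{\gamma\to 1} \mathrm{BR}_\zeta(\pi_\gamma,\gamma) = 0.
```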
Although a generic prior with traps is unlearnable, some priors with traps are learnable. Indeed, it can happen that it’s possible to study the environment in a predictably safe way that is guaranteed to produce enough information about the irreversible transitions. Intuitively, as humans we do often use this kind of strategy. But it is NP-hard even to check whether a given prior is learnable. Therefore, it seems natural to look for particular types of learnable priors that are efficiently decidable.
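As a toy illustration of such a prior (this construction is entirely mine, not from the text): suppose two hypotheses disagree about which of two doors is a trap, but both agree that a “peek” action is reversible and reveals the answer. Then the prior contains traps yet is learnable, because peeking first is predictably safe.

```python
# Toy illustration (hypothetical construction): a two-hypothesis prior with traps
# that is nevertheless learnable, because a reversible "peek" action identifies
# the trap before the agent has to commit to a door.

HYPOTHESES = {
    "h1": {"trap_door": "left"},   # under h1, entering the left door is the trap
    "h2": {"trap_door": "right"},  # under h2, entering the right door is the trap
}

def learn_and_act(true_hypothesis: str) -> str:
    """Predictably safe study: peek first (safe under both hypotheses),
    then commit to the door that is safe under the revealed hypothesis."""
    observed_trap = HYPOTHESES[true_hypothesis]["trap_door"]  # what the peek shows
    return "right" if observed_trap == "left" else "left"

if __name__ == "__main__":
    for h in HYPOTHESES:
        print(h, "-> enter the", learn_and_act(h), "door")
```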
In particular, consider the following setting, which I call “expanding safety envelope” (XSE). Assume that each hypothesis in the prior ζ is “decorated” by a set F of state-action pairs s.t. (i) any (s,a)∈F is safe, i.e. the leading term of Q(s,a,γ) in the γ→1 expansion is maximal (ii) for each s∈S, there is (s,a)∈F s.t. a is Blackwell-optimal for s (as a special case we can let F contain all safe actions). Imagine an agent that takes random actions among those a priori known to be in F. If there is no such action, it explodes. Then, it is weakly feasible to check (i) whether the agent will explode (ii) for each hypothesis, to which sets of states it can converge. Now, let the agent update on the transition kernel observed within the set it converged to. This may lead to new actions becoming certainly known to be in F. We can then let the agent continue exploring using this new set. Iterating this procedure, the agent either discovers enough safe actions to find an optimal policy, or it doesn’t. Importantly, deciding this is weakly feasible. This is because, for each hypothesis, (i) on the first iteration the possible asymptotic state sets are disjoint (ii) on subsequent iterations we might as well assume they are disjoint, since it’s possible to see that if you reach a particular state of an asymptotic state set, then you can add the entire state set (this modification will not create new final outcomes and will only eliminate final outcomes that are better than those remaining). Therefore the number of asymptotic state sets you have to store on each iteration is bounded by the total number of states.
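Below is a minimal sketch of the XSE iteration in a toy tabular setting. Everything here (the data structures, the deterministic hypotheses, the way hypotheses are eliminated) is my own simplification for illustration, not a faithful implementation of the procedure above.

```python
# Minimal sketch of the XSE iteration in a toy tabular setting (my own simplification).

from typing import Dict, Set, Tuple

State, Action = int, int
SA = Tuple[State, Action]

class Hypothesis:
    def __init__(self, transitions: Dict[SA, State], safe: Set[SA]):
        self.transitions = transitions  # deterministic kernel, assumed defined for every (s, a)
        self.safe = safe                # the decoration F of this hypothesis

def certainly_safe(hypotheses: Dict[str, Hypothesis]) -> Set[SA]:
    """State-action pairs that lie in F under every surviving hypothesis."""
    sets = [h.safe for h in hypotheses.values()]
    return set.intersection(*sets) if sets else set()

def xse(hypotheses: Dict[str, Hypothesis], true_name: str, start: State) -> Set[SA]:
    """Expand the envelope of certainly-safe actions by exploring inside it
    and updating on the observed transitions, until it stops growing."""
    truth = hypotheses[true_name]
    envelope = certainly_safe(hypotheses)
    while True:
        # Explore within the envelope: collect every transition reachable from
        # `start` using only certainly-safe actions (a crude stand-in for taking
        # random actions in F and converging to an asymptotic state set).
        observed: Dict[SA, State] = {}
        frontier, visited = [start], {start}
        while frontier:
            s = frontier.pop()
            for (s2, a) in envelope:
                if s2 != s:
                    continue
                s_next = truth.transitions[(s, a)]
                observed[(s, a)] = s_next
                if s_next not in visited:
                    visited.add(s_next)
                    frontier.append(s_next)
        # Update: discard hypotheses inconsistent with the observed kernel
        # (the true hypothesis always survives).
        hypotheses = {
            name: h for name, h in hypotheses.items()
            if all(h.transitions.get(sa) == nxt for sa, nxt in observed.items())
        }
        new_envelope = certainly_safe(hypotheses)
        if new_envelope == envelope:
            return envelope  # no new certainly-safe actions were discovered
        envelope = new_envelope
```

In this toy version hypotheses are deterministic and fully specified, so “converging to an asymptotic state set” degenerates into reachability analysis; the point is only the control flow: explore inside the envelope of certainly-safe actions, update on what was observed, recompute the envelope, repeat.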
The next questions are (i) what kind of regret bounds can we prove for decorated priors that are XSE-learnable? (ii) given an arbitrary decorated prior, is it possible to find the maximal-probability-mass set of hypotheses that is XSE-learnable? I speculate that the second question might turn out to be related to the unique games conjecture. By analogy with other optimization problems that are feasible only when the maximal score can be achieved, maybe the UGC implies that we cannot find the maximal set but can find a set that is approximately maximal, with an optimal approximation ratio (using a sum-of-squares algorithm). Also, it might make sense to formulate stronger desiderata reflecting that, if the agent assumes a particular subset of the prior but discovers that it was wrong, it will still do its best from then on. That is, in this case the agent might fall into a trap, but at least it will try to avoid further traps.
This has implications even for learning without traps. Indeed, most known theoretical regret bounds involve a parameter that measures how costly a mistake it is possible to make. This parameter can manifest as the MDP diameter, the bias span or the mixing time. Such regret bounds seem unsatisfactory, since the worst-case mistake determines the entire guarantee. We can take the perspective that such costly but reversible mistakes are “quasi-traps”: not actual traps, but trap-like on short timescales. This suggests that applying an approach like XSE to quasi-traps should lead to qualitatively stronger regret bounds. Such regret bounds would imply learning faster from less data, and in episodic learning they would imply learning inside each episode, something that is notoriously absent in modern episodic RL systems like AlphaStar.
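As a reference point for how the worst-case mistake enters such guarantees: if I recall it correctly, the classic UCRL2 bound of Jaksch, Ortner and Auer (2010) scales linearly in the MDP diameter D.

```latex
% Regret of UCRL2 over horizon T in a communicating MDP with S states,
% A actions and diameter D (Jaksch, Ortner & Auer, 2010):
\mathrm{Regret}(T) \;=\; \tilde{O}\!\left(D\, S \sqrt{A\, T}\right)
% A single costly-but-reversible "quasi-trap" can inflate D, and with it the
% entire bound, even if sensible policies never go near it.
```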
Moreover, we can also use this to do away with ergodicity assumptions. Ergodicity assumptions require the agent to “not wander too far” in state space, in the simplest case because the entire state space is small. But instead of constraining “wandering far” from a fixed place in state space, we can constrain “wandering far” w.r.t. the optimal trajectory. Combined with XSE, this should lead to guarantees that depend on the prevalence of irreversible and quasi-irreversible departures from this trajectory.
In multi-armed bandits and RL theory, there is a principle known as “optimism in the face of uncertainty”. This principle says you should always make optimistic assumptions: if you are wrong, you will find out (because you will get less reward than you expected). It explicitly underlies UCB algorithms and is implicit in other algorithms, like Thompson sampling. But this fails miserably in the presence of traps. I think that approaches like XSE point at a more nuanced principle: “optimism in the face of cheap-to-resolve uncertainty, pessimism in the face of expensive-to-resolve uncertainty”. Following this principle doesn’t lead to actual Bayes-optimality, but perhaps it is in some sense a good enough approximation.
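To make the contrast concrete, here is a small sketch (my own illustration; the “pessimistic” index is a strawman, not a worked-out algorithm) of the standard UCB1 index next to a trap-aware variant that is optimistic only about arms whose worst case is known to be recoverable.

```python
import math

# Standard UCB1 index: optimism in the face of uncertainty.
def ucb1_index(mean_reward: float, pulls: int, t: int) -> float:
    return mean_reward + math.sqrt(2.0 * math.log(t) / pulls)

# Trap-aware variant (hypothetical illustration): be optimistic only when the
# uncertainty about this arm is cheap to resolve, i.e. when its worst plausible
# downside is still recoverable; otherwise assume the worst.
def trap_aware_index(mean_reward: float, pulls: int, t: int,
                     worst_case_loss: float, recoverable_loss: float) -> float:
    bonus = math.sqrt(2.0 * math.log(t) / pulls)
    if worst_case_loss <= recoverable_loss:
        return mean_reward + bonus   # cheap-to-resolve: explore optimistically
    return mean_reward - bonus       # expensive-to-resolve: stay pessimistic

if __name__ == "__main__":
    print(ucb1_index(0.5, pulls=3, t=100))
    print(trap_aware_index(0.5, pulls=3, t=100,
                           worst_case_loss=10.0, recoverable_loss=1.0))
```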