Planned summary for the Alignment Newsletter:

In environments without resets, an _asymptotically optimal_ agent is one that eventually acts optimally. (It might be the case that the agent first hobbles itself in a decidedly suboptimal way, but _eventually_ it will be rolling out the optimal policy _given_ its current hobbled position.) This paper points out that such agents must explore a lot: after all, it’s always possible that the very next timestep will be the one where chopping off your arm gives you maximal reward forever—how do you _know_ that’s not the case? Since it must explore so much, it is extremely likely that it will fall into a “trap”, where it can no longer get high reward: for example, maybe its actuators are destroyed.
More formally, the paper proves that when an asymptotically optimal agent acts, for any event, either that event eventually occurs, or after some finite time there is no longer any recognizable opportunity to cause the event to happen, even with low probability. Applying this to the event “the agent is destroyed”, we see that either the agent is eventually destroyed, or it eventually becomes _physically impossible_ for the agent to be destroyed, even by its own actions. Since the latter seems rather unlikely, we should expect that the agent is eventually destroyed.
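To make this slightly more concrete, here is a loose symbolic rendering of the statement above. This is my paraphrase, not the paper’s actual formalization: the threshold $\varepsilon$, the history notation, and the supremum over policies $\pi$ are stand-ins for “even with low probability” and “opportunity to cause the event”.

$$\Pr\Big[\, E \text{ occurs} \;\;\lor\;\; \exists T\ \forall t > T:\ \sup_{\pi}\, \Pr_{\pi}\!\big[E \mid h_{<t}\big] < \varepsilon \,\Big] = 1$$

Here $h_{<t}$ is the interaction history up to time $t$, and the supremum ranges over policies the agent could follow from that point on; the “recognizable” qualifier further restricts the opportunities to ones the agent’s world-models can see, which I’m glossing over.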
The authors suggest that safe exploration is not a well-defined problem, since you never know what’s going to happen when you explore, and they propose instead that agents should have their exploration guided by a mentor or <@parent@>(@Parenting: Safe Reinforcement Learning from Human Input@) (see also <@delegative RL@>(@Delegative Reinforcement Learning@), [avoiding catastrophes via human intervention](https://arxiv.org/abs/1707.05173), and [shielding](https://arxiv.org/abs/1708.08611) for more examples).
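As a rough illustration of the shared pattern behind these approaches, here is a minimal sketch. It is not the specific algorithm from any of the cited papers; the `agent`/`mentor` interfaces, the uncertainty measure, and the threshold are all hypothetical, made up for illustration.

```python
def act(agent, mentor, state, uncertainty_threshold=0.1):
    """Generic mentor-guided exploration step (illustrative sketch only).

    If the agent is too uncertain about the consequences of its own choice,
    it defers to the mentor and learns from the mentor's demonstration,
    rather than trying the action itself.
    """
    proposed_action = agent.policy(state)
    if agent.uncertainty(state, proposed_action) > uncertainty_threshold:
        # Defer: the mentor acts, and the agent treats this as a demonstration.
        action = mentor.choose_action(state)
        agent.record_demonstration(state, action)
    else:
        action = proposed_action
    return action
```

The key property is that the agent never has to try a potentially irreversible action itself in order to learn about it: exploration of uncertain territory is routed through the mentor. The approaches above differ mainly in how uncertainty is measured and when control is handed back.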
Planned opinion:
In my opinion on <@Safety Gym@>, I mentioned how a zero-violations constraint for safe exploration would require a mentor or parent that already satisfies the constraint; so in that sense I agree with this paper, which is simply making that statement more formal and precise.
Nonetheless, I still think there is a meaningful notion of exploration that can be done safely: once you have learned a model that you are reasonably confident in, you can find areas where the model is uncertain but where you are at least confident that exploring won’t have permanent negative repercussions, and explore there. For example, I often “explore” what foods I like, where I’m uncertain of how much I will like the food, but I’m quite confident that the food will not poison and kill me. (However, this notion of exploration is quite different from the notion of exploration typically used in RL, and might better be called “model-based exploration” or something like that.)
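To gesture at what this “model-based exploration” could look like, here is a minimal sketch. Everything in it is a stand-in rather than a concrete proposal: the ensemble-disagreement uncertainty measure, the learned `trap_model` of irreversibility, and the thresholds are all hypothetical.

```python
import numpy as np

def choose_exploratory_action(model_ensemble, trap_model, state, actions,
                              risk_threshold=0.01):
    """Explore where the model is uncertain, but only among actions the model
    confidently predicts are reversible (illustrative sketch only)."""
    candidates = []
    for action in actions:
        # Hypothetical learned estimate of the probability that this action
        # leads to an irreversible ("trap") state.
        risk = trap_model.predict(state, action)
        if risk < risk_threshold:
            # Uncertainty as disagreement among an ensemble of dynamics models.
            predictions = np.stack([m.predict_next_state(state, action)
                                    for m in model_ensemble])
            disagreement = predictions.var(axis=0).mean()
            candidates.append((disagreement, action))
    if not candidates:
        return None  # nothing confidently safe to explore; fall back to a known-safe policy
    # Among the confidently-safe actions, pick the one we are most uncertain about.
    return max(candidates, key=lambda c: c[0])[1]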