Planned summary for the Alignment Newsletter:
The argument for AI risk typically involves some point at which an AI system does something unexpectedly bad in a new situation that it hasn’t encountered before (as in e.g. a treacherous turn). One way to mitigate the risk is to detect new situations and ensure the AI system does something known to be safe in them, e.g. deferring to a human, or executing some handcoded safe baseline policy. Typical approaches involve a separate anomaly detection model. This paper asks: can we use the AI system itself to figure out when to defer to a mentor?
_The key insight is that if an AI system maintains a distribution over rewards, and “assumes the worst” about the reward in new situations, then simply by deferring to the mentor with higher probability when the mentor would get higher expected reward, it will end up deferring to the mentor in new situations._ Hence, the title: by making the agent pessimistic about unknown unknowns (new situations), we get a conservative agent that defers to its mentor in new situations.
This is formalized in an AIXI-like setting, where agents can have beliefs over all computable programs, and we consider an online learning setting in which there is a single trajectory over all time (i.e. no episodes). The math is fairly dense and I didn’t try to fully understand it, so my summary may be inaccurate. The agent maintains a belief over world models (which predict how the environment evolves and how reward is given) and mentor models (which predict what the mentor will do, where the mentor’s policy can depend on the **true** world model). It restricts attention to the most likely world models, taking just enough of them to cover a total probability mass of β (where β is a hyperparameter between 0 and 1). It computes the worst-case expected reward it could achieve across these world models, as well as the expected reward that the mentor would achieve. It is more likely to defer to the mentor when the mentor’s expected reward is higher (relative to its own worst-case reward).
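Here is a minimal sketch (my own, not code from the paper) of this decision rule, assuming a small finite set of candidate world models rather than all computable programs; the top-β set construction and the clipped “advantage” used as a deferral probability are illustrative simplifications, and all names and numbers below are made up.

```python
import random
from dataclasses import dataclass

@dataclass
class WorldModel:
    posterior: float      # agent's posterior probability of this world model
    agent_value: float    # expected reward if the agent follows its own best plan
    mentor_value: float   # expected reward if control is handed to the mentor

def top_beta_set(models, beta):
    """Smallest set of most-probable world models with total posterior mass >= beta."""
    ranked = sorted(models, key=lambda m: m.posterior, reverse=True)
    chosen, mass = [], 0.0
    for m in ranked:
        chosen.append(m)
        mass += m.posterior
        if mass >= beta:
            break
    return chosen

def defer_probability(models, beta):
    """Defer more when the mentor's expected reward beats the agent's
    worst-case (pessimistic) reward over the top-beta world models."""
    pessimistic_set = top_beta_set(models, beta)
    agent_worst_case = min(m.agent_value for m in pessimistic_set)
    mentor_expected = sum(m.posterior * m.mentor_value for m in models)
    advantage = mentor_expected - agent_worst_case
    # Simplistic mapping of the mentor's advantage to a probability of deferring.
    return min(1.0, max(0.0, advantage))

# Toy usage: a novel situation shows up as a low-probability world model in which
# acting autonomously could be catastrophic; it drags down the worst case and
# pushes the agent toward deferring.
models = [
    WorldModel(posterior=0.6, agent_value=1.0, mentor_value=0.8),
    WorldModel(posterior=0.3, agent_value=0.9, mentor_value=0.8),
    WorldModel(posterior=0.1, agent_value=-5.0, mentor_value=0.8),  # the "new situation"
]
p = defer_probability(models, beta=0.95)
action = "defer to mentor" if random.random() < p else "act autonomously"
print(f"P(defer) = {p:.2f} -> {action}")
```

With these toy numbers, the low-probability third model drags the pessimistic value far below the mentor’s expected reward, so the agent defers; shrinking β so that the scary model falls outside the top set makes it act autonomously instead.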
Such an agent queries the mentor only finitely many times, and eventually takes actions that are at least as good as the ones the mentor would have taken. In addition, for any event with some bound on complexity, we can set things up (e.g. by choosing a high enough β) so that with high probability the agent never causes the event to occur unless the mentor has already caused it to occur at some point in the past. For example, with high probability the agent will never push the big red button in the environment unless it has seen the mentor push the big red button before.
Planned opinion:
I think it is an underrated point that in some sense all we need to do to avoid x-risk is to make sure AI systems don’t do crazy high-impact things in new situations, and that risk aversion is one way to get such an agent. This is also how <@Inverse Reward Design@> gets its safety properties: when faced with a completely new “lava” tile that the agent has never seen before, the paper’s technique only infers that it should be _uncertain_ about the tile’s reward. However, the _expected_ reward is still 0, and to get the agent to actually avoid the lava you need to use risk-averse planning.
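As a toy illustration of that last point (made-up numbers, not from the IRD paper): if the posterior over the lava tile’s reward is symmetric around 0, expected-value planning is indifferent between the lava and a known-safe tile, while a risk-averse (here, worst-case) evaluation avoids the lava.

```python
# Hypothetical posterior samples for the reward of the unseen "lava" tile
# versus a tile known to be safe.
lava_reward_samples = [-10.0, 0.0, 10.0]  # mean 0, but highly uncertain
safe_reward_samples = [0.0, 0.0, 0.0]     # known to be 0

def expected(samples):
    return sum(samples) / len(samples)

def worst_case(samples):
    return min(samples)

# Expected-value planning is indifferent between the two tiles (0 vs 0)...
print(expected(lava_reward_samples), expected(safe_reward_samples))
# ...but a risk-averse (worst-case) evaluation avoids the lava (-10 vs 0).
print(worst_case(lava_reward_samples), worst_case(safe_reward_samples))
```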
The case for pessimism is similar to the case for impact measures, and similar critiques apply: it is not clear that we can get a value-agnostic method that is both sufficiently safe to rule out all catastrophes and sufficiently useful to replace other AI techniques. The author himself points out that if we set β high enough to be confident the agent is safe, it may end up always deferring to the mentor, and so not actually be of any use. Nonetheless, I think it’s valuable to point out methods like this that seem to confer nice safety properties on our agents, even if they can’t be pushed to the extremes for fear of making the agents useless.