Removing interrupted histories doesn’t debias
Safe interruptibility is essentially the problem of getting the agent not to learn from human interruptions: it should continue on as if it expected never to be interrupted again.
In an episodic task, one naive idea would be to simply delete histories which include interruptions. However, this can introduce a bias, as the following example shows:
In this MDP, a and b are actions, ∗ designates 'any action', the second term along an edge is the probability of following that edge given the stated action, and the third term, in bold, is the reward gained.
Not considering interruptions, Q(s0,b)=γ/2, and Q(s0,a)=γ/3, so the optimal action in s0 is b.
Now, suppose that every time the agent enters s3, an interruption occurs (with probability 1 for now) and the whole episode is deleted from the episode history. As a result, the empirical probability of going to state s4 from state s1 is 1, which leads to estimating Q(s1,∗)=2/3, and thus Q(s0,a)=2γ/3>Q(s0,b); so now the optimal action is to take action a in s0 (incidentally increasing the probability of ending up in s3 and being interrupted!).
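To make this concrete, here is a minimal Monte Carlo sketch of one MDP consistent with the Q-values quoted above. Since the figure is not reproduced here, the exact transition structure, the rewards, and the discount factor γ=0.9 are assumptions: from s0, action b collects reward 1/2 on the following step; action a leads to s1, from which any action reaches s4 with reward 2/3 (probability 1/2) or the interruption state s3 with reward 0 (probability 1/2). Deleting every episode that reaches s3 inflates the estimate of Q(s0,a) from roughly γ/3 to roughly 2γ/3.

```python
import random

GAMMA = 0.9  # discount factor (arbitrary, for illustration)

def run_episode(action):
    """Simulate one episode of the assumed MDP, starting in s0 with `action`.

    Returns (discounted_return_from_s0, interrupted_flag).
    The structure below is an assumption consistent with the Q-values
    quoted in the text, not the exact MDP of the original figure.
    """
    if action == 'b':
        # s0 --b--> s2, reward 1/2 on the next transition: Q(s0,b) = γ/2
        return GAMMA * 0.5, False
    # action 'a': s0 --a--> s1, then from s1 with any action:
    #   prob 1/2 -> s4 with reward 2/3, prob 1/2 -> s3 with reward 0
    if random.random() < 0.5:
        return GAMMA * (2.0 / 3.0), False   # reached s4
    return 0.0, True                        # reached s3: interruption occurs

def estimate_q(action, n, delete_interrupted):
    """Monte Carlo estimate of Q(s0, action), optionally deleting interrupted episodes."""
    returns = []
    for _ in range(n):
        g, interrupted = run_episode(action)
        if delete_interrupted and interrupted:
            continue  # the whole episode is removed from the history
        returns.append(g)
    return sum(returns) / len(returns)

if __name__ == "__main__":
    random.seed(0)
    n = 100_000
    for delete in (False, True):
        qa = estimate_q('a', n, delete)
        qb = estimate_q('b', n, delete)
        tag = "interrupted episodes deleted" if delete else "all episodes kept"
        print(f"{tag}: Q(s0,a) ≈ {qa:.3f}, Q(s0,b) ≈ {qb:.3f}, "
              f"best action = {'a' if qa > qb else 'b'}")
```

Under these assumptions the estimates come out near (γ/3, γ/2) when all episodes are kept and near (2γ/3, γ/2) when interrupted episodes are deleted, flipping the preferred action from b to a.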
This non-vanishing bias also arises if the interruption probability in state s3 is constant and strictly less than 1 but sufficiently close to it (how close depends on the discount factor); keeping the probability below 1 ensures that all states are still visited infinitely often.
Now, it might seem that this bias can be removed by debiasing the agent, as in off-policy Monte Carlo. This approach still has problems, though, which will be analysed in a forthcoming paper, "Off-policy Monte Carlo agents with variable behaviour policies".
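For completeness, here is a hedged sketch of the kind of off-policy correction alluded to above, on the same assumed toy MDP with a hypothetical constant interruption probability P_INT < 1 in s3: each surviving episode is reweighted by the inverse of its probability of surviving deletion, and the estimate is normalized by the sum of the weights (weighted importance sampling). In this toy case the reweighted estimate recovers γ/3, but one difficulty is already visible: as the interruption probability approaches 1 the weights, and hence the variance, blow up, and at probability exactly 1 the deleted trajectories are never observed at all, so no reweighting can recover them.

```python
import random

GAMMA = 0.9   # discount factor (arbitrary, for illustration)
P_INT = 0.9   # interruption probability in s3: hypothetical constant < 1

def run_episode_a():
    """One episode under action a in s0, for the same assumed MDP as above.

    Returns (discounted_return, visited_s3, interrupted).
    """
    if random.random() < 0.5:
        return GAMMA * (2.0 / 3.0), False, False       # s1 -> s4, reward 2/3
    return 0.0, True, random.random() < P_INT          # s1 -> s3, maybe interrupted

def estimates(n):
    """Estimate Q(s0,a) from surviving episodes only: naively, and with weighted IS.

    Each kept episode is reweighted by 1 / P(episode survives deletion):
    weight 1 if it avoided s3, weight 1/(1 - P_INT) if it visited s3.
    """
    kept_returns, num, den = [], 0.0, 0.0
    for _ in range(n):
        g, visited_s3, interrupted = run_episode_a()
        if interrupted:
            continue                                   # episode deleted from history
        kept_returns.append(g)
        w = 1.0 / (1.0 - P_INT) if visited_s3 else 1.0
        num += w * g
        den += w
    naive = sum(kept_returns) / len(kept_returns)
    return naive, num / den

if __name__ == "__main__":
    random.seed(0)
    naive, weighted = estimates(1_000_000)
    print(f"naive estimate over surviving episodes ≈ {naive:.3f}")
    print(f"weighted-IS estimate                   ≈ {weighted:.3f}")
    print(f"true value γ/3                         = {GAMMA / 3:.3f}")
```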