Talking about yourself in the third person? :)
Cool paper!
Anyway, I’m a bit bothered by the theta parameter, the probability that the agent complies with the interruption command. If I understand correctly, you can make it converge to 1, but if it converges too quickly the agent learns a biased model of the world, while if it converges too slowly the agent is, of course, unsafe.
I’m not sure if this is just a technicality that can be circumvented or if it represents a fundamental issue: in order for the agent to learn what happens after the interruption switch is pressed, it must ignore the interruption switch with some non-negligible probability, which means that you can’t trust the interruption switch as a failsafe mechanism.
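To make that tension concrete, here’s a rough back-of-the-envelope sketch of my own (not from the paper), assuming a hypothetical compliance schedule theta_t = 1 - 1/t^alpha. The expected number of times the agent ignores the switch is sum_t (1 - theta_t); by Borel–Cantelli, if that sum converges the agent ignores the switch only finitely often (so it gets only a bounded amount of post-interruption experience), and if it diverges it ignores the switch infinitely often (so at no point can you fully trust the switch):

```python
def expected_ignores(alpha: float, horizon: int) -> float:
    """Sum of (1 - theta_t) for theta_t = 1 - 1/t**alpha, t = 1..horizon."""
    return sum(1.0 / t**alpha for t in range(1, horizon + 1))

for alpha in (0.5, 1.0, 2.0):
    for horizon in (10**3, 10**6):
        print(f"alpha={alpha}, T={horizon}: "
              f"expected ignores ~ {expected_ignores(alpha, horizon):.2f}")

# alpha <= 1: the sum grows without bound -- the agent keeps exploring what
# happens after an interruption, but it disobeys infinitely often.
# alpha > 1:  the sum stays bounded -- the switch becomes trustworthy, but the
# agent only ever collects a bounded amount of post-interruption data.
```

So under this toy schedule, at least, the dial only trades one problem for the other, which is exactly why I wonder whether it’s a fundamental issue rather than a technicality.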
Very interesting, thanks for sharing.