So the AI only takes action a from state s if it has already seen the human do that? If so, that seems like the root of all the safety guarantees to me.
Not quite. The AI starts with some prior over (environment, advisor policy) pairs and updates it on incoming observations. It may take an action if, given its current belief state, it is sufficiently confident that the action is one the advisor could take. The confidence threshold is controlled by the parameter η, which has an optimal value for achieving the best regret bound (as γ→1, η→0; in other words, the more long-term the plan, the more cautious the AI becomes; obviously catastrophes modify this trade-off). That is, the AI generalizes from what it has already observed rather than requiring the exact same state to repeat itself.

Indeed, if we required the exact same state to repeat itself, the regret bound would scale with the number of states. Instead, it scales with the number of hypotheses (of course we can also derive a "structural" / "non-uniform" version for a countable number of hypotheses). I am also fairly sure we can derive a regret bound that scales with the RVO and MB dimensions (I also think the MB dimension can be replaced by prior entropy, but so far I haven't been able to prove it), which can be bounded either in terms of the number of hypotheses or in terms of the numbers of states and actions, and can remain small even when both the number of hypotheses and the number of states are large.
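To make the delegation rule concrete, here is a minimal toy sketch of the mechanism described above, under assumptions I am adding for illustration: a finite hypothesis class where each hypothesis carries an explicit advisor policy, a straightforward Bayesian update, and an action-selection rule that acts only when the posterior probability that the advisor could take the action exceeds η, delegating otherwise. The names (`Hypothesis`, `choose_action`, `eta`, etc.) are mine, and this is a caricature of the rule in the comment, not the actual algorithm or bound from the paper.

```python
import numpy as np

class Hypothesis:
    """Toy (environment, advisor policy) hypothesis; only the advisor part is used here."""
    def __init__(self, advisor_policy):
        # advisor_policy[state][action] = probability the advisor takes `action` in `state`
        self.advisor_policy = advisor_policy

def posterior_update(prior, hypotheses, state, observed_advisor_action):
    """Bayesian update of the belief over hypotheses after observing the advisor act."""
    likelihoods = np.array([
        h.advisor_policy[state][observed_advisor_action] for h in hypotheses
    ])
    unnormalized = prior * likelihoods
    return unnormalized / unnormalized.sum()

def choose_action(posterior, hypotheses, state, candidate_values, eta):
    """
    Take the highest-value candidate action whose posterior probability of being
    an action the advisor could take exceeds eta; return None to delegate
    (i.e. let the advisor act this step) if no candidate is confident enough.
    """
    best_action, best_value = None, -np.inf
    for action, value in candidate_values.items():
        # Posterior mass on hypotheses under which the advisor takes this action
        # with nonzero probability in this state.
        p_advisor = sum(
            w * (h.advisor_policy[state][action] > 0.0)
            for w, h in zip(posterior, hypotheses)
        )
        if p_advisor >= eta and value > best_value:
            best_action, best_value = action, value
    return best_action
```

In this sketch the η/γ trade-off shows up only as a knob: a smaller η makes the agent act more often on its own judgment, a larger η makes it delegate more, which is the "more long-term, more cautious" direction when η is tuned as γ→1.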