I sort of object to titling this post “Value Learning is only Asymptotically Safe” when the actual point you make is that we don’t yet have concrete optimality results for value learning other than asymptotic safety.
We should definitely be writing down sets of assumptions from which we can derive formal results about the expected behavior of an agent, but is there anything to aim for that is stronger than asymptotic safety?
In the case of value learning, given the generous assumption that “we somehow figured out how to design an agent which understood what constituted observational evidence of humanity’s reflectively-endorsed utility function”, it seems like you should be able to get a PAC-type bound, where by time T the agent is at most ϵ-suboptimal with probability 1−δ(ϵ,T), where δ(ϵ,T) is decreasing in both ϵ and T (see results on PAC bounds for Bayesian learning, which I haven’t actually looked at). That would give you bounds stronger than asymptotic optimality for value learning. Sadly, if you want your agent to actually behave well in general environments, you probably won’t get results better than asymptotic optimality, but if you’re happy to restrict yourself to MDPs, you probably can.
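Spelling out the shape of such a bound (the notation here is purely illustrative: V* for the value of the optimal policy, V_T for the value of the policy the agent follows at time T):

Pr[ V* − V_T ≤ ϵ ] ≥ 1 − δ(ϵ,T), with δ(ϵ,T) → 0 as T → ∞ for every fixed ϵ > 0.

Unlike a purely asymptotic guarantee, a statement of this form lets you read off a concrete confidence level at any finite time T.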
It is true that going beyond finite MDPs (more generally, beyond environments satisfying sufficient ergodicity assumptions) causes problems, but I believe it is possible to overcome them. For example, we can assume that there is a baseline policy (the advisor policy, in the case of DRL) such that the resulting trajectory in state space never diverges from the optimal trajectory (or, less ambitiously, from some “target” trajectory) by more than some “distance”, measured in terms of the time it would take to get back to the optimal trajectory, up to catastrophes.
Regarding “In the real world, this is usually impossible”: I think that in the real world, most superficially reasonable actions do not have irreversible consequences that are very important. So this assumption can hold within some approximation, and that should lead to a performance guarantee that is optimal within the accuracy of the approximation.
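To make that assumption concrete (just a sketch, with made-up notation): write π0 for the baseline policy, τ for the target trajectory, and d(s, τ) for the time it would take to get from state s back onto τ. The assumption is that there is some bound D such that, when the agent follows π0 and outside of catastrophes,

d(s, τ) ≤ D at every time step,

and the hoped-for guarantee would then be optimal up to terms depending on D and on the probability of the catastrophic exceptions.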
I sort of object to titling this post “Value Learning is only Asymptotically Safe” when the actual point you make is that we don’t yet have concrete optimality results for value learning other than asymptotic safety.
Doesn’t the cosmic ray example point to a strictly positive probability of dangerous behavior?
EDIT: Nvm I see what you’re saying. If I’m understanding correctly, you’d prefer, e.g. “Value Learning is not [Safe with Probability 1]”.
Thanks for the pointer to PAC-type bounds.