It is true that going beyond finite MDPs (more generally, environments satisfying sufficient ergodicity assumptions) causes problems, but I believe it is possible to overcome them. For example, we can assume that there is a baseline policy (in the case of DRL, the advisor policy) such that the resulting trajectory in state space never diverges (except through catastrophes) from the optimal trajectory (or, less ambitiously, from some “target” trajectory) by more than some “distance”, measured as the time it would take to return to the optimal trajectory.
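To state the assumption a bit more formally (this is only a sketch; the notation here is mine, introduced for illustration): let $\mathcal{T}$ be the set of states visited by the target trajectory, and let $x^{\pi}_{t+\tau}$ be the state reached by following the baseline policy up to time $t$ and then following some policy $\pi$ for $\tau$ further steps. The requirement is that, on any history in which no catastrophe has occurred,

$$\forall t:\quad \min_{\pi}\,\min\{\tau \ge 0 : x^{\pi}_{t+\tau} \in \mathcal{T}\} \;\le\; D,$$

where $D$ is the “distance” bound measured in time steps.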
In the real world, this is usually impossible.
I think that in the real world, most superficially reasonable actions do not have important irreversible consequences. So the assumption can hold within some approximation, and this should lead to a performance guarantee that is optimal up to the accuracy of that approximation.
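To illustrate the kind of guarantee I have in mind (again only a sketch; $\epsilon$, $V^{*}_T$ and $V^{\pi}_T$ are notation I am introducing here, not a proven result): suppose the assumption holds only approximately, e.g. the return-time bound $D$ is violated on at most an $\epsilon$ fraction of time steps. Writing $V^{*}_T$ and $V^{\pi}_T$ for the total rewards collected by the optimal policy and by the agent over the first $T$ time steps, I would then hope for a bound roughly of the form

$$\limsup_{T\to\infty}\ \frac{1}{T}\big(V^{*}_T - V^{\pi}_T\big) \;\le\; O(\epsilon),$$

i.e. the agent is asymptotically optimal up to an $O(\epsilon)$ loss in long-run average reward, rather than exactly asymptotically optimal.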