I think there is an interesting subproblem here: given a policy (an RL agent), determine the belief state of this policy. I haven’t thought much about this, but it seems reasonable to look at belief states (distributions over environments) w.r.t. which the policy satisfies a sufficiently strong regret bound. This might still leave a lot of ambiguity, which would have to be regularized somehow.
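To make the idea a bit more concrete, here is a minimal toy sketch, not a worked-out proposal. Everything in it is an assumption made for illustration: the environments are one-shot problems given by a reward per action, the "policy" is just a fixed distribution over actions, the regret bound is a hand-picked threshold `eps`, beliefs are enumerated on a coarse grid, and maximum entropy stands in for the unspecified regularizer. All function names are hypothetical.

```python
import itertools
import math


def per_env_regret(policy, env_rewards):
    """Regret of a fixed action distribution in each environment:
    best achievable reward minus the policy's expected reward."""
    regrets = []
    for rewards in env_rewards:
        expected = sum(p * r for p, r in zip(policy, rewards))
        regrets.append(max(rewards) - expected)
    return regrets


def simplex_grid(n, steps):
    """All beliefs over n environments whose entries are multiples of 1/steps."""
    for counts in itertools.product(range(steps + 1), repeat=n):
        if sum(counts) == steps:
            yield tuple(c / steps for c in counts)


def bayes_regret(belief, regrets):
    """Expected regret when the environment is drawn from the belief."""
    return sum(b * r for b, r in zip(belief, regrets))


def entropy(belief):
    """Shannon entropy, used here as one (assumed) choice of regularizer."""
    return -sum(b * math.log(b) for b in belief if b > 0)


def consistent_beliefs(regrets, eps, steps=10):
    """Grid beliefs w.r.t. which the policy's Bayes regret is at most eps."""
    return [
        b for b in simplex_grid(len(regrets), steps)
        if bayes_regret(b, regrets) <= eps + 1e-12  # tolerance for float error
    ]


# Toy example: three candidate environments, two actions each.
env_rewards = [
    (1.0, 0.0),  # action 0 is best
    (0.0, 1.0),  # action 1 is best
    (0.5, 0.5),  # actions are equivalent
]
policy = (0.9, 0.1)  # mostly plays action 0

regrets = per_env_regret(policy, env_rewards)
feasible = consistent_beliefs(regrets, eps=0.2)
most_spread = max(feasible, key=entropy)  # resolve the ambiguity by max entropy
```

Because the Bayes regret is linear in the belief, the consistent beliefs form a whole polytope rather than a single point, which is the ambiguity mentioned above. In this toy case any belief that puts little weight on the second environment passes, so some extra criterion is needed to single one out.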