Actually, this desideratum makes no sense: if you have two ultra-POMDPs P and Q such that Q is a refinement of P, then they can be true hypotheses simultaneously, but in general there is no policy optimal for both (much less robustly optimal). We could require that, if P is a true hypothesis, then the algorithm converges to a robustly optimal policy for some refinement of P. This doesn’t imply low regret, but we can make it an independent requirement. However, I’m not sure how useful that is. Let’s look at a simple example.
There are two observations O:={p,q} and two actions A:={a,b}. Seeing observation p incurs a loss of 0 and seeing observation q incurs a loss of 1. Taking action a incurs a loss of 0 and taking action b incurs a loss of 0.1. These losses add up straightforwardly. For the maximal ignorance hypothesis ⊤, π∗:=“always take action a” is the only robustly optimal policy. However, π′:=“take action a unless you observed p, in which case take action b” is also optimal (for γ sufficiently large). Worse, π′ is robustly optimal for some refinements of ⊤: maybe taking action b causes you to get observation p instead of q.
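To make the claim about π′ concrete, here is a small brute-force check (my own sketch, not part of the original argument). It models ⊤ as Murphy freely choosing every observation, assumes the per-step loss is action loss plus observation loss, and uses a finite horizon with discount γ; under those assumptions both policies attain the same worst-case discounted loss, which is why π′ counts as optimal too.

```python
from itertools import product

# Per-step losses from the example: observation p costs 0, q costs 1;
# action a costs 0, action b costs 0.1.
OBS_LOSS = {"p": 0.0, "q": 1.0}
ACT_LOSS = {"a": 0.0, "b": 0.1}

def pi_star(history):
    """Always take action a."""
    return "a"

def pi_prime(history):
    """Take a, unless the previous observation was p; then take b."""
    return "b" if history and history[-1] == "p" else "a"

def worst_case_loss(policy, horizon=8, gamma=0.9):
    """Worst-case discounted loss when Murphy picks every observation."""
    worst = float("-inf")
    # For a deterministic policy, enumerating observation sequences covers
    # every adaptive strategy Murphy could use.
    for obs_seq in product("pq", repeat=horizon):
        total, history = 0.0, ()
        for t, obs in enumerate(obs_seq):
            action = policy(history)                                # act on past observations...
            total += gamma**t * (ACT_LOSS[action] + OBS_LOSS[obs])  # ...then observe
            history += (obs,)
        worst = max(worst, total)
    return worst

print(worst_case_loss(pi_star))   # ~5.70, i.e. the sum of 0.9^t for t < 8
print(worst_case_loss(pi_prime))  # same value: Murphy never gains by showing p
```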
Instead of ⊤, we can consider the law Λ which postulates that the environment is oblivious (i.e. the disjunction over the laws generated by all sequences in Oω). Now π∗ is the sole robustly optimal policy for any refinement of Λ! However, to express this Λ as an ultra-POMDP we need to introduce the notion of “Murphy observations” (i.e. Murphy doesn’t see the state directly, just like the agent). I’m not sure about the consistency of the robustness desideratum in this case.
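To see why obliviousness changes the picture, here is another sketch of mine in the same toy setup (it only checks the dominance of π∗ over π′ on fixed observation sequences, not the ultra-POMDP / Murphy-observation formalization). If the observation sequence is fixed in advance, π′’s occasional b’s can no longer change which observations arrive, so they only ever add 0.1; hence π∗ is at least as good on every oblivious hypothesis and strictly better on any sequence where a p is followed by another step.

```python
# Toy losses as before: observation p costs 0, q costs 1; action a costs 0, b costs 0.1.
OBS_LOSS = {"p": 0.0, "q": 1.0}
ACT_LOSS = {"a": 0.0, "b": 0.1}

def pi_star(history):
    return "a"

def pi_prime(history):
    return "b" if history and history[-1] == "p" else "a"

def loss_on_fixed_sequence(policy, obs_seq, gamma=0.9):
    """Discounted loss when the observation sequence is fixed (oblivious environment)."""
    total, history = 0.0, ()
    for t, obs in enumerate(obs_seq):
        total += gamma**t * (ACT_LOSS[policy(history)] + OBS_LOSS[obs])
        history += (obs,)
    return total

for seq in [("q",) * 6, ("p", "q") * 3]:
    print(seq,
          round(loss_on_fixed_sequence(pi_star, seq), 3),
          round(loss_on_fixed_sequence(pi_prime, seq), 3))
# On ("q","q",...): the two policies coincide.
# On ("p","q","p","q",...): pi_prime pays an extra 0.1 after every p, so pi_star is strictly better.
```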
Intuitively, I’m guessing the desideratum is consistent in this case if we allow randomization that is dogmatically unobservable by Murphy, but not otherwise. However, allowing such randomization destroys the ability to capture Newcombian scenarios (or maybe it still allows some types of Newcombian scenarios via some more complicated mechanism). One possible interpretation is: maybe sufficiently advanced agents are supposed to be superrational even in population games. If so, we cannot (and should not) require them to avoid dominated strategies.
More generally, a tentative takeaway is: maybe the IB agent sometimes has to touch the hot stove, because there is no such thing as “I learned that the hot stove always burns my hand”. For a sufficiently rich prior, it is always possible that there is some way in which touching the stove beneficially affects your environment (either causally or acausally), so you need to keep exploring. Unless you already have a perfect model of your environment: but in that case you won’t touch the stove hoping to be pleasantly surprised, because there are no surprises!