Well, it’s true that the IB regret bound doesn’t imply not touching the hot stove. But when I think of a natural example of an algorithm satisfying an IB regret bound, it is something like UCB: every once in a while it chooses a hypothesis which seems plausible given previous observations and follows its optimal policy. There is no reason such an algorithm would touch the hot stove, unless there is some plausible hypothesis according to which it is beneficial to touch the hot stove… assuming the optimal policy it follows is “reasonable”: see the definition of “robustly optimal policy” below.
The interesting question is: can we come up with a natural formal desideratum, strictly stronger than IB regret bounds, that would rule out touching the hot stove? Some ideas:
Let’s call a policy “robustly optimal” for an ultra-POMDP if it is the limit of policies optimal for this ultra-POMDP with added noise (compare to the notion of “trembling-hand equilibrium”).
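One way to make this precise (just a sketch: the exact noise model, denoted P_ε below, and the use of worst-case expected loss over the credal set are my own assumptions, not something specified here):

$$\pi \text{ is robustly optimal for } P \;\iff\; \exists\, \epsilon_n \to 0,\ \pi_n \to \pi \ \text{ s.t. } \ \pi_n \in \arg\min_{\pi'} \overline{L}_{P_{\epsilon_n}}(\pi'),$$

where P_ε stands for P with an ε-portion of uniform noise mixed into its dynamics, and $\overline{L}_Q(\pi')$ is the worst-case (over the credal set of Q) expected discounted loss of π′.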
Require that your learning algorithm converges to the robustly optimal policy for the true hypothesis. Converging to the exact optimal policy is a desideratum that is studied in the RL theory literature (under the name “sample complexity”).
Alternatively, require that your learning algorithm converges to the robustly optimal policy for a hypothesis close to the true hypothesis under some metric (with the distance going to 0).
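Spelled out, this version might read something like (again a sketch: the metrics on hypotheses and on policies are deliberately left abstract):

$$\text{if } P^{\star} \text{ is the true hypothesis, then } \exists\, P_t \ \text{ with } \ d(P_t, P^{\star}) \to 0 \ \text{ s.t. } \ d\big(\pi_t,\ \Pi^{\mathrm{rob}}(P_t)\big) \to 0,$$

where π_t is the policy the algorithm follows at time t and Π^rob(P) is the set of robustly optimal policies for P.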
Notice that this desideratum implies an upper bound on the regret, so it is strictly stronger.
Conjecture: the robustly Bayes-optimal policy has the property above, or at least has it whenever it is possible to have it.
Actually, this desideratum makes no sense: if you have two ultra-POMDPs P and Q s.t. Q is a refinement of P, then they can be true hypotheses simultaneously, but in general there is no policy optimal for both (much less robustly optimal). We could require that, if P is a true hypothesis, then the algorithm converges to a robustly optimal policy for some refinement of P. This doesn’t imply low regret, but we can make it an independent requirement. However, I’m not sure how useful that is. Let’s look at a simple example.
There are two observations O:={p,q} and two actions A:={a,b}. Seeing observation p incurs a loss of 0 and seeing observation q incurs a loss of 1. Taking action a incurs a loss of 0 and taking action b incurs a loss of 0.1. These losses add up straightforwardly. For the maximal ignorance hypothesis ⊤, π∗:=“always take action a” is the only robustly optimal policy. However, π′:=“take action a unless you observed p, in which case take action b” is also optimal (for γ sufficiently large). Worse, π′ is robustly optimal for some refinements of ⊤: maybe taking action b causes you to get observation p instead of q.
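To make the example concrete, here is a small brute-force sketch (my own illustration, with several simplifying assumptions: a short horizon, Murphy restricted to oblivious observation sequences, “added noise” modeled as each intended observation being flipped with probability ε, and π′ read as conditioning on the most recent observation):

```python
from itertools import product

OBS_LOSS = {"p": 0.0, "q": 1.0}   # loss for seeing each observation
ACT_LOSS = {"a": 0.0, "b": 0.1}   # loss for taking each action
GAMMA = 0.9                        # discount factor
T = 6                              # short horizon, enough to see the effect

def pi_star(history):
    # always take action a
    return "a"

def pi_prime(history):
    # take action a unless the most recent observation was p, then take b
    return "b" if history and history[-1] == "p" else "a"

def run_loss(policy, obs_seq):
    # total discounted loss of a policy against a fixed observation sequence
    total, history = 0.0, []
    for t, o in enumerate(obs_seq):
        a = policy(history)        # action depends only on past observations
        total += GAMMA ** t * (ACT_LOSS[a] + OBS_LOSS[o])
        history.append(o)
    return total

def worst_case_loss(policy, eps=0.0):
    # Murphy picks an oblivious "intended" sequence; with probability eps each
    # intended observation is flipped to the other one (the trembling noise)
    worst = 0.0
    for intended in product("pq", repeat=T):
        expected = 0.0
        for flips in product((False, True), repeat=T):
            prob, realized = 1.0, []
            for o, f in zip(intended, flips):
                prob *= eps if f else 1.0 - eps
                realized.append(("q" if o == "p" else "p") if f else o)
            expected += prob * run_loss(policy, realized)
        worst = max(worst, expected)
    return worst

for eps in (0.0, 0.05):
    print(f"eps={eps}: pi* -> {worst_case_loss(pi_star, eps):.4f}, "
          f"pi' -> {worst_case_loss(pi_prime, eps):.4f}")
```

Without noise both policies attain the same worst-case loss (Murphy plays q forever), so both are optimal for ⊤; with any ε > 0 the extra 0.1 that π′ pays whenever p slips through makes it strictly worse, so only π∗ survives the limit ε → 0.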
Instead of ⊤, we can consider the law Λ which postulates that the environment is oblivious (i.e. the disjunction over the laws generated by all sequences in Oω). Now π∗ is the sole robustly optimal policy of any refinement of Λ! However, to express Λ as an ultra-POMDP we need to introduce the notion of “Murphy observations” (i.e. Murphy doesn’t see the state directly, just like the agent). I’m not sure about the consistency of the robustness desideratum in this case.
Intuitively, I’m guessing it is consistent if we allow randomization that is dogmatically unobservable by Murphy but not otherwise. However, allowing such randomization destroys the ability to capture Newcombian scenarios (or maybe it still allows some types of Newcombian scenarios by some more complicated mechanism). One possible interpretation is: Maybe sufficiently advanced agents are supposed to be superrational even in population games. If so, we cannot (and should not) require them to avoid dominated strategies.
More generally, a tentative takeaway is: Maybe the IB agent has to sometimes touch the hot stove, because there is no such thing as “I learned that the hot stove always burns my hand”. For a sufficiently rich prior, it is always possible that there is some way in which touching the stove beneficially affects your environment (either causally or acausally), so you need to keep exploring. Unless you already have a perfect model of your environment: but in that case you won’t touch the stove when pleasantly surprised, because there are no surprises!