Three remarks:
“I might as well celebrate by touching the hot stove” is a misinterpretation of what happens. The reason the agent might touch the hot stove is that it’s exploring, i.e. investigating hypotheses according to which touching the hot stove is sometimes good. This is obviously necessary if you want learnability. The “hot stove” is just an example chosen to predispose us to interpret the behavior as incorrect. But, for example, scientists playing with chemicals might have also looked stupid to other people in the past, and yet it led to discovering chemistry and creating fertilizers, medications, plastics, etc.
The existence of a prior that’s simultaneously rich and tractable seems to me like a natural assumption. Mathematically, the bounded Solomonoff prior “almost” works: it is “only” intractable because P≠NP[1], i.e. for some very complicated reason that might easily fail in some variation, and in a universe with PSPACE oracles it just works. And, empirically, we know that simple algorithms (deep learning) can have powerful cross-domain abilities. It seems very likely there is a mathematical reason for this, which probably takes the form of such a prior.
Another way to effectively consider a larger space of hypotheses is Turing RL, where you’re essentially representing hypotheses symbolically: something that people certainly seem to use quite effectively.
[1] Roughly speaking; more precisely, it seems to be related to average-case complexity separations (e.g. one-way functions).
I still think that the hot stove example is a real problem, although maybe an unavoidable one. My example starts with “I learned that the hot stove always burns my hand.” This is not the exploration part anymore: the agent has already observed the stove burning its hand many times. Normally, this would be enough to never touch the hot stove again, but if some unexplained nice things happen in the outside world, there is suddenly no guarantee that the IB agent doesn’t start touching the stove again. Maybe this is unavoidable, but I maintain it’s a weird behavior pattern that the outside weather might make you lose the guarantee not to touch the stove you already learned is burning. I think it’s somewhat different from the truly unavoidable thing, namely that every agent needs to do some exploration and it sometimes leads to them burning their hand.
Well, it’s true that the IB regret bound doesn’t imply not touching the hot stove. But when I think of a natural example of an algorithm satisfying an IB regret bound, it is something like UCB: every once in a while it chooses a hypothesis which seems plausible given previous observations and follows its optimal policy. There is no reason such an algorithm would touch the hot stove, unless there is some plausible hypothesis according to which it is beneficial to touch the hot stove… assuming the optimal policy it follows is “reasonable”: see the definition of “robustly optimal policy” below.
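To make this intuition concrete, here is a minimal toy sketch of the kind of scheme I mean. It is my own illustration, not an actual infra-Bayesian algorithm: the hypothesis class, the plausibility test, and all thresholds are made up. The point is just that an optimism-style agent keeps a set of hypotheses, discards the ones whose predictions no longer match the data, and in each phase follows the optimal policy of the most promising surviving hypothesis, so it only keeps touching the stove while some surviving hypothesis claims that touching it is good.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Hypothesis:
    name: str
    optimal_action: str                       # action this hypothesis recommends
    predicted_reward: Callable[[str], float]  # reward it predicts for each action
    promised_value: float                     # how good things are if it is true

def plausible(h: Hypothesis, history: List[Tuple[str, float]], tol: float = 0.5) -> bool:
    """A hypothesis stays plausible while its worst prediction error is small."""
    return all(abs(h.predicted_reward(a) - r) < tol for a, r in history)

def ucb_like_agent(hypotheses: List[Hypothesis], true_reward: Callable[[str], float],
                   phases: int = 20, phase_len: int = 5) -> List[Tuple[str, float]]:
    history: List[Tuple[str, float]] = []
    for _ in range(phases):
        surviving = [h for h in hypotheses if plausible(h, history)] or hypotheses
        # Optimism: follow the optimal policy of the most promising plausible hypothesis.
        chosen = max(surviving, key=lambda h: h.promised_value)
        for _ in range(phase_len):
            action = chosen.optimal_action
            history.append((action, true_reward(action)))
    return history

# Toy instance: touching the stove always yields reward -1, waiting yields 0.
h_burns = Hypothesis("stove burns", "wait",
                     lambda a: -1.0 if a == "touch" else 0.0, promised_value=0.0)
h_magic = Hypothesis("stove secretly rewards touching", "touch",
                     lambda a: 1.0 if a == "touch" else 0.0, promised_value=1.0)
history = ucb_like_agent([h_burns, h_magic],
                         true_reward=lambda a: -1.0 if a == "touch" else 0.0)
# The agent touches the stove only in the first phase, while "stove secretly
# rewards touching" is still plausible; once that hypothesis is refuted it
# never touches the stove again.
```

Once the data has refuted every hypothesis under which touching the stove is good, such an algorithm has no remaining reason to touch it; whether the IB regret bound by itself forces this behavior is exactly the question below.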
The interesting question is: can we come up with a natural formal desideratum, strictly stronger than IB regret bounds, that would rule out touching the hot stove? Some ideas:
Let’s call a policy “robustly optimal” for an ultra-POMDP if it is the limit of policies optimal for this ultra-POMDP with added noise (compare to the notion of “trembling-hand equilibrium” in game theory); a formal sketch of this definition and the desiderata below appears right after this list of ideas.
Require that your learning algorithm converges to the robustly optimal policy for the true hypothesis. Converging to the exact optimal policy is a desideratum that is studied in the RL theory literature (under the name “sample complexity”).
Alternatively, require that your learning algorithm converges to the robustly optimal policy for a hypothesis close to the true hypothesis under some metric (with the distance going to 0).
Notice that this desideratum implies an upper regret bound, so it is strictly stronger.
Conjecture: the robustly Bayes-optimal policy has the property above, or at least has it whenever it is possible to have it.
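To pin the above down a bit, here is one possible formalization (my own rendering; the noise model and the metric on policies are assumptions on my part, not taken from anywhere):

$$\Pi^{\mathrm{rob}}(M) := \left\{ \pi \;\middle|\; \exists\, \epsilon_k \to 0,\ \pi_k \to \pi \ \text{such that}\ \pi_k \in \operatorname*{arg\,min}_{\pi'} L_{M^{\epsilon_k}}(\pi') \right\}$$

where $M^{\epsilon}$ denotes the ultra-POMDP $M$ with added noise (e.g. every transition/observation kernel mixed with the uniform distribution with weight $\epsilon$; exactly which noise model is intended is my assumption) and $L_{M}(\pi)$ is the worst-case expected discounted loss of $\pi$ in $M$. The first desideratum then reads: whenever $M$ is the true hypothesis, the policy $\pi_n$ played at time $n$ satisfies $d(\pi_n, \Pi^{\mathrm{rob}}(M)) \to 0$ for some metric $d$ on policies; the alternative version instead requires $d(\pi_n, \Pi^{\mathrm{rob}}(M_n)) \to 0$ for some sequence of hypotheses $M_n \to M$.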
Actually, this desideratum makes no sense: if you have two ultra-POMDPs P and Q s.t. Q is a refinement of P, then they can be true hypotheses simultaneously, but in general there is no policy optimal for both (much less robustly optimal). We could require that, if P is a true hypothesis, then the algorithm converges to a robustly optimal policy for some refinement of P. This doesn’t imply low regret, but we can make it an independent requirement. However, I’m not sure how useful that is. Let’s look at a simple example.
There are two observations O := {p, q} and two actions A := {a, b}. Seeing observation p incurs a loss of 0 and seeing observation q incurs a loss of 1. Taking action a incurs a loss of 0 and taking action b incurs a loss of 0.1. These losses add up straightforwardly. For the maximal ignorance hypothesis ⊤, π∗ := “always take action a” is the only robustly optimal policy. However, π′ := “take action a unless you observed p, in which case take action b” is also optimal (for γ sufficiently large). Worse, π′ is robustly optimal for some refinements of ⊤: maybe taking action b causes you to get observation p instead of q.
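For concreteness, here is a small worst-case calculation (my own sketch; the adversarial value iteration below is just one way to evaluate policies against ⊤, with an arbitrary γ = 0.99) confirming that π∗ and π′ have the same worst-case discounted loss, i.e. that π′ is also optimal despite its dominated clause:

```python
# Worst-case discounted loss of a memory-1 policy against an adversary ("Murphy")
# who picks each observation to maximize the agent's loss, as under the maximal
# ignorance hypothesis ⊤. States are the agent's memory: the last observation.
GAMMA = 0.99
OBS_LOSS = {"p": 0.0, "q": 1.0}
ACT_LOSS = {"a": 0.0, "b": 0.1}

def worst_case_loss(policy, sweeps=5000):
    """policy maps the last observation (None at the start) to an action."""
    states = [None, "p", "q"]
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):  # value iteration; converges since GAMMA < 1
        v = {s: ACT_LOSS[policy(s)]
                + max(OBS_LOSS[o] + GAMMA * v[o] for o in ("p", "q"))
             for s in states}
    return v[None]

pi_star = lambda last: "a"                            # always take a
pi_prime = lambda last: "b" if last == "p" else "a"   # take b after seeing p

print(worst_case_loss(pi_star))   # ≈ 1/(1-GAMMA) = 100: Murphy always shows q
print(worst_case_loss(pi_prime))  # also ≈ 100: p is never shown, so b is never taken
```

Murphy’s best move against either policy is to show q forever, so the clause “take b after p” never fires and costs nothing in the worst case. Under the noise used to define robust optimality, p presumably does occur with some small probability, at which point π′ pays the extra 0.1 and only π∗ survives.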
Instead of ⊤, we can consider the law Λ which postulates that the environment is oblivious (i.e. the disjunction over the laws generated by all sequences in O^ω). Now π∗ is the sole robustly optimal policy of any refinement of Λ! However, to express this Λ as an ultra-POMDP we need to introduce the notion of “Murphy observations” (i.e. Murphy doesn’t see the state directly, just like the agent). I’m not sure about the consistency of the robustness desideratum in this case.
Intuitively, I’m guessing it is consistent if we allow randomization that is dogmatically unobservable by Murphy, but not otherwise. However, allowing such randomization destroys the ability to capture Newcombian scenarios (or maybe it still allows some types of Newcombian scenarios via some more complicated mechanism). One possible interpretation is: maybe sufficiently advanced agents are supposed to be superrational even in population games, and if so, we cannot (and should not) require them to avoid dominated strategies.
More generally, a tentative takeaway is: Maybe the IB agent has to sometimes touch the hot stove, because there is no such thing as “I learned that the hot stove always burns my hand”. For a sufficiently rich prior, it is always possible there is some way in which touching the stove beneficially affects your environment (either causally or acausally) so you need to keep exploring. Unless you already have a perfect model of your environment: but in this case you won’t touch the stove when pleasantly surprised, because there are no surprises!