A couple notes:
The quadratic dependence on $a$ is almost certainly unnecessary; we didn't try very hard to reduce it. The likely route to a better bound: observe that, for the best arms, the average importance-sampled estimate of the risk has a low mean; show that the estimate is therefore subgaussian; and then apply a stochastic bandit algorithm designed for subgaussian tails.
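To gesture at what that could look like, here is a toy sketch (mine, not the post's algorithm). The assumed model: importance sampling from $\hat{q}_i$ rescales the tiny raw risk into a normalized risk $r_i = \text{risk}_i / \tau$, and each importance-sampled estimate is a $[0, a]$-valued random variable with mean $r_i$, where $a$ bounds the importance weights. An empirical-Bernstein bound then plays the role of the subgaussian assumption: a low-mean arm also has low variance, so its confidence width shrinks with a number of samples governed by $a$ rather than $1/\tau$. All names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (an assumption, not the post's setup): each arm i has a
# normalized risk r_i = risk_i / tau, and one importance-sampled
# estimate is a [0, a]-valued random variable with mean r_i, where a
# bounds the importance weights q_i(x) / p(x).

a = 8.0                                  # assumed weight bound
r = np.array([0.3, 0.5, 2.0, 6.0])       # hypothetical normalized risks (want r <= 1)

def estimates(i, n):
    """n importance-sampled estimates of arm i's normalized risk."""
    return a * rng.binomial(1, r[i] / a, size=n)

def risk_ucb(x, delta=1e-3):
    """Empirical-Bernstein upper confidence bound for [0, a]-valued samples.

    Width ~ sqrt(2 var log(1/delta) / n) + 3 a log(1/delta) / n: a
    low-mean arm also has low variance (var <= a * mean), so it is
    certified with n = O(a log(1/delta)) samples, no 1/tau dependence.
    """
    n = len(x)
    log_term = np.log(3 / delta)
    return x.mean() + np.sqrt(2 * x.var() * log_term / n) + 3 * a * log_term / n

n = 2000
for i in range(len(r)):
    print(f"arm {i}: normalized-risk UCB = {risk_ucb(estimates(i, n)):.2f}"
          " (plausibly safe if <= ~1)")
```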
If we just pulled an arm repeatedly to check that it's non-catastrophic, we'd get a dependence on $1/\tau$, which is huge; the main idea of the post is that we can get a dependence on $a$ instead of $1/\tau$. (See the arithmetic below.)
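To spell out where the $1/\tau$ comes from: under i.i.d. pulls of a single arm with catastrophe probability $\tau$,

$$\Pr[\text{no catastrophe in } n \text{ pulls}] = (1 - \tau)^n \ge 1 - n\tau,$$

so $n = \Omega(1/\tau)$ pulls are needed before we even expect to see one catastrophe, let alone estimate its rate; if $\tau$ were, say, $10^{-9}$, that would be a billion pulls per arm.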
(I meant sampling $x$ repeatedly from the distribution $\hat{q}_i$; I agree that sampling $x$ at random won't help identify rare catastrophes.)
The main qualitative difference from sampling from $\hat{q}_i$ is that we're targeting a specific tradeoff between catastrophes and reward, rather than zero probability of catastrophe. I agree that when $\tau = 0$ we're just sampling from $\hat{q}_i$.
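One way to write down the tradeoff being targeted (my formalization, not necessarily the post's):

$$i^* = \arg\max_i \ \mathrm{reward}(i) \quad \text{subject to} \quad \Pr_{x}[\text{catastrophe}(i, x)] \le \tau.$$

As $\tau \to 0$ the constraint dominates and the procedure degenerates into pure counterexample search, which is consistent with the $\tau = 0$ observation above.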