A couple notes:
The quadratic dependence on $a$ is almost certainly unnecessary; we didn't try very hard to reduce it. The likely route to a better bound: observe that, for the best arms, the average importance-sampled estimate of the risk has a low mean; show that the estimate is therefore subgaussian; and then apply a stochastic bandit algorithm designed for subgaussian tails.
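To gesture at what that could look like, here is a toy sketch (mine, not the post's algorithm). The assumed model: importance sampling from $\hat{q}_i$ rescales the tiny raw risk into a normalized risk $r_i = \text{risk}_i / \tau$, and each importance-sampled estimate is a $[0, a]$-valued random variable with mean $r_i$, where $a$ bounds the importance weights. An empirical-Bernstein bound then plays the role of the subgaussian assumption: a low-mean arm also has low variance, so its confidence width shrinks with a number of samples governed by $a$ rather than $1/\tau$. All names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (an assumption, not the post's setup): each arm i has a
# normalized risk r_i = risk_i / tau, and one importance-sampled
# estimate is a [0, a]-valued random variable with mean r_i, where a
# bounds the importance weights q_i(x) / p(x).

a = 8.0                                  # assumed weight bound
r = np.array([0.3, 0.5, 2.0, 6.0])       # hypothetical normalized risks (want r <= 1)

def estimates(i, n):
    """n importance-sampled estimates of arm i's normalized risk."""
    return a * rng.binomial(1, r[i] / a, size=n)

def risk_ucb(x, delta=1e-3):
    """Empirical-Bernstein upper confidence bound for [0, a]-valued samples.

    Width ~ sqrt(2 var log(1/delta) / n) + 3 a log(1/delta) / n: a
    low-mean arm also has low variance (var <= a * mean), so it is
    certified with n = O(a log(1/delta)) samples, no 1/tau dependence.
    """
    n = len(x)
    log_term = np.log(3 / delta)
    return x.mean() + np.sqrt(2 * x.var() * log_term / n) + 3 * a * log_term / n

n = 2000
for i in range(len(r)):
    print(f"arm {i}: normalized-risk UCB = {risk_ucb(estimates(i, n)):.2f}"
          " (plausibly safe if <= ~1)")
```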
If we just pulled an arm repeatedly to check that it's non-catastrophic, we'd get a dependence on $1/\tau$, which is huge; the main idea of the post is that we can get a dependence on $a$ instead of $1/\tau$. (See the arithmetic below.)
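To spell out where the $1/\tau$ comes from: under i.i.d. pulls of a single arm with catastrophe probability $\tau$,

$$\Pr[\text{no catastrophe in } n \text{ pulls}] = (1 - \tau)^n \ge 1 - n\tau,$$

so $n = \Omega(1/\tau)$ pulls are needed before we even expect to see one catastrophe, let alone estimate its rate; if $\tau$ were, say, $10^{-9}$, that would be a billion pulls per arm.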
(I meant sampling $x$ repeatedly from the distribution $\hat{q}_i$; I agree that sampling $x$ at random won't help identify rare catastrophes.)
The main qualitative difference from sampling from $\hat{q}_i$ is that we're targeting a specific tradeoff between catastrophes and reward, rather than zero probability of catastrophe. I agree that when $\tau = 0$ we're just sampling from $\hat{q}_i$.
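One way to write down the tradeoff being targeted (my formalization, not necessarily the post's):

$$i^* = \arg\max_i \ \mathrm{reward}(i) \quad \text{subject to} \quad \Pr_{x}[\text{catastrophe}(i, x)] \le \tau.$$

As $\tau \to 0$ the constraint dominates and the procedure degenerates into pure counterexample search, which is consistent with the $\tau = 0$ observation above.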