Happens all the time in decision theory & reinforcement learning: the average of many good plans is often a bad plan, and a bad plan followed to the end is often both more rewarding & informative than switching at every timestep between many good plans. Any kind of multi-modality or need for extended plans (eg due to upfront costs/investments) will do it, and exploration is quite difficult—just taking the argmax or adding some randomness to action choices is not nearly enough, you need “deep exploration” (as Osband likes to call it) to follow a specific hypothesis to its limit. This is why you have things like ‘posterior sampling’ (a generalization of Thompson sampling), where you randomly pick from your posterior over world-states and then follow the optimal strategy assuming that particular world-state. (I cover this a bit in two of my recent essays, on startups & socks.)
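For concreteness, here is a minimal Python sketch of the Thompson-sampling version on a 3-armed Bernoulli bandit (the arm probabilities, Beta(1,1) priors, horizon, and seed are my own illustrative choices, not from the text above): each step samples one hypothesis about the world from the posterior and then acts optimally as if that sample were true, rather than averaging over hypotheses or taking a noisy argmax.

```python
# A minimal sketch of posterior (Thompson) sampling on a Bernoulli bandit.
# The true arm probabilities and Beta(1,1) priors are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.3, 0.5, 0.6])       # hidden "world-state": per-arm win rates
alpha = np.ones(len(true_p))             # Beta posterior: successes + 1
beta = np.ones(len(true_p))              # Beta posterior: failures + 1

for t in range(2000):
    theta = rng.beta(alpha, beta)        # sample one hypothesis from the posterior
    arm = int(np.argmax(theta))          # act optimally *as if* that hypothesis were true
    reward = rng.random() < true_p[arm]  # observe a Bernoulli reward
    alpha[arm] += reward                 # posterior update for the pulled arm
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))            # posterior means concentrate on the best arm
```

In the full RL setting (PSRL), the sampled world-model is held fixed for an entire episode before resampling, which is what turns this into deep exploration rather than per-step dithering.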
When there’s no clear winner, the winner can’t take all.
https://en.wikipedia.org/wiki/Winner-take-all_in_action_selection