Happens all the time in decision theory & reinforcement learning: the average of many good plans is often a bad plan, and a bad plan followed to the end is often both more rewarding & informative than switching at every timestep between many good plans. Any kind of multi-modality or need for extended plans (eg due to upfront costs/investments) will do it, and exploration is quite difficult—just taking the argmax or adding some randomness to action choices is not nearly enough, you need “deep exploration” (as Osband likes to call it) to follow a specific hypothesis to its limit. This is why you have things like ‘posterior sampling’ (a generalization of Thompson sampling), where you randomly pick from your posterior over world-states and then follow the optimal strategy assuming that particular world-state. (I cover this a bit in two of my recent essays, on startups & socks.)
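For concreteness, here is a minimal Python sketch of the Thompson-sampling version on a 3-armed Bernoulli bandit (the arm probabilities, Beta(1,1) priors, horizon, and seed are my own illustrative choices, not from the text above): each step samples one hypothesis about the world from the posterior and then acts optimally as if that sample were true, rather than averaging over hypotheses or taking a noisy argmax.

```python
# A minimal sketch of posterior (Thompson) sampling on a Bernoulli bandit.
# The true arm probabilities and Beta(1,1) priors are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.3, 0.5, 0.6])       # hidden "world-state": per-arm win rates
alpha = np.ones(len(true_p))             # Beta posterior: successes + 1
beta = np.ones(len(true_p))              # Beta posterior: failures + 1

for t in range(2000):
    theta = rng.beta(alpha, beta)        # sample one hypothesis from the posterior
    arm = int(np.argmax(theta))          # act optimally *as if* that hypothesis were true
    reward = rng.random() < true_p[arm]  # observe a Bernoulli reward
    alpha[arm] += reward                 # posterior update for the pulled arm
    beta[arm] += 1 - reward

print(alpha / (alpha + beta))            # posterior means concentrate on the best arm
```

In the full RL setting (PSRL), the sampled world-model is held fixed for an entire episode before resampling, which is what turns this into deep exploration rather than per-step dithering.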
When there’s no clear winner, the winner can’t take all.
https://en.wikipedia.org/wiki/Winner-take-all_in_action_selection