So I think that only works in environments which are both stationary and deterministic. Otherwise, you’d need to frontload an infinite number of trials for each arm to ensure you have precise estimates; anything less would incur infinite regret as there would be a nonzero probability you’d spend infinite time exploiting a suboptimal arm. This reminds me of conditional convergence, where you can’t always rearrange the components of a series and have it still converge to the same sum / at all. I think interleaving exploration and exploitation such that you minimize your regret is the best way to go here.
So I think that only works in environments which are both stationary and deterministic. Otherwise, you’d need to frontload an infinite number of trials for each arm to ensure you have precise estimates; anything less would incur infinite regret as there would be a nonzero probability you’d spend infinite time exploiting a suboptimal arm. This reminds me of conditional convergence, where you can’t always rearrange the components of a series and have it still converge to the same sum / at all. I think interleaving exploration and exploitation such that you minimize your regret is the best way to go here.