You got me curious, so I did some searching. This paper gives fairly tight bounds in the case where the payoffs are adaptive (i.e. can change in response to your previous actions) but bounded. The algorithm is on page 5.
Thanks for the link. Their algorithm, the “multiplicative update rule,” which works by “selecting each arm randomly with probabilities that evolve based on their past performance,” does not seem to me to be the same strategy that Eliezer describes. So does this contradict his argument?
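For readers who want something concrete, here is a minimal sketch of what a multiplicative update rule for bandits can look like. It follows the general Exp3 pattern (exponential weights with importance-weighted reward estimates) and is not necessarily the exact algorithm on page 5 of the paper. The names `exp3`, `reward_fn`, and `gamma` are my own, and rewards are assumed to lie in [0, 1].

```python
import math
import random

def exp3(num_arms, rounds, reward_fn, gamma=0.1, rng=None):
    """Exp3-style sketch: arms are drawn at random, and the drawing
    probabilities evolve multiplicatively with observed rewards.

    reward_fn(t, arm) is a hypothetical callback returning a reward
    in [0, 1]; it may depend on the round, so payoffs can change over time.
    """
    rng = rng or random.Random(0)
    weights = [1.0] * num_arms
    pulls = [0] * num_arms

    def probabilities():
        total = sum(weights)
        # Mix the weight-based distribution with uniform exploration.
        return [(1 - gamma) * w / total + gamma / num_arms for w in weights]

    for t in range(rounds):
        probs = probabilities()
        arm = rng.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(t, arm)
        # Importance weighting: only the pulled arm is observed, so its
        # reward is divided by the probability of having pulled it.
        estimate = reward / probs[arm]
        # The multiplicative update itself.
        weights[arm] *= math.exp(gamma * estimate / num_arms)
        # Rescale so the largest weight is 1; this avoids overflow and
        # leaves the probabilities unchanged.
        top = max(weights)
        weights = [w / top for w in weights]
        pulls[arm] += 1

    return pulls, probabilities()

# Example: arm 1 always pays 1, every other arm pays 0.
pulls, probs = exp3(3, 2000, lambda t, arm: 1.0 if arm == 1 else 0.0)
```

Note that the choice of arm stays random at every round, even once one arm is clearly ahead, which is the feature that seems to differ from the strategy under discussion.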
Yes.