Unsurprisingly, the agent learns to first explore and then exploit the best arm. This is a simple consequence of the fact that you have to look at observations to figure out what to do; it's no different from the fact that a DQN playing Pong looks at where the ball is in order to decide what action to take.
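To make that concrete, here's a minimal sketch (not the setup from the post; the arm probabilities and phase lengths are made up for illustration): a two-armed bandit agent whose action at each step is a function of the reward observations it has collected so far, so explore-then-exploit behavior is just what conditioning on observations looks like.

```python
import random

def run_bandit(arm_probs, horizon=100, explore_steps=10, seed=0):
    """Toy bandit agent: try each arm in turn for a while, then commit to the
    arm with the best observed mean reward. The action at each step depends
    on past observations (rewards seen so far) -- that's the whole point."""
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)
    totals = [0.0] * len(arm_probs)
    total_reward = 0.0
    for t in range(horizon):
        if t < explore_steps:
            # Explore: cycle through the arms regardless of what we've seen.
            arm = t % len(arm_probs)
        else:
            # Exploit: pick the arm with the highest empirical mean reward.
            means = [totals[i] / max(counts[i], 1) for i in range(len(arm_probs))]
            arm = max(range(len(arm_probs)), key=lambda i: means[i])
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        total_reward += reward
    return total_reward

# Hypothetical arm success probabilities, purely for illustration.
print(run_bandit([0.2, 0.8]))
```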
Fwiw, I agree with this, and I think it's the same point I made in my comment on the post: this is a necessary consequence of the RL algorithm only updating the model after it takes actions.
I didn’t understand what you meant by “requires learning”, but yeah I think you are in fact saying the same thing.
I had a similar confusion when I first read Evan's comment. I think the thing that obscures this discussion is the extent to which the word 'learning' is overloaded, so I'd vote we taboo the term and use more concrete language.