You’re insanely fast. It’s quite humbling. I’ve been doing a course on Reinforcement Learning based on the same book, so I’ve been reading it, too. Got an exam coming up soon. Didn’t expect to see it mentioned on LW; I didn’t feel like it was on the same level as the stuff in MIRI’s research guide, though I do agree that it’s pretty good.
One thing I found a bit irritating was that, in the multi-armed bandit chapter, it never mentioned what seems like the obvious approach: do all the exploring at the beginning, then only exploit for the remainder of the time (given that the problem is stationary). This should be far better than being ϵ-greedy. My approach would have been to start from that idea and see how it can be optimized. In a talk on AI Alignment, Eliezer actually briefly mentions this setting, and that’s exactly the approach he says is best.
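For concreteness, here’s a rough sketch of what I had in mind (my own illustration, not from the book), assuming Bernoulli arms and an arbitrary fixed exploration budget per arm:

```python
import numpy as np

def explore_then_commit(true_means, explore_per_arm, horizon, seed=None):
    """Explore-then-commit on a stationary Bernoulli bandit (illustrative only).

    Pull every arm `explore_per_arm` times, then commit to the arm with the
    highest empirical mean for the rest of the horizon.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)
    estimates = np.zeros(k)
    total_reward = 0.0

    # Exploration phase: round-robin over the arms.
    t = 0
    while t < min(horizon, explore_per_arm * k):
        arm = t % k
        reward = float(rng.random() < true_means[arm])
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
        t += 1

    # Commit phase: exploit the empirically best arm for all remaining steps.
    best = int(np.argmax(estimates))
    for _ in range(t, horizon):
        total_reward += float(rng.random() < true_means[best])

    return total_reward, best
```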
It’s sort of like the optimistic-initial-values approach the book mentions at some point, where you assign unrealistically high expected values to all bandits, so that any pull updates that arm’s estimate downward and greedy selection therefore alternates between the arms at the beginning. But that seems like a clunky way to do it.
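In code, that trick looks something like the following sketch (the `initial_value` of 5.0 is an arbitrary choice on my part, just meant to sit well above any reward the arms can actually pay out):

```python
import numpy as np

def optimistic_greedy(true_means, horizon, initial_value=5.0, seed=None):
    """Purely greedy selection with optimistic initial estimates (sketch).

    Every arm starts with an unrealistically high estimate, so each pull drags
    that arm's estimate down and greedy selection cycles through the arms
    before settling on whichever still looks best.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)
    estimates = np.full(k, initial_value)  # optimistic starting values

    total_reward = 0.0
    for _ in range(horizon):
        arm = int(np.argmax(estimates))  # greedy; no explicit exploration step
        reward = float(rng.random() < true_means[arm])
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return total_reward
```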
An aside: your link to the book below “Final Thoughts” is broken, there’s a bad ‘]’ at the end.
So I think that only works in environments which are both stationary and deterministic. Otherwise you’d need to frontload an infinite number of trials for each arm to get exact estimates; with any finite amount of exploration there’s a nonzero probability of spending the rest of time exploiting a suboptimal arm, which means infinite (linear-in-horizon) regret. This reminds me of conditional convergence, where you can’t always rearrange the terms of a series and have it still converge to the same sum, or at all. I think interleaving exploration and exploitation so as to minimize regret is the better way to go here.
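To put a number on that intuition, here’s a quick sketch (mine, with made-up arm means of 0.5 and 0.55) estimating how often explore-then-commit would lock onto the wrong arm after a finite exploration phase:

```python
import numpy as np

def prob_commit_to_wrong_arm(true_means, explore_per_arm, trials=10_000, seed=0):
    """Estimate how often explore-then-commit locks onto a suboptimal arm.

    After any finite number of pulls per arm, there is a nonzero probability
    that the empirically best arm is not the truly best one; ETC then exploits
    that arm forever, so its regret grows linearly with the horizon.
    """
    rng = np.random.default_rng(seed)
    true_means = np.asarray(true_means, dtype=float)
    best = int(np.argmax(true_means))
    wrong = 0
    for _ in range(trials):
        # Empirical mean of each arm after `explore_per_arm` Bernoulli pulls.
        pulls = rng.random((len(true_means), explore_per_arm)) < true_means[:, None]
        estimates = pulls.mean(axis=1)
        wrong += int(np.argmax(estimates) != best)
    return wrong / trials

# Two arms with means 0.5 and 0.55 and 20 exploratory pulls each: the
# failure probability comes out clearly nonzero (a sizeable fraction of runs).
# print(prob_commit_to_wrong_arm([0.5, 0.55], explore_per_arm=20))
```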
This more or less reduces to an offline supervised learning setup in which a bunch of samples are collected up front, a model is trained, and that model is then used to pick all future actions. You might be framing your problem as an MDP, but you’re not doing reinforcement learning in this case. As TurnTrout mentioned in a sibling to this comment, it only works in stationary, deterministic environments, which are toy problems for the field; ultimately the goal of RL is to function in non-stationary, non-deterministic environments, so it makes little sense to focus on this path.