I discussed this post recently with a colleague, who encouraged me to post this excerpt:
[Colleague] It seems like: 1. RL is in the business of finding optimal policies. (...)
[TurnTrout] I disagree, or at least think it’s not appropriate for it to be in that business these days. Reinforcement learning is, in my opinion, about learning from reinforcement, about how policy gradients accrue into interesting policies.
I think that a focus on optimal policies is a red herring, a holdover from the bygone age of tabular methods on tiny toy problems, where policy iteration really does find the optimal policy, and in reasonable time to boot.
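For readers unfamiliar with the contrast being drawn: on small tabular MDPs, policy iteration alternates exact policy evaluation with greedy improvement and provably terminates at an optimal policy. A minimal sketch on a hypothetical two-state, two-action MDP (the transition and reward numbers here are illustrative, not from any source):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers).
n_states, gamma = 2, 0.9

# P[s, a, s'] = transition probability; R[s, a] = expected reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

policy = np.zeros(n_states, dtype=int)  # start with action 0 everywhere
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
    P_pi = P[np.arange(n_states), policy]   # transition matrix under policy
    R_pi = R[np.arange(n_states), policy]   # reward vector under policy
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

    # Policy improvement: greedy one-step lookahead.
    Q = R + gamma * P @ V                   # action-values, shape (S, A)
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break                               # policy is stable, hence optimal
    policy = new_policy

print(policy)
```

This exhaustive evaluate-then-improve loop is exactly what stops scaling once the state space is too large to enumerate, which is the regime the quoted disagreement is about.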