“The Policy of Truth” is a blog post about why policy gradient / REINFORCE methods suck. I’m leaving a shortform comment because the post seems like a classic example of wrong RL theory and philosophy, since reward is not the optimization target. Quotes:
Our goal remains to find a policy that maximizes the total reward after L time steps.
And hence the following is a general purpose algorithm for maximizing rewards with respect to parametric distributions:
If you start with a reward function whose values are in [0,1] and you subtract one million from each reward, this will increase the running time of the algorithm by a factor of a million, even though the ordering of the rewards amongst parameter values remains the same.
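For reference, the “general purpose algorithm” in the quotes is (as far as I can tell) the standard score-function / log-derivative estimator behind REINFORCE, and the fact that makes the quoted slowdown look paradoxical at first is just this:

$$\nabla_\theta \, \mathbb{E}_{x \sim p_\theta}[R(x)] \;=\; \mathbb{E}_{x \sim p_\theta}\!\left[ R(x)\, \nabla_\theta \log p_\theta(x) \right],$$

and since $\mathbb{E}_{x \sim p_\theta}[\nabla_\theta \log p_\theta(x)] = 0$, subtracting any constant $c$ from the reward leaves that expectation unchanged:

$$\mathbb{E}_{x \sim p_\theta}\!\left[ (R(x) - c)\, \nabla_\theta \log p_\theta(x) \right] \;=\; \mathbb{E}_{x \sim p_\theta}\!\left[ R(x)\, \nabla_\theta \log p_\theta(x) \right].$$

What the shift does change is the variance of the single-sample estimates the algorithm actually steps on, which grows roughly like $c^2$.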
The last quoted claim is pretty easy to understand if you imagine each reward as providing a policy gradient update, rather than imagining the point of the algorithm as finding the policy which happens to be an expected fixed point under a representative policy gradient (aka the “optimal” policy). Of course making all the rewards hugely negative will mess with your convergence properties! It’s very much like using a far higher learning rate, and like getting only inexact per-sample gradients instead of exact ones (since every sampled action now gets pushed down, and the useful signal survives only in the relative differences between updates). Different dynamics.
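To make that concrete, here is a minimal numerical sketch (mine, not from the post, using a made-up two-arm bandit with a fixed softmax policy): it compares single-sample REINFORCE gradient estimates with rewards in [0, 1] against the same rewards shifted by minus one million. The exact expected gradient is identical in both cases, but the per-sample estimates the algorithm actually steps on become about a million times larger and noisier.

```python
import numpy as np

# Minimal sketch (hypothetical setup, not from the post): single-sample
# REINFORCE gradient estimates for a fixed two-arm softmax policy, with and
# without subtracting one million from every reward. The *exact* expected
# gradient is the same in both cases (a constant shift cancels in expectation,
# because E[grad log pi] = 0), but the individual per-sample estimates blow up
# by a factor of ~10^6 -- the "higher learning rate / inexact gradients" point.

rng = np.random.default_rng(0)

probs = np.array([0.5, 0.5])      # fixed policy over two arms
rewards = np.array([0.2, 0.8])    # rewards in [0, 1]

def exact_gradient(shift):
    """True gradient of E[reward + shift] w.r.t. the softmax logits."""
    g = np.zeros(2)
    for a in range(2):
        g += probs[a] * (rewards[a] + shift) * (np.eye(2)[a] - probs)
    return g

def sampled_gradients(shift, n=100_000):
    """Single-sample REINFORCE estimates: (r + shift) * grad log pi(a)."""
    a = rng.choice(2, p=probs, size=n)       # sampled actions
    r = rewards[a] + shift                   # possibly shifted rewards
    grad_logp = np.eye(2)[a] - probs         # grad of log pi(a) w.r.t. logits
    return r[:, None] * grad_logp            # per-sample gradient estimates

for shift in (0.0, -1e6):
    g = sampled_gradients(shift)
    print(f"shift={shift:>10.0f}:"
          f" exact grad={exact_gradient(shift).round(3)},"
          f" empirical mean={g.mean(axis=0).round(3)},"
          f" per-sample std={g.std(axis=0).round(3)}")
```

With the shift, even 100,000 samples are not enough for the empirical mean to recover the true gradient, while the unshifted case recovers it easily; each individual update is essentially a huge noisy kick, which is what wrecks the convergence rather than any change in which parameters are “best.”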
Policy gradient approaches should be judged on whether they can train interesting policies doing what we want, and not whether they make reward go brr. Often these are related, but they are importantly not the same thing.