In model-free RL, policy-based methods choose policies by optimizing a noisy estimate of the policy’s value. This is analogous to optimizing a noisy estimate of prediction accuracy (i.e., accuracy on the training data) to choose a predictive model. So we often need to trade variance for bias in the policy-learning case (i.e., shrink towards simpler policies) just as in the predictive modeling case.
In model-free RL, policy-based methods choose policies by optimizing a noisy estimate of the policy’s value. This is analogous to optimizing a noisy estimate of prediction accuracy (i.e., accuracy on the training data) to choose a predictive model. So we often need to trade variance for bias in the policy-learning case (i.e., shrink towards simpler policies) just as in the predictive modeling case.