TurnTrout comments on TurnTrout’s shortform feed

TurnTrout 13 Nov 2023 21:07 UTC
When writing about RL, I find it helpful to disambiguate between:

A) “The policy optimizes the reward function” / “The reward function gets optimized” (this might happen but has to be reasoned about), and
B) “The reward function optimizes the policy” / “The policy gets optimized (by the reward function and the data distribution)” (this definitely happens, either directly, e.g. via REINFORCE, or indirectly, via an advantage estimator in PPO; B follows from the update equations)
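
To make B concrete, here is a minimal sketch of a REINFORCE step on a two-armed bandit with a softmax policy. The names (`reinforce_step`, `train`, `reward_fn`), the learning rate, and the toy reward are all my own illustrative assumptions, not anything from a particular library. The thing to look at is the update line: the reward appears only as a scalar multiplier on the gradient of the log-probability of the sampled action, so the reward (together with which actions happened to be sampled) is what pushes the parameters around.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, reward_fn, rng, lr=0.1):
    """One REINFORCE update for a softmax bandit policy (illustrative sketch).

    The reward is used only as a scalar weight on grad log pi(action):
    the reward function, plus the sampled data, optimizes the policy.
    """
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(action)
    # For a softmax policy, d/d(logit_i) log pi(action) = 1[i == action] - probs[i].
    new_logits = [
        theta + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (theta, p) in enumerate(zip(logits, probs))
    ]
    return new_logits, action, reward


def train(steps=2000, seed=0):
    """Run repeated updates with a hypothetical toy reward: arm 1 pays 1, arm 0 pays 0."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]

    def reward_fn(action):
        return 1.0 if action == 1 else 0.0

    for _ in range(steps):
        logits, _, _ = reinforce_step(logits, reward_fn, rng)
    return softmax(logits)


if __name__ == "__main__":
    print(train())
```

In this toy setup the policy does end up strongly preferring the rewarded arm, but that is an outcome of running the updates on this particular data distribution. Whether a trained policy is itself "optimizing the reward function" in sense A is a separate claim that the update equation alone does not establish.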