When writing about RL, I find it helpful to disambiguate between:
A) “The policy optimizes the reward function” / “The reward function gets optimized” (this might happen but has to be reasoned about), and
B) “The reward function optimizes the policy” / “The policy gets optimized (by the reward function and the data distribution)” (this definitely happens, either directly, e.g. via REINFORCE, or indirectly, via an advantage estimator in PPO; B follows from the update equations).
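To make (B) concrete, here is a minimal sketch of a REINFORCE-style update on a two-action softmax policy. The setup is an assumption for illustration only: the names `softmax`, `reinforce_step`, and `train`, the learning rate, and the toy reward functions are all hypothetical. The point is that the reward appears as a multiplier inside the parameter update, so the policy is changed by the reward and by whichever actions happened to be sampled.

```python
import math
import random


def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, reward_fn, rng, lr=0.5):
    """One REINFORCE update: logits += lr * reward * grad log pi(action)."""
    probs = softmax(logits)
    # The data distribution: the action is sampled from the current policy.
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(action)
    # For a softmax policy, d/d logit_i of log pi(action) is
    # 1[i == action] - probs[i].
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - probs[i])
        for i, logit in enumerate(logits)
    ]


def train(reward_fn, steps=500, seed=0):
    """Run repeated updates from a uniform policy; return final probabilities."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(steps):
        logits = reinforce_step(logits, reward_fn, rng)
    return softmax(logits)


# Reward 1 for action 1, reward 0 for action 0 (a toy assumption).
final_probs = train(lambda a: 1.0 if a == 1 else 0.0)

# Reward 0 everywhere: the update term vanishes and the policy never moves.
unchanged_probs = train(lambda a: 0.0)
```

In this sketch, a reward of zero leaves the policy exactly where it started, while a constant reward of 1.0 for both actions still moves the logits at every step even though neither action is better than the other. That is (B) happening without any commitment to (A): the update is driven by the reward and the samples, and whether the resulting policy "optimizes the reward" is a separate question.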