The post explains the difference between two similar-looking positions:
1. The model receives the reward and attempts to maximize it by selecting appropriate actions.
2. The model performs actions in the environment, and a selection/mutation process computes the reward and uses it to decide how to modify the model so that the reward increases.
These positions differ in the level at which optimization pressure is applied. Either of them can be implemented (i.e., the "reward" value can in principle be passed to the model as an input), but common RL uses the second option.
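To make the distinction concrete, here is a minimal sketch of the second option in PyTorch (the environment, network shape, and reward function are made up for illustration): the reward never appears in the model's input; it only enters the training step that modifies the model's weights.

```python
import torch
import torch.nn as nn

# Hypothetical policy network: maps a 4-dimensional observation to 2 action logits.
policy = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def environment_step(state, action):
    # Hypothetical environment: returns a scalar reward for the chosen action.
    return 1.0 if action.item() == 0 else 0.0

state = torch.randn(4)                    # observation only; no reward channel
logits = policy(state)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                    # the model selects an action
reward = environment_step(state, action)  # reward is computed outside the model

# Optimization pressure is applied here, at the level of the weights:
# a REINFORCE-style update uses the reward to decide how to modify the model.
loss = -dist.log_prob(action) * reward
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Note that the model would only "see" the reward in the first sense if it were concatenated into its input (or context), which is not what the standard training loop above does.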