by definition “reward _must be_ what is optimized for by RL agents.”
This is not true, and the essay is meant to explain why. In vanilla policy gradient, reward $R$ on a trajectory $\tau$ will provide a set of gradients which push up logits on the actions $a_t$ which produced the trajectory. The gradient on the parameters $\theta$ which parameterize the policy $\pi_\theta$ is in the direction of increasing return $J$:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$$

You can read more about this here.
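To make the update concrete, here is a minimal sketch of one REINFORCE-style gradient step in PyTorch. This is not from the essay; the network, dimensions, hyperparameters, the `reinforce_update` helper, and the sample data are all placeholders chosen for illustration. The point it shows is the one the formula makes: the trajectory's return scales the gradient that pushes up the log-probabilities of the actions that were actually taken.

```python
# Sketch of a vanilla policy-gradient (REINFORCE) step, assuming PyTorch.
# Everything here (network size, action space, data) is a toy placeholder.
import torch
import torch.nn as nn

# Hypothetical tiny policy over 4 discrete actions given 8-dim observations.
policy = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reinforce_update(states, actions, trajectory_return):
    """One gradient step on sum_t log pi_theta(a_t | s_t) * R(tau).

    states: (T, 8) tensor of the observations the agent actually made
    actions: (T,) tensor of the actions a_t that were actually taken
    trajectory_return: scalar R(tau)
    """
    logits = policy(states)                        # (T, 4)
    log_probs = torch.log_softmax(logits, dim=-1)  # log pi_theta(. | s_t)
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi_theta(a_t | s_t)

    # Gradient ascent on J: upweight the logits of the actions that were
    # taken, in proportion to the return that followed them.
    loss = -(taken.sum() * trajectory_return)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage with made-up data: a 5-step trajectory that happened to earn return 1.0.
states = torch.randn(5, 8)
actions = torch.randint(0, 4, (5,))
reinforce_update(states, actions, 1.0)
```

Note that nothing in this update consults "the reward" as an object the policy reasons about; the return only appears as a multiplier on the gradient computed from the states the agent happened to visit.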
Less formally, the agent does stuff. Some stuff is rewarding. Rewarding actions get upweighted locally. That’s it. There’s no math here that says “and the agent shall optimize for reward explicitly”; the math actually says “the agent’s parameterization is locally optimized by reward on the data distribution of the observations it actually makes.” Reward simply chisels cognition into agents (at least, in PG-style setups).
In some settings, convergence results guarantee that this process converges to an optimal policy. As explained in the section “When is reward the optimization target of the agent?”, these settings probably don’t bear on smart alignment-relevant agents operating in reality.