by definition “reward _must be_ what is optimized for by RL agents.”
This is not true, and the essay is meant to explain why. In vanilla policy gradient, reward $R$ on a trajectory $\tau$ will provide a set of gradients which push up logits on the actions $a_t$ which produced the trajectory. The gradient on the parameters $\theta$ which parameterize the policy $\pi_\theta$ is in the direction of increasing return $J$:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$$

You can read more about this here.
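To make the update concrete, here is a minimal sketch of one REINFORCE-style gradient step in PyTorch. This is not from the essay; the network, dimensions, hyperparameters, the `reinforce_update` helper, and the sample data are all placeholders chosen for illustration. The point it shows is the one the formula makes: the trajectory's return scales the gradient that pushes up the log-probabilities of the actions that were actually taken.

```python
# Sketch of a vanilla policy-gradient (REINFORCE) step, assuming PyTorch.
# Everything here (network size, action space, data) is a toy placeholder.
import torch
import torch.nn as nn

# Hypothetical tiny policy over 4 discrete actions given 8-dim observations.
policy = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reinforce_update(states, actions, trajectory_return):
    """One gradient step on sum_t log pi_theta(a_t | s_t) * R(tau).

    states: (T, 8) tensor of the observations the agent actually made
    actions: (T,) tensor of the actions a_t that were actually taken
    trajectory_return: scalar R(tau)
    """
    logits = policy(states)                        # (T, 4)
    log_probs = torch.log_softmax(logits, dim=-1)  # log pi_theta(. | s_t)
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi_theta(a_t | s_t)

    # Gradient ascent on J: upweight the logits of the actions that were
    # taken, in proportion to the return that followed them.
    loss = -(taken.sum() * trajectory_return)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage with made-up data: a 5-step trajectory that happened to earn return 1.0.
states = torch.randn(5, 8)
actions = torch.randint(0, 4, (5,))
reinforce_update(states, actions, 1.0)
```

Note that nothing in this update consults "the reward" as an object the policy reasons about; the return only appears as a multiplier on the gradient computed from the states the agent happened to visit.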
Less formally, the agent does stuff. Some stuff is rewarding. Rewarding actions get upweighted locally. That’s it. There’s no math here that says “and the agent shall optimize for reward explicitly”; the math actually says “the agent’s parameterization is locally optimized by reward on the data distribution of the observations it actually makes.” Reward simply chisels cognition into agents (at least, in PG-style setups).
In some settings, convergence results guarantee that this process converges to an optimal policy. As explained in the section “When is reward the optimization target of the agent?”, these settings probably don’t bear on smart alignment-relevant agents operating in reality.