“Optimization target” is itself a concept that needs deconfusing/operationalizing. For a certain definition of optimization and impact, I’ve found that optimization is mostly correlated with reward, but that the learned policy typically has more impact on the world (optimizes the world more) than is strictly necessary to achieve a given amount of reward.
This uses an empirical metric of impact/optimization, which may or may not correlate well with algorithm-level measures of optimization targets.
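As a toy illustration only (not the metric from the linked post), here is a minimal sketch of what "empirical impact exceeding what reward requires" could look like: a 1-D gridworld where impact is crudely operationalized as the number of distinct cells a policy touches, compared against the minimum needed to reach the reward. The environment, the `rollout` helper, and the `policy_bias` parameter are all made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D gridworld: agent starts in the middle, reward only at the
# rightmost cell; every distinct cell visited counts as "impact".
N_CELLS = 11
GOAL = N_CELLS - 1
START = N_CELLS // 2
HORIZON = 20

def rollout(policy_bias, n_episodes=200):
    """Run episodes of a simple stochastic policy.

    policy_bias in [0, 1] is the probability of stepping right (toward
    the goal). Returns (mean reward, mean impact), where impact is the
    number of distinct cells visited per episode -- a crude empirical
    stand-in for "how much of the world the policy affects".
    """
    rewards, impacts = [], []
    for _ in range(n_episodes):
        pos, visited, ep_reward = START, {START}, 0.0
        for _ in range(HORIZON):
            step = 1 if rng.random() < policy_bias else -1
            pos = int(np.clip(pos + step, 0, N_CELLS - 1))
            visited.add(pos)
            if pos == GOAL:
                ep_reward += 1.0
        rewards.append(ep_reward)
        impacts.append(len(visited))
    return float(np.mean(rewards)), float(np.mean(impacts))

# Sweep over policies of varying competence and check how the
# empirical impact metric tracks reward.
biases = np.linspace(0.3, 0.95, 14)
results = [rollout(b) for b in biases]
reward = np.array([r for r, _ in results])
impact = np.array([i for _, i in results])

print(f"correlation(reward, impact) = {np.corrcoef(reward, impact)[0, 1]:.2f}")

# Minimum impact strictly needed to get any reward: the direct path from
# START to GOAL touches only GOAL - START + 1 cells; excess over that is
# impact beyond what the reward requires.
min_needed = GOAL - START + 1
print(f"impact at best policy: {impact[-1]:.1f} vs. minimum needed: {min_needed}")
```

In this kind of setup you'd typically see impact and reward rise together (high correlation), while even the best policy visits more cells than the direct path requires, which is the flavor of result described above.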
https://www.alignmentforum.org/posts/qEwCitrgberdjjtuW/measuring-learned-optimization-in-small-transformer-models