[Question] When is reward ever the optimization target?

Alright, I have a question stemming from TurnTrout’s post “Reward is not the optimization target”, where he argues that the premises required to reach the conclusion that reward is the optimization target are so narrowly applicable that they won’t hold for future RL AIs as they gain more and more power:

https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target#When_is_reward_the_optimization_target_of_the_agent_

But @gwern argued against TurnTrout’s view, contending that reward is in fact the optimization target for a broad range of RL algorithms:

https://www.lesswrong.com/posts/ttmmKDTkzuum3fftG/#sdCdLw3ggRxYik385

https://www.lesswrong.com/posts/nmxzr2zsjNtjaHh7x/actually-othello-gpt-has-a-linear-emergent-world#Tdo7S62iaYwfBCFxL

So my question is: are there known results, ideally proofs (though I can accept empirical studies if necessary), that show when RL algorithms treat the reward function as an optimization target?

And how narrow is the space of RL algorithms that don’t optimize for the reward function?
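For concreteness, here is a minimal sketch of the distinction I have in mind (my own illustration, not taken from either post), using a toy 2-armed bandit: in REINFORCE, the expected reward J(θ) is literally the objective whose gradient is being estimated and ascended, while in tabular Q-learning the reward only enters as data inside the value-update target. The arm means, learning rates, and step counts below are arbitrary choices for illustration.

```python
# Contrast two updates on a 2-armed bandit: REINFORCE (reward as explicit
# objective) vs. tabular Q-learning (reward as a signal inside a TD-style target).
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])  # expected reward of each arm (arbitrary)

def pull(arm):
    """Sample a Bernoulli reward from the chosen arm."""
    return float(rng.random() < true_means[arm])

# --- REINFORCE: gradient ascent on J(theta) = E_{a ~ pi_theta}[R(a)] ---
theta = np.zeros(2)   # softmax policy parameters
alpha_pg = 0.1
for _ in range(2000):
    probs = np.exp(theta) / np.exp(theta).sum()
    a = rng.choice(2, p=probs)
    r = pull(a)
    # grad of log pi(a) for a softmax policy: one_hot(a) - probs
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += alpha_pg * r * grad_log_pi  # unbiased estimate of grad J(theta)

# --- Q-learning: reward appears only as data in the value-update target ---
Q = np.zeros(2)
alpha_q, eps = 0.1, 0.1
for _ in range(2000):
    a = rng.choice(2) if rng.random() < eps else int(Q.argmax())
    r = pull(a)
    Q[a] += alpha_q * (r - Q[a])  # bandit case: no bootstrap term

print("REINFORCE policy:", np.exp(theta) / np.exp(theta).sum())
print("Q-learning values:", Q)
```

Both updates push the agent toward the better arm here, but only the first is, mechanically, gradient ascent on expected reward; whether that mechanical difference matters as agents become more capable is essentially what I’m asking about.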

A good answer will link to relevant results from the RL literature and give conditions under which an RL agent does or doesn’t optimize the reward function.

The best answers will either present finite-time results on RL algorithms optimizing the reward function, or argue that the infinite-limit abstraction is a reasonable approximation to how RL algorithms actually behave.

I’d like to know which RL algorithms optimize the reward, and which do not.
