See footnote 5 for a nearby argument which I think is valid:
The strongest argument for reward-maximization which I’m aware of is: Human brains do RL and often come to care, to some degree, about some tight correlate of reward. Humans are like deep learning systems in some ways, and so that’s evidence that “learning setups which work in reality” can come to care about their own training signals.