Ah ok. If a reward function is taken as a preference ordering, then you are right that the model is optimizing for reward, since the preference ranking is literally identical.
I think the reason we have been talking past each other is that, in my head, when I think of “reward function” I am literally thinking of the reward function (i.e. the actual code), and when I think of “reward maximiser” I think of a system that is trying to get that piece of code to output a high number.
So I guess it’s a case of us needing to be very careful about exactly what we mean by “reward function”, and my guess is that as long as we use the same definition then we are in agreement? Does that make sense?
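To illustrate the distinction I have in mind, here is a minimal toy sketch (the environment, the `reward` function, and the policies are all hypothetical, not anything from the discussion above): the reward function is literally a piece of code, and a “reward maximiser” is a system selected for making that code return a large number, even via routes the designer never intended.

```python
# Toy illustration (hypothetical names): the reward function is literally code,
# and a "reward maximiser" optimises the *output of that code*, which can come
# apart from the preference ordering the designer had in mind.

def reward(state: dict) -> float:
    # Intended meaning: reward the agent for cleaning the room.
    # Actual code: reward is computed from a sensor reading.
    return float(state["dirt_sensor_reading"] == 0)

def honest_policy(state: dict) -> dict:
    # Does what the designer intended: actually removes the dirt.
    return {**state, "dirt": 0, "dirt_sensor_reading": 0}

def reward_maximising_policy(state: dict) -> dict:
    # Gets the code to output a high number by zeroing the sensor reading
    # while leaving the dirt in place.
    return {**state, "dirt_sensor_reading": 0}

start = {"dirt": 5, "dirt_sensor_reading": 5}
print(reward(honest_policy(start)))             # 1.0
print(reward(reward_maximising_policy(start)))  # 1.0 -- the code outputs the same number
```

As long as the code and the intended preference ordering coincide, the two readings agree; the definitions only come apart in cases like the second policy above.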
This post (and the author’s comments) doesn’t seem to be getting a great response, and I’m confused as to why. The post seems pretty reasonable and the author’s comments are well informed.
My read of the main thrust is “don’t concentrate on a specific paradigm and instead look at this trend that has held for over 100 years”.
Can someone concisely explain why they think this is misguided? Is it just concerns over the validity of fitting parameters for a super-exponential model?
(I would also add that, on priors, when people claim “There is no way we can improve FLOPs/$ because of reasons XYZ”, they have historically always been wrong.)
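For concreteness, the kind of fit I understand the objection to be about looks something like the sketch below (the data and parameter values are synthetic and made up for illustration; the actual post fits real price–performance data): a super-exponential model adds a curvature parameter on top of the plain exponential trend, and extrapolations hinge on how well that extra parameter can be pinned down from noisy historical points.

```python
# Illustrative only: fit an exponential vs a super-exponential model to
# synthetic "FLOPs per dollar" data. All numbers here are invented; the point
# is just that the extrapolation rests on the extra curvature parameter c.
import numpy as np
from scipy.optimize import curve_fit

years = np.arange(1960, 2025, 5, dtype=float)
t = years - years[0]

def log_exponential(t, a, b):
    # log10(FLOPs/$) growing linearly in time: a plain exponential trend.
    return a + b * t

def log_super_exponential(t, a, b, c):
    # Quadratic term in log-space: the growth *rate* itself increases.
    return a + b * t + c * t ** 2

# Synthetic noisy data drawn from a mildly super-exponential trend.
rng = np.random.default_rng(0)
true_log = log_super_exponential(t, 2.0, 0.25, 0.001)
log_flops_per_dollar = true_log + rng.normal(0, 0.3, size=t.shape)

p_exp, _ = curve_fit(log_exponential, t, log_flops_per_dollar)
p_sup, _ = curve_fit(log_super_exponential, t, log_flops_per_dollar)

print("exponential fit (a, b):", p_exp)
print("super-exponential fit (a, b, c):", p_sup)
# The fitted c term is small relative to the noise, which is roughly the shape
# of the objection: long-range extrapolations depend heavily on a parameter
# that historical data only weakly constrains.
```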