That’s the training signal, not the utility function. Those are different things. (I believe this point was made in Reward is not the Optimization Target, though I could be wrong since I never actually read this post; corrections welcome.)
That’s the training signal, not the utility function. Those are different things. (I believe this point was made in Reward is not the Optimization Target, though I could be wrong since I never actually read this post; corrections welcome.)