Ah ok. If a reward function is taken as a preference ordering, then you are right that the model is optimizing for reward, since the preference ranking is literally identical.
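For concreteness, here is a minimal sketch (illustrative only, not from the original exchange) of two reward functions that are different pieces of code, and even output different numbers, but induce the exact same preference ordering over outcomes:

```python
def reward_a(outcome: float) -> float:
    return outcome          # raw score

def reward_b(outcome: float) -> float:
    return 2 * outcome + 1  # monotone transformation of reward_a

outcomes = [0.3, -1.2, 2.5, 0.0]

# Both functions rank the outcomes identically, so an agent optimizing
# one ordering is behaviourally indistinguishable from an agent
# optimizing the other.
assert sorted(outcomes, key=reward_a) == sorted(outcomes, key=reward_b)
```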
I think the reason we have been talking past each other is that, in my head, when I think of “reward function” I am literally thinking of the reward function as implemented (i.e. the actual code), and when I think of “reward maximiser” I think of a system that is trying to get that piece of code to output a high number.
So I guess it’s a case of us needing to be very careful about exactly what we mean by “reward function”, and my guess is that as long as we use the same definition, we are in agreement? Does that make sense?
It doesn’t have to be a preference ordering. My point was that, depending on the level of detail at which you consider the reward function, slightly different functions could be identical.
I don’t think it makes sense to tie a reward function to a piece of code; a function can have multiple implementations.
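As a hedged illustration of that point (the functions here are made up for the example): two different pieces of code can compute the same mathematical function, so identifying “the reward function” with one particular code artefact is underdetermined.

```python
def reward_impl_1(x: int) -> int:
    return x * (x + 1) // 2      # closed-form sum 1 + 2 + ... + x

def reward_impl_2(x: int) -> int:
    return sum(range(1, x + 1))  # iterative sum, same input-output mapping

# Different code, identical function.
assert all(reward_impl_1(n) == reward_impl_2(n) for n in range(100))
```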
My contention is that it seems possible for the model’s objective function to be identical (at the level of detail we care about) to the reward function. In that case, I think the model is indistinguishable from a reward maximiser, and it doesn’t make sense to say that it’s not a reward maximiser.