Even in this case, I would still say this model isn’t a “reward maximiser”. It is a “letter ‘e’ maximiser”.
But “reward” is governed by the number of letter ‘e’s in the output! If the objective function that the model optimises for and the reward function are identical[1], then saying the model is not a reward maximiser seems to me like a distinction without a difference.
(Epistemic status: you are way more technically informed on this than I am; I’m just trying to follow your reasoning.)
Modulo transformations that preserve the properties we’re interested in. E.g., if a utility function is considered solely as representing a preference ordering, that ordering is left unchanged by positive affine transformations (scaling all utilities by the same positive constant, or adding the same constant to all utilities, preserves the ordering).
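To make the footnote concrete, here is a small sketch (my own toy example, not from the thread) checking that a positive affine transformation of a utility function leaves the induced preference ordering unchanged:

```python
# A preference ordering over outcomes is unchanged by any positive affine
# transformation of the utility function, u' = a*u + b with a > 0.

def preference_order(utilities):
    """Return outcome indices sorted from most to least preferred."""
    return sorted(range(len(utilities)), key=lambda i: -utilities[i])

u = [3.0, -1.0, 7.0, 0.5]                # utilities for four outcomes
u_scaled = [2.0 * x + 10.0 for x in u]   # positive affine transform

print(preference_order(u))                                 # [2, 0, 3, 1]
print(preference_order(u) == preference_order(u_scaled))   # True
```

In this ordinal sense the two utility functions are interchangeable, even though their numerical outputs differ everywhere.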
Ah ok. If a reward function is taken as a preference ordering, then you are right: the model is optimizing for reward, as the preference ranking is literally identical.
I think the reason we have been talking past each other is that, in my head, when I think of “reward function” I am literally thinking of the reward function itself (i.e. the actual code), and when I think of “reward maximiser” I think of a system that is trying to get that piece of code to output a high number.
So I guess it’s a case of us needing to be very careful about exactly what we mean by “reward function”, and my guess is that as long as we use the same definition, we are in agreement? Does that make sense?
It doesn’t have to be a preference ordering. My point was that, depending on the level of detail at which you consider the reward function, slightly different functions could be identical.
I don’t think it makes sense to tie a reward function to a piece of code; a function can have multiple implementations.
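The point that a function is not tied to any one piece of code can be sketched directly (hypothetical implementations, using the letter-‘e’ reward from this thread): two syntactically different programs that compute the same reward are different code objects but, extensionally, the same function.

```python
# Two implementations of the letter-'e' reward function. As code they
# differ; as functions (input -> output) they are identical.

def reward_count(text: str) -> int:
    # Delegates to the built-in substring counter.
    return text.count("e")

def reward_loop(text: str) -> int:
    # Counts matching characters explicitly.
    total = 0
    for ch in text:
        if ch == "e":
            total += 1
    return total

samples = ["letter", "eeee", "xyz", ""]
print(all(reward_count(s) == reward_loop(s) for s in samples))  # True
```

A maximiser of one is necessarily a maximiser of the other, which is why identifying the reward function with a particular implementation seems too narrow.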
My contention is that it seems possible for the model’s objective function to be identical (at the level of detail we care about) to the reward function. In that case, the model is indistinguishable from a reward maximiser, and it doesn’t make sense to say that it’s not one.