Furthermore, it should be obvious that any learned goal will not be “get more reward”, but something else. The model doesn’t even see the reward!
Is this probabilistically true or necessarily true?
If the reward function is simple enough that some models in the selection space already optimise that function, then iterated selection for performance on that function will eventually select for models that are reward maximisers (in addition to models that just behave as if they were reward maximisers).
This particular statement seems too strong.
This is an interesting point. I can imagine a case where our assigned reward comes from a simple function (e.g. reward = number of letter ‘e’s in the output) and we also have a model which is doing some internal optimisation to maximise the number of ‘e’s produced in its output, so it is “goal-directed to produce lots of ’e’s”.
Even in this case, I would still say this model isn’t a “reward maximiser”. It is a “letter ‘e’ maximiser”.
(I also want to acknowledge that thinking this through makes me feel somewhat confused. I think what I said is correct. My guess is that the misunderstanding I highlight in the post is quite pervasive, and the language we use isn’t currently up to scratch for writing about these things clearly. Good job thinking of a case that pushes against my understanding!)
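To make the letter-‘e’ example above concrete, here is a minimal, hypothetical sketch (the names and setup are illustrative, not from the post): the reward function counts ‘e’s, and the toy model does an internal search over candidate outputs using its own objective. The model never sees or calls the reward function; its internal objective just happens to rank outputs the same way.

```python
# Hypothetical sketch of the letter-'e' example; names are illustrative.

def reward(output: str) -> int:
    """Training signal: reward = number of letter 'e's in the output."""
    return output.count("e")

class LetterEMaximiser:
    """A toy model that internally searches for outputs containing many 'e's.
    It never calls reward(); it is a "letter 'e' maximiser" rather than a
    "reward maximiser", even though its behaviour is exactly what the reward rewards."""

    def internal_objective(self, candidate: str) -> int:
        # The model's own goal: produce lots of 'e's.
        return candidate.count("e")

    def act(self, candidates: list[str]) -> str:
        return max(candidates, key=self.internal_objective)

model = LetterEMaximiser()
choice = model.act(["dog", "seventeen sleepy geese", "sphinx of black quartz"])
print(choice, reward(choice))  # behaviourally indistinguishable from reward-maximising
```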
Even in this case, I would still say this model isn’t a “reward maximiser”. It is a “letter ‘e’ maximiser”.
But “reward” is governed by the number of letter ’e’s in the output! If the objective function that the model optimises for and the reward function are identical[1], then saying the model is not a reward maximiser seems to me like a distinction without a difference.
(Epistemic status: you are way more technically informed on this than me; I’m just trying to follow your reasoning.)
[1] Modulo transformations that preserve the properties we’re interested in (e.g. if we consider a utility function as solely representing a preference ordering, that ordering is left unchanged by positive affine transformations [scaling all utilities by the same positive constant or adding the same constant to every utility preserves the ordering]).
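A quick toy check of the footnote (my own illustration, not from the thread): applying a positive affine transformation to the utilities leaves the induced preference ordering unchanged.

```python
# Toy check: a positive affine transformation (a*u + b, with a > 0) of a utility
# function preserves the preference ordering it induces.

utilities = {"A": 3.0, "B": 1.5, "C": -2.0}

def transform(u: float, a: float = 10.0, b: float = 7.0) -> float:
    return a * u + b  # a must be positive, or the ordering would be reversed

original_order = sorted(utilities, key=utilities.get, reverse=True)
new_order = sorted(utilities, key=lambda x: transform(utilities[x]), reverse=True)
assert original_order == new_order == ["A", "B", "C"]
```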
Ah, ok. If a reward function is taken as a preference ordering, then you are right that the model is optimising for reward, as the preference ranking is literally identical.
I think the reason we have been talking past each other is that, in my head, when I think of “reward function” I am literally thinking of the reward function itself (i.e. the actual code), and when I think of “reward maximiser” I think of a system that is trying to get that piece of code to output a high number.
So I guess it’s a case of us needing to be very careful about exactly what we mean by “reward function”, and my guess is that as long as we use the same definition, we are in agreement? Does that make sense?
It doesn’t have to be a preference ordering. My point was that, depending on the level of detail at which you consider the reward function, slightly different functions could be identical.
I don’t think it makes sense to tie a reward function to a piece of code; a function can have multiple implementations.
My contention is that it seems possible for the model’s objective function to be identical (at the level of detail we care about) to the reward function. In that case, I think the model is indistinguishable from a reward maximiser and it doesn’t make sense to say that it’s not a reward maximiser.
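To illustrate the point above about multiple implementations (again a hypothetical sketch, not anything from the thread): the two functions below are different pieces of code, but they return the same reward for every output, so at the extensional level (the level the discussion seems to care about) they are the same reward function.

```python
# Two different implementations of the same "count the letter 'e'" reward function.

def reward_v1(output: str) -> int:
    return output.count("e")

def reward_v2(output: str) -> int:
    total = 0
    for ch in output:
        if ch == "e":
            total += 1
    return total

# Extensionally identical: equal on every input, even though the code differs.
for text in ["", "letter e", "seventeen sleepy geese"]:
    assert reward_v1(text) == reward_v2(text)
```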