It seems to me that the basic conceptual point made in this post is entirely contained in our Risks from Learned Optimization paper. I might just be missing a point. You’ve certainly phrased things differently and made some specific points that we didn’t, but am I just misunderstanding something if I think the basic conceptual claims of this post (which seem to be presented as new) are implied by RFLO? If not, could you state briefly what is different?
(Note I am still surprised sometimes that people still think certain wireheading scenarios make sense despite them having read RFLO, so it’s plausible to me that we really didn’t communicate everything that’s in my head about this).
“Wireheading is improbable” is only half of the point of the essay.
The other main point is “reward functions are not the same type of object as utility functions.” I haven’t reread all of RFLO recently, but on a skim—RFLO consistently talks about reward functions as “objectives”:
The particular type of robustness problem that mesa-optimization falls into is the reward-result gap, the gap between the reward for which the system was trained (the base objective) and the reward that can be reconstructed from it using inverse reinforcement learning (the behavioral objective).
...
The assumption in that work is that a monotonic relationship between the learned reward and true reward indicates alignment, whereas deviations from that suggest misalignment. Building on this sort of research, better theoretical measures of alignment might someday allow us to speak concretely in terms of provable guarantees about the extent to which a mesa-optimizer is aligned with the base optimizer that created it.
Which is reasonable parlance, given that everyone else uses it, but I don’t find that terminology very useful for thinking about what kinds of inner cognition will be developed in the network. Reward functions + environmental data provide a series of cognitive-updates to the network, in the form of reinforcement schedules. The reward function is not necessarily an ‘objective’ at all.
(You might have privately known about this distinction. Fine by me! But I can’t back it out from a skim of RFLO, even already knowing the insight and looking for it.)
Maybe you have made a gestalt-switch I haven’t made yet, or maybe yours is a better way to communicate the same thing, but: the way I think of it is that the reward function is just a function from states to numbers, and the way the information contained in the reward function affects the model parameters is via reinforcement of pre-existing computations.
Is there a difference between saying:
A reward function is an objective function, but the only way that it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn’t actually encode in any way the “goal” of the model itself.
A reward function is not an objective function, and the only way that it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn’t actually encode in any way the “goal” of the model itself.
It seems to me that once you acknowledge the point about reinforcement, the additional statement that reward is not an objective doesn’t actually imply anything further about the mechanistic properties of deep reinforcement learners? It is just a way to put a high-level conceptual story on top of it, and in this sense it seems to me that this point is already known (and in particular, contained within RFLO), even though we talked of the base objective still as an “objective”.
However, it might be that while RFLO pointed out the same mechanistic understanding that you have in mind, calling it an objective tends in practice to not fully communicate that mechanistic understanding.
Or it might be that I am really not yet understanding that there is an actual difference in mechanistic understanding, or that my intuitions are still being misled by the wrong high-level concept even if I have the lower-level mechanistic understanding right.
(On the other hand, one reason to still call it an objective is because we really can think of the selection process, i.e. evolution/the learning algorithm of an RL agent, as having an objective but making imperfect choices, or we can think of the training objective as encoding a task that humans have in mind).
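As a rough illustration of the mechanistic picture both sides describe above (reward as a plain function from states to numbers, whose only effect on the parameters is to reinforce computations that already produced the action), here is a minimal sketch of a REINFORCE-style update on a softmax policy. The names `reward_fn` and `reinforce_update`, the "goal" state, and the learning rate are illustrative assumptions, not anything taken from RFLO or the post.

```python
import math


def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_fn(state):
    """Hypothetical reward: just a function from states to numbers."""
    return 1.0 if state == "goal" else 0.0


def reinforce_update(logits, action, reward, lr=0.5):
    """One REINFORCE step for a softmax policy.

    The reward enters only as a scalar multiplier on the gradient of
    log pi(action): it strengthens or weakens whatever computation
    already produced `action`. The policy never reads the reward.
    """
    probs = softmax(logits)
    # d/d logit_k of log pi(action) = 1[k == action] - probs[k]
    return [
        logit + lr * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]


logits = [0.0, 0.0]
rewarded = reinforce_update(logits, action=1, reward=reward_fn("goal"))
unrewarded = reinforce_update(logits, action=1, reward=reward_fn("elsewhere"))
print(softmax(rewarded), softmax(unrewarded))
```

In this sketch a zero reward leaves the parameters untouched and a positive reward makes the already-taken action more likely. Whether one calls `reward_fn` an "objective" changes nothing in the update itself, which is the question the two framings above are disputing.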
in this sense it seems to me that this point is already known (and in particular, contained within RFLO), even though we talked of the base objective still as an “objective”.
Where did RFLO point it out? RFLO talks about a mesa objective being different from the “base objective” (even though reward is not a kind of objective). IIRC on my skim most of the arguments were non-mechanistic reasoning about what gets selected for. (Which isn’t a knockdown complaint, but those arguments are also not about the mechanism.) Also see my comment to Evan.
Like, from my POV, people are reliably reasoning about what RL “selects for” via “lots of optimization pressure” on “high reward by the formal metric”, but who’s reasoning about what kinds of antecedent computations get reinforced when credit assignment activates? Can you give me examples of anyone else spelling this out in a straightforward fashion?
calling it an objective tends in practice to not fully communicate that mechanistic understanding.
Yeah, I think it just doesn’t communicate the mechanistic understanding (not even imperfectly, in most cases, I imagine). From my current viewpoint, I just wouldn’t call reward an objective at all, except in the context of learned antecedent-computation-reinforcement terminal values. It’s like if I said “My cake is red” when the cake is blue, I guess? IMO it’s just not how to communicate the concept.
On the other hand, one reason to still call it an objective is because we really can think of the selection process, i.e. evolution/the learning algorithm of an RL agent, as having an objective but making imperfect choices, or we can think of the training objective as encoding a task that humans have in mind
Why is this reasonable?
Very late reply, sorry.
“even though reward is not a kind of objective”, this is a terminological issue. In my view, calling an “antecedent-computation reinforcement criterion” an “objective” matches my definition of “objective”, and this is just a matter of terminology. The term “objective” is ill-defined enough that “even though reward is not a kind of objective” is a terminological claim about objective, not a claim about math/the world.
The idea that RL agents “reinforce antecedent computations” is completely core to our story of deception. You could not make sense of our argument for deception if you didn’t look at RL systems in this way. Viewing the base optimizer as “trying” to achieve an “objective” but “failing” because it is being “deceived” by the mesa optimizer is purely a metaphorical/terminological choice. It doesn’t negate the fact that we all understood that the base optimizer is just reinforcing “antecedent computations”. How else could you make sense of the story of deception, where an existing model, which represents the mesa optimizer, is being reinforced by the base optimizer because that existing model understands the base optimizer’s optimization process?
I am not claiming that the RFLO communicated this point well, just that it was understood and absolutely was core to the paper, and large parts of the paper wouldn’t even make sense if you didn’t have this insight. (Certainly the fact that we called it an objective doesn’t communicate the point, and it isn’t meant to).
I am not claiming that the RFLO communicated this point well, just that it was understood and absolutely was core to the paper, and large parts of the paper wouldn’t even make sense if you didn’t have this insight.
I think most ML practitioners do have implicit models of how reward chisels computation into agents, as seen with how they play around with e.g. reward shaping and such. It’s that I don’t perceive this knowledge to be engaged when some people reason about “optimization processes” and “selecting for high-reward models” on e.g. LW.
I just continue to think “I wouldn’t write RFLO the way it was written, if I had deeply and consciously internalized the lessons of OP”, but it’s possible this is a terminological/framing thing. Your comment does update me some, but I think I mostly retain my view here. I do totally buy that you all had good implicit models of the reward-chiseling point.
FWIW, I think a bunch of my historical frustration here has been an experience of:
Pointing out the “reward chisels computation” point
Having some people tell me it’s obvious, or already known, or that they already invented it
Seeing some of the same people continue making similar mistakes (according to me)
Not finding instances of other people making these points before OP
Continuing (AFAICT) to correct people on (what I claim to be) mistakes around reward and optimization targets, and (for a while) was ~the only one doing so.
If I found several comments explaining what is clearly the “reward chisels computation” point, where the comments were posted before this post, by people who weren’t me or downstream of my influence, I would update against my points being novel and towards my points using different terminology.
IIRC there’s one comment from Wei_Dai from a few years back in this vein, but IDK of others.
There is a general phenomenon where:
Person A has mental model X and tries to explain X with explanation Q
Person B doesn’t get model X from Q, thinks a bit, and then writes explanation P, reads P and thinks: P is how it should have been explained all along, and Q didn’t actually contain the insights, but P does.
Person C doesn’t get model X from P, thinks a bit, and then writes explanation R, reads R and thinks: …
It seems to me quite likely that you are person B, thinking they explained something because THEY think their explanation is very good and contains all the insights that the previous ones didn’t. Some of the evidence for this is in fact contained in your very comment:
“1. Pointing out the “reward chisels computation” point. 2. Having some people tell me it’s obvious, or already known, or that they already invented it. 3. Seeing some of the same people continue making similar mistakes (according to me)”
So point 3 basically almost definitively proves that your mental model is not conveyed to those people in your post, does it not? I think a similar thing happened where that mental model was not conveyed to you from RFLO, even though we tried to convey it. (btw not saying the models that RFLO tried to explain are the same as this post, but the basic idea of this post definitely is a part of RFLO).
BTW, it could in fact be that person B’s explanation is clearer. (otoh, I think some things are less clear, e.g. you talk about “the” optimization target, which I would say is referring to that of the mesa-optimizer, without clearly assuming there is a mesa-optimizer. We stated the terms mesa- and base-optimizer to clearly make the distinction. There are a bunch of other things that I think are just imprecise, but let’s not get into it).
“Continuing (AFAICT) to correct people on (what I claim to be) mistakes around reward and optimization targets, and (for a while) was ~the only one doing so.”
I have been correcting people for a while on stuff like that (though not on LW, I’m not often on LW), such as that in the generic case we shouldn’t expect wireheading from RL agents unless the option of wireheading is in the training environment, for basically these reasons. I would also have expected people to just get this after reading RFLO, but many didn’t (others did), so your points 1/2/3 also apply to me.
“I do totally buy that you all had good implicit models of the reward-chiseling point”. I don’t think we just “implicitly” modeled it, we very explicitly understood it and it ran throughout our whole thinking about the topic. Again, explaining stuff is hard though, I’m not claiming we conveyed everything well to everyone (clearly you haven’t either).
I want to note that I just reread Utility ≠ Reward and was pleasantly surprised by its treatment, as well as the hedges. I’m making an upwards update on these points having been understood by at least some thinkers, although I’ve also made a lot of downward updates for other reasons.
Thanks for this comment! I think it makes some sense (but would have been easier to read given meaningful variable names).
Bob’s alignment strategy is that he wants X = X1 = Y = Y1 = Z = Z1. Also he wants the end result to be an agent whose good behaviours (Z) are in fact maximising a utility function at all (in this case, Z1).
I either don’t understand the semantics of “=” here, or I disagree. Bob’s strategy doesn’t make sense because X and Z have type behavior, X1 and Z1 have type utility function, Y is some abstract reward function over some mathematical domain, Y1 is an empirical set of reinforcement events.
It still seems to me like there is an error being made, such that Bob and Carol aren’t just trying to do different things or using different terminology, but that also Bob’s alignment strategy isn’t type-sensible or -coherent.
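One way to read the type objection above is to write the four kinds of object down as distinct types. The sketch below is only an assumption about what the variables denote (X and Z as behaviors, X1 and Z1 as utility functions, Y as a reward function, Y1 as an empirical list of reinforcement events). All class names, the `collect_events` and `same_kind` helpers, and the toy states are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

State = str
Action = str


@dataclass(frozen=True)
class Behavior:
    """X, Z: what the agent actually does in each state."""
    policy: Callable[[State], Action]


@dataclass(frozen=True)
class UtilityFunction:
    """X1, Z1: a ranking that some behavior might be maximizing."""
    utility: Callable[[State], float]


@dataclass(frozen=True)
class RewardFunction:
    """Y: an abstract function from states to numbers."""
    reward: Callable[[State], float]


@dataclass(frozen=True)
class ReinforcementEvent:
    """One element of Y1: a reward actually delivered during training."""
    state: State
    action: Action
    reward: float


def collect_events(behavior, reward_fn, states) -> List[ReinforcementEvent]:
    """Y1 arises from running a behavior against Y on concrete states."""
    return [
        ReinforcementEvent(s, behavior.policy(s), reward_fn.reward(s))
        for s in states
    ]


def same_kind(a, b) -> bool:
    """A crude stand-in for asking whether 'a = b' is even well-typed."""
    return type(a) is type(b)


behavior = Behavior(policy=lambda s: "eat" if s == "hungry" else "wait")
reward_fn = RewardFunction(reward=lambda s: 1.0 if s == "hungry" else 0.0)
events = collect_events(behavior, reward_fn, ["hungry", "full"])
print(same_kind(behavior, reward_fn), same_kind(reward_fn, events))
```

Under these assumed types, a chain like X = X1 = Y = Y1 compares objects of different kinds, so "=" would need to mean something weaker (for example "is consistent with" or "induces") before the strategy can be evaluated at all.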
Reward functions often are structured as objectives, which is why we talk about them that way. In most situations, if you had access to e.g. AIXI, you could directly build a “reward maximizer.”
I agree that this is not always the case, though, as in the discussion here. That being said, I think it is often enough the case that it made sense to focus on that particular case in RFLO.
Reward functions often are structured as objectives
What does this mean? By “structured as objectives”, do you mean something like “people try to express what they want with a reward function, by conferring more reward to more desirable states”? (I’m going to assume so for the rest of the comment, LMK if this is wrong.)
I agree that other people (especially my past self) think about reward functions this way. I think they’re generally wrong to do so, and it’s misleading as to the real nature of the alignment problem.
I agree that this is not always the case, though, as in the discussion here.
I agree with that post, thanks for linking.
if you had access to e.g. AIXI, you could directly build a “reward maximizer.”
I agree that this is not always the case, though, as in the discussion here. That being said, I think it is often enough the case that it made sense to focus on that particular case in RFLO.
As far as I can tell, AIXI and other hardcoded planning agents are the known exceptions to the arguments in this post. We will not get AGI via these approaches. When else is it the case? I therefore still feel confused why you think it made sense.
While I definitely appreciate the work you all did with RFLO, the framing of reward as a “base objective” seems like a misstep that set discourse in a weird direction which I’m trying to push back on (from my POV!). I think that the “base objective” is better described as a “cognitive-update-generator.” (This is not me trying to educate you on this specific point, but rather argue that it really matters how we frame the problem in our day-to-day reasoning.)