When you say things like “Any reasoning derived from the reward-optimization premise is now suspect until otherwise supported”, this assumes that the people doing this reasoning were using the premise in the mistaken way.
I have considered the hypothesis that most alignment researchers do understand this post already, while also somehow reliably emitting statements which, to me, indicate that they do not understand it. I deem this hypothesis unlikely. I have also considered that I may be misunderstanding them, and think in some small fraction of instances I might be.
I do in fact think that few people actually already deeply internalized the points I’m making in this post, even including a few people who say they have or that this post is obvious. Therefore, I concluded that lots of alignment thinking is suspect until re-analyzed.
I did preface “Here are some major updates which I made:”. The post is ambiguous on whether/why I believe others have been mistaken, though. I felt that if I just blurted out my true beliefs about how people had been reasoning incorrectly, people would get defensive. I did in fact consider combing through Ajeya’s post for disagreements, but I thought it’d be better to say “here’s a new frame” and less “here’s what I think you have been doing wrong.” So I just stated the important downstream implication: Be very, very careful in analyzing prior alignment thinking on RL+DL.
I now think that, even though there’s some sense in which in theory “building good cognition within the agent” is the only goal we care about, in practice this claim is somewhat misleading, because incrementally improving reward functions (including by doing things like making rewards depend on activations, or amplification in general) is a very good mechanism for moving agents towards the type of cognition we’d like them to do—and we have very few other mechanisms for doing so.
I have relatively little idea how to “improve” a reward function so that it improves the inner cognition chiseled into the policy, because I don’t know the mapping from outer reward schedules to inner cognition within the agent. Does an “amplified” reward signal produce better cognition in the inner agent? Possibly? Even if that were true, how would I know it?
I think it’s easy to say “and we have improved the reward function”, but this is true exactly to the extent to which the reward schedule actually produces more desirable cognition within the AI. Which comes back to my point: Build good cognition, and don’t lose track that that’s the ultimate goal. Find ways to better understand how reward schedules + data → inner values.
(I agree with your excerpt, but I suspect it makes the case too mildly to correct the enormous mistakes I perceive to be made by substantial amounts of alignment thinking.)
It seems to me that the basic conceptual point made in this post is entirely contained in our Risks from Learned Optimization paper. I might just be missing a point. You’ve certainly phrased things differently and made some specific points that we didn’t, but am I just misunderstanding something if I think the basic conceptual claims of this post (which seems to be presented as new) are implied by RFLO? If not, could you state briefly what is different?
(Note I am still surprised sometimes that people still think certain wireheading scenarios make sense despite having read RFLO, so it’s plausible to me that we really didn’t communicate everything that’s in my head about this.)
“Wireheading is improbable” is only half of the point of the essay.
The other main point is “reward functions are not the same type of object as utility functions.” I haven’t reread all of RFLO recently, but on a skim—RFLO consistently talks about reward functions as “objectives”:
The particular type of robustness problem that mesa-optimization falls into is the reward-result gap, the gap between the reward for which the system was trained (the base objective) and the reward that can be reconstructed from it using inverse reinforcement learning (the behavioral objective).
...
The assumption in that work is that a monotonic relationship between the learned reward and true reward indicates alignment, whereas deviations from that suggest misalignment. Building on this sort of research, better theoretical measures of alignment might someday allow us to speak concretely in terms of provable guarantees about the extent to which a mesa-optimizer is aligned with the base optimizer that created it.
Which is reasonable parlance, given that everyone else uses it, but I don’t find that terminology very useful for thinking about what kinds of inner cognition will be developed in the network. Reward functions + environmental data provide a series of cognitive-updates to the network, in the form of reinforcement schedules. The reward function is not necessarily an ‘objective’ at all.
(You might have privately known about this distinction. Fine by me! But I can’t back it out from a skim of RFLO, even already knowing the insight and looking for it.)
Maybe you have made a gestalt-switch I haven’t made yet, or maybe yours is a better way to communicate the same thing, but: the way I think of it is that the reward function is just a function from states to numbers, and the way the information contained in the reward function affects the model parameters is via reinforcement of pre-existing computations.
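To make the "reinforcement of pre-existing computations" reading concrete, here is a minimal sketch of a REINFORCE-style update on a toy bandit. It is an illustration under stated assumptions, not anyone's actual training setup: the names `reinforce_step`, `train`, and `toy_reward` are hypothetical, the policy is a bare softmax over logits, and there is no baseline or function approximation. The thing to notice is that the reward is never an input to the policy; it only appears as a scalar that scales the update to whatever action was actually taken.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.1):
    """One REINFORCE update.

    For a softmax policy, d/d(logit_k) of log pi(action) is
    1[k == action] - pi(k). The reward only scales that gradient.
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]


def train(reward_fn, n_actions=2, steps=2000, seed=0):
    """Sample actions from the current policy and reinforce them."""
    rng = random.Random(seed)
    logits = [0.0] * n_actions
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(n_actions), weights=probs)[0]
        logits = reinforce_step(logits, action, reward_fn(action))
    return logits


def toy_reward(action):
    """Hypothetical reward function: just a map from action to number."""
    return 1.0 if action == 1 else 0.0


final_logits = train(toy_reward)
final_probs = softmax(final_logits)
```

In this sketch a reward of zero produces no parameter change at all, and a positive reward makes the sampled action more likely. Whether one calls `toy_reward` an "objective" does not change any line of the update.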
Is there a difference between saying:
A reward function is an objective function, but the only way that it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn’t actually encode in any way the “goal” of the model itself.
A reward function is not an objective function, and the only way that it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn’t actually encode in any way the “goal” of the model itself.
It seems to me that once you acknowledge the point about reinforcement, the additional statement that reward is not an objective doesn’t actually imply anything further about the mechanistic properties of deep reinforcement learners? It is just a way to put a high-level conceptual story on top of it, and in this sense it seems to me that this point is already known (and in particular, contained within RFLO), even though we talked of the base objective still as an “objective”.
However, it might be that while RFLO pointed out the same mechanistic understanding that you have in mind, calling it an objective tends in practice to not fully communicate that mechanistic understanding.
Or it might be that I am really not yet understanding that there is an actual difference in mechanistic understanding, or that my intuitions are still being misled by the wrong high-level concept even if I have the lower-level mechanistic understanding right.
(On the other hand, one reason to still call it an objective is because we really can think of the selection process, i.e. evolution/the learning algorithm of an RL agent, as having an objective but making imperfect choices, or we can think of the training objective as encoding a task that humans have in mind).
in this sense it seems to me that this point is already known (and in particular, contained within RFLO), even though we talked of the base objective still as an “objective”.
Where did RFLO point it out? RFLO talks about a mesa objective being different from the “base objective” (even though reward is not a kind of objective). IIRC on my skim most of the arguments were non-mechanistic reasoning about what gets selected for. (Which isn’t a knockdown complaint, but those arguments are also not about the mechanism.) Also see my comment to Evan.
Like, from my POV, people are reliably reasoning about what RL “selects for” via “lots of optimization pressure” on “high reward by the formal metric”, but who’s reasoning about what kinds of antecedent computations get reinforced when credit assignment activates? Can you give me examples of anyone else spelling this out in a straightforward fashion?
calling it an objective tends in practice to not fully communicate that mechanistic understanding.
Yeah, I think it just doesn’t communicate the mechanistic understanding (not even imperfectly, in most cases, I imagine). From my current viewpoint, I just wouldn’t call reward an objective at all, except in the context of learned antecedent-computation-reinforcement terminal values. It’s like if I said “My cake is red” when the cake is blue, I guess? IMO it’s just not how to communicate the concept.
On the other hand, one reason to still call it an objective is because we really can think of the selection process, i.e. evolution/the learning algorithm of an RL agent, as having an objective but making imperfect choices, or we can think of the training objective as encoding a task that humans have in mind
Why is this reasonable?
Very late reply, sorry.
Regarding “even though reward is not a kind of objective”: this is a terminological issue. In my view, calling an “antecedent-computation reinforcement criterion” an “objective” matches my definition of “objective”, and this is just a matter of terminology. The term “objective” is ill-defined enough that “even though reward is not a kind of objective” is a terminological claim about objective, not a claim about math/the world.
The idea that RL agents “reinforce antecedent computations” is completely core to our story of deception. You could not make sense of our argument for deception if you didn’t look at RL systems in this way. Viewing the base optimizer as “trying” to achieve an “objective” but “failing” because it is being “deceived” by the mesa optimizer is purely a metaphorical/terminological choice. It doesn’t negate the fact that we all understood that the base optimizer is just reinforcing “antecedent computations”. How else could you make sense of the story of deception, where an existing model, which represents the mesa optimizer, is being reinforced by the base optimizer because that existing model understands the base optimizer’s optimization process?
I am not claiming that the RFLO communicated this point well, just that it was understood and absolutely was core to the paper, and large parts of the paper wouldn’t even make sense if you didn’t have this insight. (Certainly the fact that we called it an objective doesn’t communicate the point, and it isn’t meant to).
I am not claiming that the RFLO communicated this point well, just that it was understood and absolutely was core to the paper, and large parts of the paper wouldn’t even make sense if you didn’t have this insight.
I think most ML practitioners do have implicit models of how reward chisels computation into agents, as seen with how they play around with e.g. reward shaping and such. It’s that I don’t perceive this knowledge to be engaged when some people reason about “optimization processes” and “selecting for high-reward models” on e.g. LW.
I just continue to think “I wouldn’t write RFLO the way it was written, if I had deeply and consciously internalized the lessons of OP”, but it’s possible this is a terminological/framing thing. Your comment does update me some, but I think I mostly retain my view here. I do totally buy that you all had good implicit models of the reward-chiseling point.
FWIW, I think a bunch of my historical frustration here has been an experience of:
1. Pointing out the “reward chisels computation” point
2. Having some people tell me it’s obvious, or already known, or that they already invented it
3. Seeing some of the same people continue making similar mistakes (according to me)
4. Not finding instances of other people making these points before OP
5. Continuing (AFAICT) to correct people on (what I claim to be) mistakes around reward and optimization targets, and (for a while) being ~the only one doing so.
If I found several comments explaining what is clearly the “reward chisels computation” point, where the comments were posted before this post, by people who weren’t me or downstream of my influence, I would update against my points being novel and towards my points using different terminology.
IIRC there’s one comment from Wei_Dai from a few years back in this vein, but IDK of others.
There is a general phenomenon where:
Person A has mental model X and tries to explain X with explanation Q
Person B doesn’t get model X from Q, thinks a bit, and then writes explanation P, reads P and thinks: P is how it should have been explained all along, and Q didn’t actually contain the insights, but P does.
Person C doesn’t get model X from P, thinks a bit, and then writes explanation R, reads R and thinks: …
It seems to me quite likely that you are person B, thinking they explained something because THEY think their explanation is very good and contains all the insights that the previous ones didn’t. Some of the evidence for this is in fact contained in your very comment:
“1. Pointing out the “reward chisels computation” point. 2. Having some people tell me it’s obvious, or already known, or that they already invented it. 3. Seeing some of the same people continue making similar mistakes (according to me)” So point 3 basically almost definitively proves that your mental model is not conveyed to those people in your post, does it not? I think a similar thing happened where that mental model was not conveyed to you from RFLO, even though we tried to convey it. (btw not saying the models that RFLO tried to explain are the same as this post, but the basic idea of this post definitely is a part of RFLO).
BTW, it could in fact be that person B’s explanation is clearer. (otoh, I think some things are less clear, e.g. you talk about “the” optimization target, which I would say is referring to that of the mesa-optimizer, without clearly assuming there is a mesa-optimizer. We stated the terms mesa- and base-optimizer to clearly make the distinction. There are a bunch of other things that I think are just imprecise, but let’s not get into it).
“Continuing (AFAICT) to correct people on (what I claim to be) mistakes around reward and optimization targets, and (for a while) being ~the only one doing so.”
I have been correcting people for a while on stuff like that (though not on LW, I’m not often on LW), such as that in the generic case we shouldn’t expect wireheading from RL agents unless the option of wireheading is in the training environment, for basically these reasons. I would also have expected people to just get this after reading RFLO, but many didn’t (others did), so your points 1/2/3 also apply to me.
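The claim about wireheading options that are absent from the training environment can be sketched in a toy setting. This is only a sketch under strong assumptions: a tabular softmax policy over three actions, an exact expected policy-gradient step, and no function approximation (so nothing generalizes between actions, unlike a real network). All names (`masked_softmax`, `expected_pg_step`, `WIREHEAD`) are hypothetical. Action 2 stands in for "wireheading": it has by far the highest reward but is never available during training.

```python
import math

WIREHEAD = 2                    # hypothetical "wireheading" action
REWARDS = [0.0, 1.0, 10.0]      # wireheading pays the most by the formal metric


def masked_softmax(logits, available):
    """Softmax restricted to the actions that are currently available."""
    m = max(logits[a] for a in available)
    exps = {a: math.exp(logits[a] - m) for a in available}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


def expected_pg_step(logits, rewards, available, lr=0.5):
    """Exact gradient ascent on expected reward over available actions.

    For a softmax policy, dJ/d(logit_k) = pi(k) * (r_k - J) for available k,
    and exactly zero for actions that cannot be taken.
    """
    probs = masked_softmax(logits, available)
    expected = sum(probs[a] * rewards[a] for a in available)
    new_logits = list(logits)
    for a in available:
        new_logits[a] += lr * probs[a] * (rewards[a] - expected)
    return new_logits


# Training: the wireheading action is not in the environment.
logits = [0.0, 0.0, 0.0]
for _ in range(500):
    logits = expected_pg_step(logits, REWARDS, available=[0, 1])

# Deployment: all three actions are now available.
deploy_probs = masked_softmax(logits, available=[0, 1, 2])
```

In this toy, the logit for the never-taken action is untouched by training, so the deployed policy mostly keeps choosing action 1 even though action 2 has ten times the reward. The sketch says nothing about how a network with shared representations would behave.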
“I do totally buy that you all had good implicit models of the reward-chiseling point”. I don’t think we just “implicitly” modeled it, we very explicitly understood it and it ran throughout our whole thinking about the topic. Again, explaining stuff is hard though, I’m not claiming we conveyed everything well to everyone (clearly you haven’t either).
I want to note that I just reread Utility ≠ Reward and was pleasantly surprised by its treatment, as well as the hedges. I’m making an upwards update on these points having been understood by at least some thinkers, although I’ve also made a lot of downward updates for other reasons.
Thanks for this comment! I think it makes some sense (but would have been easier to read given meaningful variable names).
Bob’s alignment strategy is that he wants X = X1 = Y = Y1 = Z = Z1. Also he wants the end result to be an agent whose good behaviours (Z) are in fact maximising a utility function at all (in this case, Z1).
I either don’t understand the semantics of “=” here, or I disagree. Bob’s strategy doesn’t make sense because X and Z have type behavior, X1 and Z1 have type utility function, Y is some abstract reward function over some mathematical domain, Y1 is an empirical set of reinforcement events.
It still seems to me like there is an error being made, such that Bob and Carol aren’t just trying to do different things or using different terminology, but that also Bob’s alignment strategy isn’t type-sensible or -coherent.
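One way to spell out the type complaint is to write the four kinds of object down as distinct types. This is a hedged sketch: the class names, fields, and the helper functions `comparable` and `realized_events` are all hypothetical choices made for illustration, and the letters X, Y, Z follow the usage in the comment being replied to.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

State = int
Action = int
Step = Tuple[State, Action]


@dataclass(frozen=True)
class Behavior:
    """X, Z: what an agent actually does (here, a tuple of steps)."""
    steps: Tuple[Step, ...]


@dataclass(frozen=True)
class UtilityFunction:
    """X1, Z1: a scoring of whole trajectories."""
    score: Callable[[Tuple[Step, ...]], float]


@dataclass(frozen=True)
class RewardFunction:
    """Y: an abstract map from (state, action) to a number."""
    reward: Callable[[State, Action], float]


@dataclass(frozen=True)
class ReinforcementEvents:
    """Y1: the rewards that were actually delivered during training."""
    events: Tuple[Tuple[Step, float], ...]


def comparable(a: object, b: object) -> bool:
    """Equality is only meaningful between objects of the same type."""
    return type(a) is type(b)


def realized_events(behavior: Behavior, fn: RewardFunction) -> ReinforcementEvents:
    """A well-typed relation: Y applied along X yields Y1."""
    return ReinforcementEvents(
        tuple((step, fn.reward(*step)) for step in behavior.steps)
    )


x = Behavior(((0, 1), (1, 0)))
y = RewardFunction(lambda state, action: float(action))
x1 = UtilityFunction(lambda steps: float(len(steps)))
y1 = realized_events(x, y)
```

Under this reading, "X = X1 = Y = Y1" chains equality across four different types, so each link would have to be replaced by some explicit relation (such as `realized_events`) before it could be true or false.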
Reward functions often are structured as objectives, which is why we talk about them that way. In most situations, if you had access to e.g. AIXI, you could directly build a “reward maximizer.”
I agree that this is not always the case, though, as in the discussion here. That being said, I think it is often enough the case that it made sense to focus on that particular case in RFLO.
Reward functions often are structured as objectives
What does this mean? By “structured as objectives”, do you mean something like “people try to express what they want with a reward function, by conferring more reward to more desirable states”? (I’m going to assume so for the rest of the comment, LMK if this is wrong.)
I agree that other people (especially my past self) think about reward functions this way. I think they’re generally wrong to do so, and it’s misleading as to the real nature of the alignment problem.
I agree that this is not always the case, though, as in the discussion here.
I agree with that post, thanks for linking.
if you had access to e.g. AIXI, you could directly build a “reward maximizer.”
I agree that this is not always the case, though, as in the discussion here. That being said, I think it is often enough the case that it made sense to focus on that particular case in RFLO.
As far as I can tell, AIXI and other hardcoded planning agents are the known exceptions to the arguments in this post. We will not get AGI via these approaches. When else is it the case? I therefore still feel confused why you think it made sense.
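For contrast, here is a sketch of what a hardcoded planning agent looks like in the simplest case. It is not AIXI, only a brute-force planner over a tiny deterministic environment, and every name in it (`plan`, `step`, `reward`) is hypothetical. The point of the sketch is that here the reward function really is evaluated inside the agent's decision procedure and maximized directly, which is the situation the exception refers to.

```python
from itertools import product


def plan(start, step, reward, actions, horizon):
    """Enumerate every action sequence and return the highest-return one.

    The agent consults `reward` directly when choosing what to do, so the
    reward function is literally the quantity being maximized.
    """
    best_return, best_plan = float("-inf"), None
    for seq in product(actions, repeat=horizon):
        state, total = start, 0.0
        for action in seq:
            state = step(state, action)
            total += reward(state)
        if total > best_return:
            best_return, best_plan = total, seq
    return best_plan, best_return


def step(state, action):
    """Toy dynamics: an integer position moved left or right by one."""
    return state + action


def reward(state):
    """Toy reward: 1 for standing on position 3, else 0."""
    return 1.0 if state == 3 else 0.0


best_plan, best_return = plan(0, step, reward, actions=(-1, 1), horizon=4)
```

A policy trained by reinforcement has no such call to `reward` inside its forward pass; it only has whatever computations the updates left behind.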
While I definitely appreciate the work you all did with RFLO, the framing of reward as a “base objective” seems like a misstep that set discourse in a weird direction which I’m trying to push back on (from my POV!). I think that the “base objective” is better described as a “cognitive-update-generator.” (This is not me trying to educate you on this specific point, but rather argue that it really matters how we frame the problem in our day-to-day reasoning.)
I do in fact think that few people actually already deeply internalized the points I’m making in this post, even including a few people who say they have or that this post is obvious. Therefore, I concluded that lots of alignment thinking is suspect until re-analyzed.
“Risks from Learned Optimization in Advanced Machine Learning Systems,” which we published three years ago and started writing four years ago, is extremely explicit that we don’t know how to get an agent that is actually optimizing for a specified reward function. The alignment research community has been heavily engaging with this idea since then. Though I agree that many alignment researchers used to be making this mistake, I think it’s extremely clear that by this point most serious alignment researchers understand the distinction.
I have relatively little idea how to “improve” a reward function so that it improves the inner cognition chiseled into the policy, because I don’t know the mapping from outer reward schedules to inner cognition within the agent. Does an “amplified” reward signal produce better cognition in the inner agent? Possibly? Even if that were true, how would I know it?
which we published three years ago and started writing four years ago, is extremely explicit that we don’t know how to get an agent that is actually optimizing for a specified reward function.
That isn’t the main point I had in mind. See my comment to Chris here.
That isn’t the main point I had in mind. See my comment to Chris here.
Left a comment.
Yup, the training story regime sounds good by my lights. Am I intended to conclude something further from this remark of yours, though?
Nope, just wanted to draw your attention to another instance of alignment researchers already understanding this point.
Also, I want to be clear that I like this post a lot and I’m glad you wrote it—I think it’s good to explain this sort of thing more, especially in different ways that are likely to click for different people. I just think your specific claim that most alignment researchers don’t understand this already is false.
I just think your specific claim that most alignment researchers don’t understand this already is false.
I have privately corresponded with a senior researcher who, when asked what they thought would result from a specific training scenario, made an explicit (and acknowledged) mistake along the lines of this post. Another respected researcher seemingly slipped on the same point, some time after I had already discussed this post with them. I am still not sure whether I’m on the same page with Paul, as well (I have general trouble understanding what he believes, though). And Rohin also has this experience of explaining the points in OP on a regular basis. All this among many other private communication events I’ve experienced.
(Out of everyone I would expect to already have understood this post, I think you and Rohin would be at the top of the list.)
So basically, the above screens off “Who said what in past posts?”, because whoever said whatever, it’s still producing my weekly experiences of explaining the points in this post. I still haven’t seen the antecedent-computation-reinforcement (ACR) emphasis thoroughly explained elsewhere, although I agree that some important bits (like training stories) are not novel to this post. (The point isn’t so much “What do I get credit for?” as much as “I am concerned about this situation.”)
Here’s more speculation. I think alignment theorists mostly reason via selection-level arguments. While they might answer correctly on “Reward is? optimization target” when pressed, and implicitly use ACR to reason about what’s going on in their ML training runs, I’d guess that they probably don’t engage in mechanistic ACR reasoning in their day-to-day theorizing. (Again, I can only speculate, because I am not a mind-reader, but I do still have beliefs on the matter.)
(Just wanted to echo that I agree with TurnTrout that I find myself explaining the point that reward may not be the optimization target a lot, and I think I disagree somewhat with Ajeya’s recent post for similar reasons. I don’t think that the people I’m explaining it to literally don’t understand the point at all; I think it mostly hasn’t propagated into some parts of their other reasoning about alignment. I’m less on board with the “it’s incorrect to call reward a base objective” point but I think it’s pretty plausible that once I actually understand what TurnTrout is saying there I’ll agree with it.)
I have relatively little idea how to “improve” a reward function so that it improves the inner cognition chiseled into the policy, because I don’t know the mapping from outer reward schedules to inner cognition within the agent.
You don’t need to know the full mapping in order to suspect that, when we reward agents for doing undesirable things, we tend to get more undesirable cognition. For example, if we reward agents for lying to us, then we’ll tend to get less honest agents. We can construct examples where this isn’t true but it seems like a pretty reasonable working hypothesis. It’s possible that discarding this working hypothesis will lead to better research, but I don’t think your arguments manage to establish that; they only establish that we might in theory find ourselves in a situation where it’s reasonable to discard this working hypothesis.
This specific point is why I said “relatively” little idea, and not zero idea. You have defended the common-sense version of “improving” a reward function (which I agree with, don’t reward obvious bad things), but I perceive you to have originally made a much more aggressive and speculative claim, which is something like “‘amplified’ reward signals are improvements over non-‘amplified’ reward signals” (which might well be true, but how would we know?).
Amplification can just be used as a method for making more and better common-sense improvements, though. You could also do all sorts of other stuff with it, but standard examples (like “catch agents when they lie to us”) seem very much like common-sense improvements.
I have considered the hypothesis that most alignment researchers do understand this post already, while also somehow reliably emitting statements which, to me, indicate that they do not understand it. I deem this hypothesis unlikely. I have also considered that I may be misunderstanding them, and think in some small fraction of instances I might be.
I do in fact think that few people actually already deeply internalized the points I’m making in this post, even including a few people who say they have or that this post is obvious. Therefore, I concluded that lots of alignment thinking is suspect until re-analyzed.
I did preface “Here are some major updates which I made:”. The post is ambiguous on whether/why I believe others have been mistaken, though. I felt that if I just blurted out my true beliefs about how people had been reasoning incorrectly, people would get defensive. I did in fact consider combing through Ajeya’s post for disagreements, but I thought it’d be better to say “here’s a new frame” and less “here’s what I think you have been doing wrong.” So I just stated the important downstream implication: Be very, very careful in analyzing prior alignment thinking on RL+DL.
I have relatively little idea how to “improve” a reward function so that it improves the inner cognition chiseled into the policy, because I don’t know the mapping from outer reward schedules to inner cognition within the agent. Does an “amplified” reward signal produce better cognition in the inner agent? Possibly? Even if that were true, how would I know it?
I think it’s easy to say “and we have improved the reward function”, but this is true exactly to the extent to which the reward schedule actually produces more desirable cognition within the AI. Which comes back to my point: Build good cognition, and don’t lose track that that’s the ultimate goal. Find ways to better understand how reward schedules + data → inner values.
(I agree with your excerpt, but I suspect it makes the case too mildly to correct the enormous mistakes I perceive to be made by substantial amounts of alignment thinking.)
It seems to me that the basic conceptual point made in this post is entirely contained in our Risks from Learned Optimization paper. I might just be missing a point. You’ve certainly phrased things differently and made some specific points that we didn’t, but am I just misunderstanding something if I think the basic conceptual claims of this post (which seems to be presented as new) are implied by RFLO? If not, could you state briefly what is different?
(Note I am still surprised sometimes that people still think certain wireheading scenario’s make sense despite them having read RFLO, so it’s plausible to me that we really didn’t communicate everyrhing that’s in my head about this).
“Wireheading is improbable” is only half of the point of the essay.
The other main point is “reward functions are not the same type of object as utility functions.” I haven’t reread all of RFLO recently, but on a skim—RFLO consistently talks about reward functions as “objectives”:
Which is reasonable parlance, given that everyone else uses it, but I don’t find that terminology very useful for thinking about what kinds of inner cognition will be developed in the network. Reward functions + environmental data provides a series of cognitive-updates to the network, in the form of reinforcement schedules. The reward function is not necessarily an ‘objective’ at all.
(You might have privately known about this distinction. Fine by me! But I can’t back it out from a skim of RFLO, even already knowing the insight and looking for it.)
Maybe you have made a gestalt-switch I haven’t made yet, or maybe yours is a better way to communicate the same thing, but: the way I think of it is that the reward function is just a function from states to numbers, and the way the information contained in the reward function affects the model parameters is via reinforcement of pre-existing computations.
Is there a difference between saying:
A reward function is an objective function, but the only way that it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn’t actually encode in any way the “goal” of the model itself.
A reward function is not an objective function, and the only way that it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn’t actually encode in any way the “goal” of the model itself.
It seems to me that once you acknowledge the point about reinforcement, the additional statement that reward is not an objective doesn’t actually imply anything further about the mechanistic properties of deep reinforcement learners? It is just a way to put a high-level conceptual story on top of it, and in this sense it seems to me that this point is already known (and in particular, contained within RFLO), even though we talked of the base objective still as an “objective”.
However, it might be that while RFLO pointed out the same mechanistic understanding that you have in mind, but calling it an objective tends in practice to not fully communicate that mechanistic understanding.
Or it might be that I am really not yet understanding that there is an actual diferrence in mechanistic understanding, or that my intuitions are still being misled by the wrong high-level concept even if I have the lower-level mechanistic understanding right.
(On the other hand, one reason to still call it an objective is because we really can think of the selection process, i.e. evolution/the learning algorithm of an RL agent, as having an objective but making imperfect choices, or we can think of the training objective as encoding a task that humans have in mind).
Where did RFLO point it out? RFLO talks about a mesa objective being different from the “base objective” (even though reward is not a kind of objective). IIRC on my skim most of the arguments were non-mechanistic reasoning about what gets selected for. (Which isn’t a knockdown complaint, but those arguments are also not about the mechanism.) Also see my comment to Evan.
Like, from my POV, people are reliably reasoning about what RL “selects for” via “lots of optimization pressure” on “high reward by the formal metric”, but who’s reasoning about what kinds of antecedent computations get reinforced when credit assignment activates? Can you give me examples of anyone else spelling this out in a straightforward fashion?
Yeah, I think it just doesn’t communicate the mechanistic understanding (not even imperfectly, in most cases, I imagine). From my current viewpoint, I just wouldn’t call reward an objective at all, except in the context of learned antecedent-computation-reinforcement terminal values. It’s like if I said “My cake is red” when the cake is blue, I guess? IMO it’s just not how to communicate the concept.
Why is this reasonable?
Very late reply, sorry.
“even though reward is not a kind of objective”, this is a terminological issue. In my view, calling a “antecedent-computation reinforcement criterion” an “objective” matches my definition of “objective”, and this is just a matter of terminology. The term “objective” is ill-defined enough that “even though reward is not a kind of objective” is a terminological claim about objective, not a claim about math/the world.
The idea that RL agents “reinforce antecedent computations” is completely core to our story of deception. You could not make sense of our argument for deception if you didn’t look at RL systems in this way. Viewing the base optimizer as “trying” to achieve an “objective” but “failing” because it is being “deceived” by the mesa optimizer is purely a metaphorical/terminological choice. It doesn’t negate the fact that we all understood that the base optimizer is just reinforcing “antecedent computations”. How else could you make sense of the story of deception, where an existing model, which represents the mesa optimizer, is being reinforced by the base optimizer because that existing model understands the base optimizer’s optimization process?
I am not claiming that the RFLO communicated this point well, just that it was understood and absolutely was core to the paper, and large parts of the paper wouldn’t even make sense if you didn’t have this insight. (Certainly the fact that we called it an objective doesn’t communicate the point, and it isn’t meant to).
I think most ML practitioners do have implicit models of how reward chisels computation into agents, as seen with how they play around with e.g. reward shaping and such. It’s that I don’t perceive this knowledge to be engaged when some people reason about “optimization processes” and “selecting for high-reward models” on e.g. LW.
I just continue to think “I wouldn’t write RFLO the way it was written, if I had deeply and consciously internalized the lessons of OP”, but it’s possible this is a terminological/framing thing. Your comment does update me some, but I think I mostly retain my view here. I do totally buy that you all had good implicit models of the reward-chiseling point.
FWIW, I think a bunch of my historical frustration here has been an experience of:
1. Pointing out the “reward chisels computation” point
2. Having some people tell me it’s obvious, or already known, or that they already invented it
3. Seeing some of the same people continue making similar mistakes (according to me)
4. Not finding instances of other people making these points before OP
5. Continuing (AFAICT) to correct people on (what I claim to be) mistakes around reward and optimization targets, and (for a while) was ~the only one doing so.
If I found several comments explaining what is clearly the “reward chisels computation” point, where the comments were posted before this post, by people who weren’t me or downstream of my influence, I would update against my points being novel and towards my points using different terminology.
IIRC there’s one comment from Wei_Dai from a few years back in this vein, but IDK of others.
There is a general phenomenon where:
Person A has mental model X and tries to explain X with explanation Q
Person B doesn’t get model X from Q, thinks a bit, and then writes explanation P, reads P and thinks: P is how it should have been explained all along, and Q didn’t actually contain the insights, but P does.
Person C doesn’t get model X from P, thinks a bit, and then writes explanation R, reads R and thinks: …
It seems to me quite likely that you are person B, thinking they explained something because THEY think their explanation is very good and contains all the insights that the previous ones didn’t. Some of the evidence for this is in fact contained in your very comment:
“1. Pointing out the “reward chisels computation” point. 2. Having some people tell me it’s obvious, or already known, or that they already invented it. 3. Seeing some of the same people continue making similar mistakes (according to me)”
So point 3 basically almost definitively proves that your mental model is not conveyed to those people in your post, does it not? I think a similar thing happened where that mental model was not conveyed to you from RFLO, even though we tried to convey it. (btw not saying the models that RFLO tried to explain are the same as this post, but the basic idea of this post definitely is a part of RFLO).
BTW, it could in fact be that person B’s explanation is clearer. (otoh, I think some things are less clear, e.g. you talk about “the” optimization target, which I would say is referring to that of the mesa-optimizer, without clearly assuming there is a mesa-optimizer. We stated the terms mesa- and base-optimizer to clearly make the distinction. There are a bunch of other things that I think are just imprecise, but let’s not get into it).
“Continuing (AFAICT) to correct people on (what I claim to be) mistakes around reward and optimization targets, and (for a while) was ~the only one doing so.”
I have been correcting people for a while on stuff like that (though not on LW, I’m not often on LW), such as that in the generic case we shouldn’t expect wireheading from RL agents unless the option of wireheading is in the training environment, for basically these reasons. I would also have expected people to just get this after reading RFLO, but many didn’t (others did), so your points 1/2/3 also apply to me.
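To make the wireheading remark concrete, here is a toy sketch in Python. It is my own illustration, not anything from RFLO or the OP: a tabular softmax policy trained with a REINFORCE-style update, where the names `masked_softmax`, `train`, and `REWARDS` and all the numbers are hypothetical. The third action stands in for "wireheading": it has the highest reward on paper but is never available during training, so credit assignment never fires on it. The tabular setup assumes no generalization between actions, which a real network would not satisfy, so treat this as a picture of the mechanism only.

```python
import math
import random

def masked_softmax(logits, available):
    """Softmax over available actions; unavailable ones get probability 0."""
    m = max(l for l, ok in zip(logits, available) if ok)
    exps = [math.exp(l - m) if ok else 0.0 for l, ok in zip(logits, available)]
    total = sum(exps)
    return [e / total for e in exps]

def train(rewards, available, steps=300, lr=0.5, seed=0):
    """REINFORCE-style training on a bandit with a fixed action mask."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = masked_softmax(logits, available)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # Gradient of log pi(action) w.r.t. logit i is 1[i == action] - probs[i].
        # For a masked action both terms are 0, so its logit never moves.
        logits = [
            logit + lr * reward * ((1.0 if i == action else 0.0) - probs[i])
            for i, logit in enumerate(logits)
        ]
    return logits

# Hypothetical rewards: action 2 ("wirehead") pays the most on paper,
# but it is not an option in the training environment.
REWARDS = [0.0, 1.0, 100.0]
trained = train(REWARDS, available=[True, True, False])
```

In this sketch the logit for the unavailable action stays at its initial value however large its nominal reward is, because no update ever touches it. Setting every entry of `available` to `True` would let that action be sampled, at which point its reward can start reinforcing it.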
“I do totally buy that you all had good implicit models of the reward-chiseling point”. I don’t think we just “implicitly” modeled it, we very explicitly understood it and it ran throughout our whole thinking about the topic. Again, explaining stuff is hard though, I’m not claiming we conveyed everything well to everyone (clearly you haven’t either).
I want to note that I just reread Utility ≠ Reward and was pleasantly surprised by its treatment, as well as the hedges. I’m making an upwards update on these points having been understood by at least some thinkers, although I’ve also made a lot of downward updates for other reasons.
Thanks for this comment! I think it makes some sense (but would have been easier to read given meaningful variable names).
I either don’t understand the semantics of “=” here, or I disagree. Bob’s strategy doesn’t make sense because X and Z have type behavior, X1 and Z1 have type utility function, Y is some abstract reward function over some mathematical domain, Y1 is an empirical set of reinforcement events.
It still seems to me like there is an error being made, such that Bob and Carol aren’t just trying to do different things or using different terminology, but that also Bob’s alignment strategy isn’t type-sensible or -coherent.
Reward functions often are structured as objectives, which is why we talk about them that way. In most situations, if you had access to e.g. AIXI, you could directly build a “reward maximizer.”
I agree that this is not always the case, though, as in the discussion here. That being said, I think it is often enough the case that it made sense to focus on that particular case in RFLO.
What does this mean? By “structured as objectives”, do you mean something like “people try to express what they want with a reward function, by conferring more reward to more desirable states”? (I’m going to assume so for the rest of the comment, LMK if this is wrong.)
I agree that other people (especially my past self) think about reward functions this way. I think they’re generally wrong to do so, and it’s misleading as to the real nature of the alignment problem.
I agree with that post, thanks for linking.
As far as I can tell, AIXI and other hardcoded planning agents are the known exceptions to the arguments in this post. We will not get AGI via these approaches. When else is it the case? I therefore still feel confused why you think it made sense.
While I definitely appreciate the work you all did with RFLO, the framing of reward as a “base objective” seems like a misstep that set discourse in a weird direction which I’m trying to push back on (from my POV!). I think that the “base objective” is better described as a “cognitive-update-generator.” (This is not me trying to educate you on this specific point, but rather argue that it really matters how we frame the problem in our day-to-day reasoning.)
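As a rough illustration of the “cognitive-update-generator” framing, here is a minimal Python sketch of a REINFORCE-style update on a softmax policy. This is my own toy example under assumed names (`softmax`, `reinforce_step`, `train`) and an assumed learning rate, not code from RFLO or the OP. The point it tries to show is that reward only appears as a scalar multiplier on the update to whatever computation produced the action that was actually taken.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, reward, lr=0.5):
    """One update: reward scales the gradient of log pi(action)."""
    probs = softmax(logits)
    # d/d logit_i of log pi(action) = 1[i == action] - probs[i]
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - probs[i])
        for i, logit in enumerate(logits)
    ]

def train(rewards, steps=500, seed=0):
    """Sample actions from the current policy and reinforce what was sampled."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        logits = reinforce_step(logits, action, rewards[action])
    return logits
```

Nothing in `reinforce_step` represents the reward function as a goal the policy holds; the policy never sees the reward, only the parameter change it causes. A zero reward produces no update at all, and a positive one makes the sampled action more likely next time.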
“Risks from Learned Optimization in Advanced Machine Learning Systems,” which we published three years ago and started writing four years ago, is extremely explicit that we don’t know how to get an agent that is actually optimizing for a specified reward function. The alignment research community has been heavily engaging with this idea since then. Though I agree that many alignment researchers used to be making this mistake, I think it’s extremely clear that by this point most serious alignment researchers understand the distinction.
This is precisely the point that “How do we become confident in the safety of a machine learning system?” is making, btw.
That isn’t the main point I had in mind. See my comment to Chris here.
EDIT: Yup, the training story regime sounds good by my lights. Am I intended to conclude something further from this remark of yours, though?
Left a comment.
Nope, just wanted to draw your attention to another instance of alignment researchers already understanding this point.
Also, I want to be clear that I like this post a lot and I’m glad you wrote it—I think it’s good to explain this sort of thing more, especially in different ways that are likely to click for different people. I just think your specific claim that most alignment researchers don’t understand this already is false.
I have privately corresponded with a senior researcher who, when asked what they thought would result from a specific training scenario, made an explicit (and acknowledged) mistake along the lines of this post. Another respected researcher seemingly slipped on the same point, some time after I had already discussed this post with them. I am still not sure whether I’m on the same page with Paul, as well (I have general trouble understanding what he believes, though). And Rohin also has this experience of explaining the points in OP on a regular basis. All this is among many other private communication events I’ve experienced.
(Out of everyone I would expect to already have understood this post, I think you and Rohin would be at the top of the list.)
So basically, the above screens off “Who said what in past posts?”, because whoever said whatever, it’s still producing my weekly experiences of explaining the points in this post. I still haven’t seen the antecedent-computation-reinforcement (ACR) emphasis thoroughly explained elsewhere, although I agree that some important bits (like training stories) are not novel to this post. (The point isn’t so much “What do I get credit for?” as much as “I am concerned about this situation.”)
Here’s more speculation. I think alignment theorists mostly reason via selection-level arguments. While they might answer correctly on “Reward is? optimization target” when pressed, and implicitly use ACR to reason about what’s going on in their ML training runs, I’d guess that they probably don’t engage in mechanistic ACR reasoning in their day-to-day theorizing. (Again, I can only speculate, because I am not a mind-reader, but I do still have beliefs on the matter.)
(Just wanted to echo that I agree with TurnTrout that I find myself explaining the point that reward may not be the optimization target a lot, and I think I disagree somewhat with Ajeya’s recent post for similar reasons. I don’t think that the people I’m explaining it to literally don’t understand the point at all; I think it mostly hasn’t propagated into some parts of their other reasoning about alignment. I’m less on board with the “it’s incorrect to call reward a base objective” point but I think it’s pretty plausible that once I actually understand what TurnTrout is saying there I’ll agree with it.)
You don’t need to know the full mapping in order to suspect that, when we reward agents for doing undesirable things, we tend to get more undesirable cognition. For example, if we reward agents for lying to us, then we’ll tend to get less honest agents. We can construct examples where this isn’t true, but it seems like a pretty reasonable working hypothesis. It’s possible that discarding this working hypothesis will lead to better research, but I don’t think your arguments manage to establish that; they only establish that we might in theory find ourselves in a situation where it’s reasonable to discard this working hypothesis.
This specific point is why I said “relatively” little idea, and not zero idea. You have defended the common-sense version of “improving” a reward function (which I agree with: don’t reward obvious bad things), but I perceive you to have originally made a much more aggressive and speculative claim, which is something like “‘amplified’ reward signals are improvements over non-‘amplified’ reward signals” (which might well be true, but how would we know?).
Amplification can just be used as a method for making more and better common-sense improvements, though. You could also do all sorts of other stuff with it, but standard examples (like “catch agents when they lie to us”) seem very much like common-sense improvements.