I think fewer other people were making this mistake than you expect (including people in the standard field of RL)
I think that few people understand these points already. If RL professionals did understand this point, there would be pushback on Reward is Enough from RL professionals pointing out that reward is not the optimization target. After 15 minutes of searching, I found no one making the counterpoint. I mean, that thesis is just so wrong, and it’s by famous researchers, and no one points out the obvious error.
RL researchers don’t get it.[1] It’s not complicated to me.
(Do you know of any instance at all of someone else (outside of alignment) making the points in this post?)
for reasons that Paul laid out above.
I’m currently not convinced by (or perhaps not properly understanding) Paul’s counterpoints.
Although I’ll flag that we might mean different things by “getting it”: by my lights, “getting it” means “not consistently emitting statements which contravene the points of this post,” while you might count “if pressed on the issue, will admit reward is not the optimization target” as “getting it.”
The way I attempt to avoid confusion is to distinguish between the RL algorithm’s optimization target and the RL policy’s optimization target, and then avoid talking about the “RL agent’s” optimization target, since that’s ambiguous between the two meanings. I dislike the title of this post because it implies that there’s only one optimization target, which exacerbates this ambiguity. I predict that if you switch to using this terminology, and then start asking a bunch of RL researchers questions, they’ll tend to give broadly sensible answers (conditional on taking on the idea of “RL policy’s optimization target” as a reasonable concept).
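To make the distinction concrete, here is a minimal sketch, assuming a linear-softmax policy trained with vanilla REINFORCE (the setup and names are illustrative, not taken from the post or the paper). The policy’s forward computation never references reward; the algorithm’s update rule is the only place where “maximize reward” appears.

```python
import numpy as np

# Minimal illustrative sketch: a linear-softmax policy plus a vanilla REINFORCE update.
# theta has shape (num_actions, obs_dim); obs has shape (obs_dim,).

def policy(theta, obs):
    """The RL *policy*: maps an observation to action probabilities.
    Reward appears nowhere in this computation."""
    logits = theta @ obs
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

def reinforce_update(theta, trajectory, lr=0.01):
    """The RL *algorithm*: its update rule is where 'maximize expected return' lives.
    `trajectory` is a list of (obs, action, return-to-go) tuples."""
    grad = np.zeros_like(theta)
    for obs, action, ret in trajectory:
        probs = policy(theta, obs)
        # gradient of log pi(action | obs) for a linear-softmax policy
        dlog = np.outer(-probs, obs)
        dlog[action] += obs
        # the return enters only as a scalar weight on this gradient
        grad += ret * dlog
    return theta + lr * grad
```

On this framing, the algorithm’s optimization target (expected return) is explicit in reinforce_update, while whatever the trained policy ends up “optimizing for” is an emergent property of the parameters that this process happens to produce.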
Authors’ summary of the “reward is enough” paper:
In this paper we hypothesise that the objective of maximising reward is enough to drive behaviour that exhibits most if not all attributes of intelligence that are studied in natural and artificial intelligence, including knowledge, learning, perception, social intelligence, language and generalisation. This is in contrast to the view that specialised problem formulations are needed for each attribute of intelligence, based on other signals or objectives. The reward-is-enough hypothesis suggests that agents with powerful reinforcement learning algorithms when placed in rich environments with simple rewards could develop the kind of broad, multi-attribute intelligence that constitutes an artificial general intelligence.
I think this is consistent with your claims, because reward can be enough to drive intelligent-seeming behavior whether or not it is the target of learned optimization. Can you point to the specific claim in this summary that you disagree with? (or a part of the paper, if your disagreement isn’t captured in this summary).
More generally, consider the analogy to evolution. I view your position as analogous to saying: “hey, genetic fitness is not the optimization target of humans, therefore genetic fitness is not the optimization target of evolution”. The idea that genetic fitness is not the optimization target of humans is an important insight, but it’s clearly unhelpful to jump to “and therefore evolutionary biologists who talk about evolution optimizing for genetic fitness just don’t get it”, which seems analogous to what you’re doing in this post.
Importantly, reward does not magically spawn thoughts about reward, and reinforce those reward-focused thoughts! Just because common English endows “reward” with suggestive pleasurable connotations, that does not mean that an RL agent will terminally value reward!
Sufficiently intelligent RL policies will have the concept of reward because they understand many facts about machine learning and their own situation, and (if deceptively aligned) will think about reward a bunch. There may be some other argument for why this concept won’t get embedded as a terminal goal, but the idea that it needs to be “magically spawned” is very strawmanny.
Actually, although I did recheck the Reward is Enough paper, I think I misunderstood part of it in a way that wasn’t obvious to me as I reread, which makes the paper much less egregious than I’d claimed. I am updating toward: you are correct, and I am not spending enough effort on favorably interpreting existing discourse.
I still disagree with parts of that essay and still think Sutton & co don’t understand the key points. I still think you underestimate how much people don’t get these points. I am provisionally retracting the comment you replied to while I compose a more thorough response (may be a little while).
Sufficiently intelligent RL policies will have the concept of reward because they understand many facts about machine learning and their own situation, and (if deceptively aligned) will think about reward a bunch. There may be some other argument for why this concept won’t get embedded as a terminal goal, but the idea that it needs to be “magically spawned” is very strawmanny.
Agreed on both counts for your first sentence.
The “and” in “reward does not magically spawn thoughts about reward, and reinforce those reward-focused thoughts” is doing important work; “magically” is meant to apply to the conjunction of the clauses. I added the second clause in order to pre-empt this objection. Maybe I should have added “reinforce those reward-focused thoughts into terminal values.” Would that have been clearer? (I also have gone ahead and replaced “magically” with “automatically.”)
Hmm, perhaps clearer to say “reward does not automatically reinforce reward-focused thoughts into terminal values”, given that we both agree that agents will have thoughts about reward either way.
But if you agree that reward gets reinforced as an instrumental value, then I think your claims here probably need to actually describe the distinction between terminal and instrumental values. And this feels pretty fuzzy—e.g. in humans, I think the distinction is actually not that clear-cut.
In other words, if everyone agrees that reward likely becomes a strong instrumental value, then this seems like a prima facie reason to think that it’s also plausible as a terminal value, unless you think the processes which give rise to terminal values are very different from the processes which give rise to instrumental values.