I wasn’t around in the community in 2010-2015, so I don’t know what the state of RL knowledge was at that time. However, I dispute the claim that rationalists “completely miss[ed] this [..] interpretation”:
To be honest, it was a major blackpill for me to see the rationalist community, whose whole founding premise was that they were supposed to be good at making efficient use of the available evidence, so completely missing this very straightforward interpretation of RL [..] the mechanistic function of per-trajectory rewards in a given batched update was to provide the weights of a linear combination of the trajectory gradients.
Ever since I entered the community in 2016, I’ve definitely heard people talk about policy gradient as “upweighting trajectories with positive reward/downweighting trajectories with negative reward”, albeit in person. I remember being shown a picture sometime in 2016/17 that looked something like this when someone (maybe Paul?) was explaining REINFORCE to me (I couldn’t find it, so I’m reconstructing it from memory):
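(Presumably the content was the standard REINFORCE estimator, read as a reward-weighted linear combination of per-trajectory gradients; the equation below is my reconstruction of the idea, not the original figure.)

$$\nabla_\theta J(\theta) \;\approx\; \frac{1}{N}\sum_{i=1}^{N} R(\tau_i)\,\nabla_\theta \log \pi_\theta(\tau_i)$$

Each trajectory’s gradient $\nabla_\theta \log \pi_\theta(\tau_i)$ enters the batched update with weight $R(\tau_i)$: positive-reward trajectories get upweighted, negative-reward trajectories get downweighted.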
In addition, I would be surprised if any of the PhD students at CHAI during my time there (2017–2021), many of whom had taken deep RL classes at Berkeley, missed this “upweight trajectories in proportion to their reward” interpretation. Most of us had also implemented various RL algorithms from scratch, and there the “weighting trajectory gradients” perspective pops out immediately.
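For concreteness, here is a minimal from-scratch sketch of a batched REINFORCE update (my own illustrative code, with made-up shapes and names, not anything from CHAI or my slides); the per-trajectory return is literally the weight on that trajectory’s gradient:

```python
# Minimal sketch of a batched REINFORCE update (illustrative only; the
# environment interface, shapes, and names here are my own assumptions).
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))  # toy policy
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reinforce_update(trajectories):
    """trajectories: list of (obs, acts, ret) tuples, with obs of shape (T, 4),
    acts of shape (T,), and ret a scalar per-trajectory return."""
    per_traj_losses = []
    for obs, acts, ret in trajectories:
        dist = torch.distributions.Categorical(logits=policy(obs))
        logp = dist.log_prob(acts).sum()  # log-probability of the whole trajectory
        # The return `ret` is the weight on this trajectory's gradient:
        # grad of (-ret * logp) = -ret * grad(logp).
        per_traj_losses.append(-ret * logp)
    loss = torch.stack(per_traj_losses).mean()
    optimizer.zero_grad()
    loss.backward()  # batched update = return-weighted sum of trajectory gradients
    optimizer.step()
```

When you write it yourself like this, the “rewards weight trajectory gradients” reading is right there in the loss.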
As another data point, when I taught MLAB/WMLB in 2022/23, my slides also contained this interpretation of REINFORCE (after deriving it) in so many words:
Insofar as people are making mistakes about reward and RL, it’s not due to having never been exposed to this perspective.
That being said, I do agree that there’s been substantial confusion in this community, mainly of two kinds:
Confusing the objective function being optimized to train a policy with how the policy is mechanistically implemented: Just because the outer loop is modifying/selecting for a policy to score highly on some objective function, does not necessarily mean that the resulting policy will end up selecting actions based on said objective.
Confusing “this policy is optimized for X” with “this policy is optimal for X”: this is the actual mistake I think Bostrom is making in Alex’s example—it’s true that an agent that wireheads achieves higher reward than it got on the training distribution (and the agent that is optimal for the reward achieves at least as much reward as the wireheader). And I think that Alex and you would also agree with me that it’s sometimes valuable to reason about the global optima in policy space. But it’s a mistake to identify the outputs of optimization with the optimal solution to an optimization problem, and many people were making this jump without noticing it.
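As a toy illustration of the gap between “optimized for” and “optimal for” (my own example, not Alex’s or Bostrom’s): gradient descent on a non-convex objective happily returns a local optimum even when a strictly better global optimum exists.

```python
# Gradient descent on f(x) = (x^2 - 1)^2 + 0.3x, which has two basins.
# Started from x = 0.5, it converges to the worse, local minimum near x ~ +0.96
# (f ~ +0.29), even though the global minimum sits near x ~ -1.04 (f ~ -0.31).
# The result is "optimized for f" but not "optimal for f".
def f(x):
    return (x ** 2 - 1) ** 2 + 0.3 * x

def grad_f(x):
    return 4 * x * (x ** 2 - 1) + 0.3

x = 0.5
for _ in range(2000):
    x -= 0.01 * grad_f(x)

print(x, f(x))  # ~0.96, ~0.29: a local, not global, optimum
```

The same gap applies, much more forcefully, to SGD on a neural network policy: what you get reflects the optimization trajectory, not the argmax of the objective.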
Again, I contend these confusions were not due to a lack of exposure to the “rewards as weighting trajectories” perspective. Instead, here are the reasons I remember hearing back in 2017–2018 for why people jumped from “RL is optimizing agents for X” to “RL outputs agents that both optimize X and are optimal for X”:
We’d be really confused if we couldn’t reason about “optimal” agents, so we should solve that first. This is the main justification I heard from the MIRI people for why they studied idealized agents. Oftentimes globally optimal solutions are easier to reason about than local optima or saddle points, or are useful for operationalizing concepts. Because a lot of the community was focused on philosophical deconfusion (often with minimal knowledge of ML or RL), many people naturally ended up eliding the gap between “the thing we’re studying” and “the thing we care about”.
Reasoning about optima gives a better picture of powerful, future AGIs. Insofar as we’re far from transformative AI, you might expect that current AIs are a poor model for how transformative AI will look. In particular, you might expect that modeling transformative AI as optimal leads to clearer reasoning than analogizing it to current systems. This point has become increasingly tenuous since GPT-2, but it seemed much more defensible at the time.
Some off-policy RL algorithms are well described as having a “reward”-maximizing component: and these were the approaches that people were using and thinking about at the time. For example, the most hyped results in deep learning in the mid-2010s were probably DQN and AlphaGo/AlphaGo Zero/AlphaZero. And many people believed that future AIs would be implemented via model-based RL. All of these approaches result in policies that contain an internal component that searches for actions maximizing some learned objective (see the sketch after this list). Given that ~everyone uses policy gradient variants for RL on SOTA LLMs, this turned out to be incorrect ex post. But if the most impressive AIs seem to be implemented in ways that correspond to internal reward maximization, it does seem very understandable to think of AGIs as explicit reward optimizers.
This is how many RL pioneers reasoned about their algorithms. I agree with Alex that this probably comes from the field’s control theory roots, where a PID controller is well modeled as picking trajectories that minimize cost, in a way that early simple RL policies are not well modeled as internally picking trajectories that maximize reward.
Also, sometimes it is just the words being similar; it can be hard to keep track of the differences between “optimizing for”, “optimized for”, and “optimal for” in normal conversation.
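To make the third point concrete, here is a toy sketch of the contrast (hypothetical networks and names, purely illustrative): a DQN-style policy acts by explicitly maximizing a learned value estimate, whereas a policy-gradient policy just samples from the distribution its parameters encode, with no internal search at inference time.

```python
# Illustrative only: contrasting how a value-based (DQN-style) policy and a
# policy-gradient policy select actions. Networks and shapes are made up.
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))   # learned Q(s, .)
pi_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))  # policy logits

def dqn_act(obs):
    # Contains an explicit maximization over a learned objective:
    # choose the action with the highest estimated Q-value.
    return torch.argmax(q_net(obs)).item()

def pg_act(obs):
    # No internal maximization: just sample from the action distribution.
    return torch.distributions.Categorical(logits=pi_net(obs)).sample().item()
```

If the most impressive systems all look like `dqn_act` (plus explicit search in the model-based case), reading them as internally maximizing a learned objective is much more natural than it is for `pg_act`.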
I think if you want to prevent the community from repeating these confusions, this looks less like “here’s an alternative perspective through which you can view policy gradient” and more like “here’s why reasoning about AGIs as ‘optimal’ agents is misleading” and “here’s why reasoning about your one-hidden-layer neural network policy as if it were optimizing the reward is bad”.
An aside:
In general, I think that many ML-knowledgeable people (arguably myself included) correctly notice that the community is making many reasoning mistakes, which they resolve internally using ML terminology or frames from the ML literature. But without reasoning carefully about the problem, the terminology or frames themselves are insufficient to resolve the confusion. (Notice how many deep RL people make the same mistake!) And, as Alex and you have argued before, the standard ML frames and terminology introduce their own confusions (e.g. ‘attention’).
A shallow understanding of “policy gradient is just upweighting trajectories” may in fact lead to the opposite mistake: assuming that it can never produce intelligent, optimizer-y behavior. (Again, notice how many ML academics made exactly this mistake.) Or, more broadly, thinking about ML algorithms purely from the low-level, mechanistic frame can lead to confusions along the lines of “next token prediction can only lead to statistical parrots without true intelligence”. Doubly so if you’ve only worked with policy gradient or language modeling with tiny models.
Ever since I entered the community in 2016, I’ve definitely heard people talk about policy gradient as “upweighting trajectories with positive reward/downweighting trajectories with negative reward”, albeit in person. I remember being shown a picture sometime in 2016/17 that looked something like this when someone (maybe Paul?) was explaining REINFORCE to me (I couldn’t find it, so I’m reconstructing it from memory):
Knowing how to reason about “upweighting trajectories” when explicitly prompted or in narrow contexts of algorithmic implementation is not sufficient to conclude “people basically knew this perspective” (but it’s certainly evidence). See Outside the Laboratory:
Now suppose we discover that a Ph.D. economist buys a lottery ticket every week. We have to ask ourselves: Does this person really understand expected utility, on a gut level? Or have they just been trained to perform certain algebra tricks?
Knowing “vanilla PG upweights trajectories”, and being able to explain the math—this is not enough to save someone from the rampant reward confusions. Certainly Yoshua Bengio could explain vanilla PG, and yet he goes on about how RL (almost certainly, IIRC) trains reward maximizers.
I contend these confusions were not due to a lack of exposure to the “rewards as weighting trajectories” perspective.
I personally disagree—although I think your list of alternative explanations is reasonable. If alignment theorists had been using this (simple and obvious-in-retrospect) “reward chisels circuits into the network” perspective, if they had really been using it and felt it deep within their bones, I think they would not have been particularly tempted by this family of mistakes.
What’s the difference between “Alice is falling victim to confusions/reasoning mistakes about X” and “Alice disagrees with me about X”?
I feel like using the former puts undue social pressure on observers to conclude that you’re right, and makes it less likely they correctly adjudicate between the perspectives.
(Perhaps you can empathise with me here, since arguably certain people taking this sort of tone is one of the reasons AI x-risk arguments have not always been vetted as carefully as they should have been!)
What’s the difference between “Alice is falling victim to confusions/reasoning mistakes about X” and “Alice disagrees with me about X”?
I suspect that, for Alex Turner, writing the former instead of the latter is a signal that he thinks he has identified the specific confusion/reasoning mistake his interlocutor is engaged in, likely as a result of having seen closely analogous arguments in the past from other people who turned out (or even admitted) to be confused about these matters after conversations with him.
Do you have a reference to the problematic argument that Yoshua Bengio makes?