In fact, PPO is essentially a tweaked version of REINFORCE,
Valid point.
Beyond PPO and REINFORCE, this “x as learning rate multiplier” pattern is actually extremely common in different RL formulations. From lecture 7 of David Silver’s RL course:
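(The summary being pointed at is, roughly, the following family of updates, which all share the same gradient-of-log-policy term and differ only in the quantity multiplying it:)

$$
\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\, v_t\right] && \text{(REINFORCE)} \\
&= \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\, Q^{w}(s,a)\right] && \text{(Q actor-critic)} \\
&= \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\, A^{w}(s,a)\right] && \text{(advantage actor-critic)} \\
&= \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\, \delta\right] && \text{(TD actor-critic)}
\end{aligned}
$$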
Critically though, none of Q, A, or delta denotes reward. Rather, they are quantities intended to estimate the effect of an action on the sum of future rewards; hence while pure REINFORCE doesn’t really maximize the sum of rewards, these other algorithms are attempts to do so more consistently, and the existence of such attempts suggests it’s likely we will see further, better attempts in the future.
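A minimal sketch of that multiplier pattern (my own toy illustration, not code from the lecture), using NumPy and a tabular softmax policy; the algorithms share one update rule and differ only in what `multiplier` is bound to:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(logits, action):
    """Gradient of log pi(action | s) with respect to the logits of a softmax policy."""
    probs = softmax(logits)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    return one_hot - probs

def policy_gradient_step(logits, action, multiplier, lr=0.01):
    """One policy-gradient step; only `multiplier` changes across algorithms:
       REINFORCE:              multiplier = sampled return G_t
       Q actor-critic:         multiplier = Q_w(s_t, a_t)
       advantage actor-critic: multiplier = A_w(s_t, a_t)
       TD actor-critic:        multiplier = delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    None of these multipliers is the reward itself; they are estimates of an
    action's effect on the sum of future rewards.
    """
    return logits + lr * multiplier * grad_log_pi(logits, action)
```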
It was published in 1992, a full 22 years before Bostrom’s book.
Bostrom’s book explicitly states what kinds of reinforcement learning algorithms he had in mind, and they are not REINFORCE:
Often, the learning algorithm involves the gradual construction of some kind of evaluation function, which assigns values to states, state–action pairs, or policies. (For instance, a program can learn to play backgammon by using reinforcement learning to incrementally improve its evaluation of possible board positions.) The evaluation function, which is continuously updated in light of experience, could be regarded as incorporating a form of learning about value. However, what is being learned is not new final values but increasingly accurate estimates of the instrumental values of reaching particular states (or of taking particular actions in particular states, or of following particular policies). Insofar as a reinforcement-learning agent can be described as having a final goal, that goal remains constant: to maximize future reward. And reward consists of specially designated percepts received from the environment. Therefore, the wireheading syndrome remains a likely outcome in any reinforcement agent that develops a world model sophisticated enough to suggest this alternative way of maximizing reward.
Similarly, before I even got involved with alignment or rationalism, the canonical reinforcement learning algorithm I had heard of was TD, not REINFORCE.
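For concreteness, a toy sketch (my own, not from the book) of the evaluation-function style of RL that passage describes: a TD(0) learner incrementally improving a table of state values from experience.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step on a dict of state values: nudge V[s] toward the
    bootstrapped target r + gamma * V[s_next]."""
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return V
```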
It also has a bad track record in ML, as the core algorithmic structure of RL algorithms capable of delivering SOTA results has not changed that much in over 3 decades.
Huh? DreamerV3 is clearly a step in the direction of utility maximization (away from “reward is not the optimization target”), and it claims to set SOTA on a bunch of problems. Are you saying there’s something wrong with their evaluation?
In fact, just recently Cohere published Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs, which found that the classic REINFORCE algorithm actually outperforms PPO for LLM RLHF finetuning.
LLM RLHF finetuning doesn’t build new capabilities, so it should be ignored for this discussion.
Finally, this counterpoint seems irrelevant for Alex’s point in this post, which is about historical alignment arguments about historical RL algorithms. He even included disclaimers at the top about this not being an argument for optimism about future AI systems.
It’s not irrelevant. The fact that Alex Turner explicitly replies to Nick Bostrom and calls his statement nonsense means that Alex Turner does not get to use a disclaimer to decide what the subject of discussion is. Rather, the subject of discussion is whatever Bostrom was talking about. The disclaimer instead serves as a way of turning our attention away from stuff like DreamerV3 and towards stuff like DPO. However, DreamerV3 seems like a closer match for Bostrom’s discussion than DPO is, so turning our attention away from it can only be valid if we assume DreamerV3 is a dead end and DPO is the only future.
This is actually pointing to the difference between online and offline learning algorithms, not RL versus non-RL learning algorithms.
I was kind of pointing to both at once.
In contrast, offline RL is surprisingly stable and robust to reward misspecification.
Seems to me that the linked paper makes the argument “If you don’t include attempts to try new stuff in your training data, you won’t know what happens if you do new stuff, which means you won’t see new stuff as a good opportunity”. Which seems true but also not very interesting, because we want to build capabilities to do new stuff, so this should instead make us update to assume that the offline RL setup used in this paper won’t be what builds capabilities in the limit. (Not to say that they couldn’t still use this sort of setup as a component other than the one that builds the capabilities, or that they couldn’t come up with an offline RL method that does want to try new stuff—merely that this particular argument for safety bears too heavy an alignment tax to carry us on its own.)
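To make the “won’t try new stuff” mechanism concrete, here is a toy sketch of one common offline-RL design (a TD3+BC-style behavior-cloning penalty; my own illustration, not necessarily the method in the linked paper, and `policy` / `critic` are hypothetical callables):

```python
import torch

def offline_actor_loss(policy, critic, states, dataset_actions, bc_weight=2.5):
    """Maximize estimated value while staying close to actions present in the dataset."""
    actions = policy(states)                                 # actions the policy proposes
    q = critic(states, actions)                              # critic's estimate of their value
    bc_penalty = ((actions - dataset_actions) ** 2).mean()   # distance from the logged actions
    # The behavior-cloning term keeps the learned policy near the data
    # distribution, i.e. it explicitly discourages "trying new stuff".
    return -q.mean() + bc_weight * bc_penalty
```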
“If you don’t include attempts to try new stuff in your training data, you won’t know what happens if you do new stuff, which means you won’t see new stuff as a good opportunity”. Which seems true but also not very interesting, because we want to build capabilities to do new stuff, so this should instead make us update to assume that the offline RL setup used in this paper won’t be what builds capabilities in the limit.
I’m sympathetic to this argument (and think the paper overall isn’t super object-level important), but also note that they train e.g. Hopper policies to hop continuously, even though lots of the demonstrations fall over. That’s something new.
I mean sure, it can probably generalize slightly beyond the boundary of its training data. But when I imagine the future of AI, I don’t imagine a very slight amount of new stuff at the margin; rather I imagine a tsunami of independently developed capabilities, at least similar to what we’ve seen in the industrial revolution. Don’t you? (Because again, of course, if I condition on “we’re not gonna see many new capabilities from AI”, the AI risk case mostly goes away.)