I get that a lot of AI safety rhetoric is nonsensical, but I think your strategy of obscuring technical distinctions between different algorithms and implicitly assuming that all future AI architectures will be something like GPT+DPO is counterproductive.
After making a false claim, Bostrom goes on to dismiss RL approaches to creating useful, intelligent, aligned systems. But, as a point of further fact, RL approaches constitute humanity’s current best tools for aligning AI systems today! Those approaches are pretty awesome. No RLHF, then no GPT-4 (as we know it).
RLHF as understood currently (with humans directly rating neural network outputs, a la DPO) is very different from RL as understood historically (with the network interacting autonomously in the world and receiving reward from a function of the world). It’s not an error from Bostrom’s side to say something that doesn’t apply to the former when talking about the latter, though it seems like a common error to generalize from the latter to the former.
I think it’s best to think of DPO as a low-bandwidth NN-assisted supervised learning algorithm, rather than as “true reinforcement learning” (in the classical sense). That is, under supervised learning, humans provide lots of bits by directly creating a training sample, whereas with DPO, humans provide ~1 bit by picking the network-generated sample they like the most. It’s unclear to me whether DPO has any advantage over just directly letting people edit the outputs, other than that if you did that, you’d empower trolls/partisans/etc. to intentionally break the network.
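To make the "~1 bit per comparison" point concrete, here is a minimal sketch of the per-pair DPO loss as I understand it from the DPO paper. The function name, argument names, and the `beta` default are my own assumptions for illustration, not taken from any particular library. The only human input is which of the two network-generated samples counts as "chosen".

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (inputs are sequence-level log-probabilities)."""
    # Implicit "reward" of each sample: how much more likely the policy
    # makes it than the frozen reference model does.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as a numerically stable softplus(-logits).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Hypothetical numbers: the policy starts out identical to the reference model.
initial = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# After training, the policy prefers the sample the human picked.
improved = dpo_loss(-1.0, -9.0, -5.0, -5.0)
```

Swapping which sample is labeled "chosen" flips the sign of `logits`; that one label is all the human contributes per pair.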
Did RL researchers in the 1990’s sit down and carefully analyze the inductive biases of PPO on huge 2026-era LLMs, conclude that PPO probably entrains LLMs which make decisions on the basis of their own reinforcement signal, and then decide to say “RL trains agents to maximize reward”? Of course not.
I was under the impression that PPO was a recently invented algorithm? Wikipedia says it was first published in 2017, which if true would mean that all pre-2017 talk about reinforcement learning was about other algorithms than PPO.
I was under the impression that PPO was a recently invented algorithm
Well, if we’re going to get historical, PPO is a relatively small variation on Williams’s REINFORCE policy gradient model-free RL algorithm from 1992 (or earlier if you count conferences etc), with a bunch of minor DL implementation tweaks that turn out to help a lot. I don’t offhand know of any ways in which PPO’s tweaks make it meaningfully different from REINFORCE from the perspective of safety, aside from the obvious ones of working better in practice. (Which is the main reason why PPO became OA’s workhorse in its model-free RL era to train small CNNs/RNNs, before they moved to model-based RL using Transformer LLMs. Policy gradient methods based on REINFORCE certainly were not novel, but they started scaling earlier.)
So, PPO is recent, yes, but that isn’t really important to anything here. TurnTrout could just as well have used REINFORCE as the example instead.
Did RL researchers in the 1990’s sit down and carefully analyze the inductive biases of PPO on huge 2026-era LLMs, conclude that PPO probably entrains LLMs which make decisions on the basis of their own reinforcement signal, and then decide to say “RL trains agents to maximize reward”? Of course not.
I don’t know how you (TurnTrout) can say that. It certainly seems to me that plenty of researchers in 1992 were talking about either model-based RL or using model-free approaches to ground model-based RL—indeed, it’s hard to see how anything else could work in connectionism, given that model-free methods are simpler, many animals or organisms do things that can be interpreted as model-free but not model-based (while all creatures who do model-based RL, like humans, clearly also do model-free), and so on. The model-based RL was the ‘cherry on the cake’, if I may put it that way… These arguments were admittedly handwavy: “if we can’t write AGI from scratch, then we can try to learn it from scratch starting with model-free approaches like Hebbian learning, and somewhere between roughly mouse-level and human/AGI, a miracle happens, and we get full model-based reasoning”. But hey, can’t argue with success there! We have loads of nice results from DeepMind and others with this sort of flavor†.
On the other hand, I’m not able to think of any dissenters which claim that you could have AGI purely using model-free RL with no model-based RL anywhere to be seen? Like, you can imagine it working (eg. in silico environments for everything), but it’s not very plausible since it would seem like the computational requirements go astronomical fast.
Back then, they had a richer conception of RL, heavier on the model-based RL half of the field, and one more relevant to the current era, than the impoverished 2017 era of ‘let’s just PPO/Impala everything we can’t MCTS and not talk about how this is supposed to reach AGI, exactly, even if it scales reasonably well’. If you want to critique what AI researchers could imagine back in 1992, you should be reading Schmidhuber, not Bostrom. (“Computing is a pop culture”, as Kay put it, and DL, and DRL, are especially pop culture right now. Which is not necessarily a bad thing if you’re just trying to get things to work, but if you are going to make historical arguments about what people were or were not thinking in 2014, or 1992, pop culture isn’t going to cut the mustard. People back then weren’t stupid, and often had very sophisticated well-thought-out ideas & paradigms; they just had a millionth of the compute/data/infrastructure they needed to make any of it work properly...)
If you look at that REINFORCE paper, Williams isn’t even all that concerned with direct use of it to train a model to solve RL tasks.* He’s more concerned with handling non-differentiable things in general, like stochastic rather than the usual deterministic neurons we use, so you could ‘backpropagate through the environment’ models like Schmidhuber & Huber 1990, which bootstrap from random initialization using the high-variance REINFORCE-like learning signal to a superior model. (Hm, why, that sounds like the sort of thing you might do if you analyze the inductive biases of model-free approaches which entrain larger systems which have their own internal reinforcement signals which they maximize...) As Schmidhuber has been saying for decades, it’s meta-learning all the way up/down. The species-level model-free RL algorithm (evolution) creates model-free within-lifetime learning algorithms (like REINFORCE), which creates model-based within-lifetime learning algorithms (like neural net models) which create learning over families (generalization) for cross-task within-lifetime learning which create learning algorithms (ICL/history-based meta-learners**) for within-episode learning which create...
It’s no surprise that the “multiply a set of candidate entities by a fixed small percentage based on each entity’s reward” algorithm pops up everywhere from evolution to free markets to DRL to ensemble machine learning over ‘experts’, because that model-free algorithm is always available as the fallback strategy when you can’t do anything smarter (yet). Model-free is just the first step, and in many ways, least interesting & important step. I’m always weirded out to read one of these posts where something like PPO or evolution strategies is treated as the only RL algorithm around and things like expert iteration an annoying nuisance to be relegated to a footnote - ‘reward is not the optimization target!* * except when it is in these annoying exceptions like AlphaZero, but fortunately, we can ignore these, because after all, it’s not like humans or AGI or superintelligences would ever do crazy stuff like “plan” or “reason” or “search”’.
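A minimal sketch of that "multiply each candidate by a small percentage based on its reward" fallback, in the style of a multiplicative-weights update. The function name, the `rate` value, and the reward numbers are made up for illustration.

```python
def multiplicative_update(weights, rewards, rate=0.05):
    """Scale each candidate's weight by (1 + rate * reward), then renormalize."""
    scaled = [w * (1.0 + rate * r) for w, r in zip(weights, rewards)]
    total = sum(scaled)
    return [w / total for w in scaled]

weights = [0.25, 0.25, 0.25, 0.25]
rewards = [1.0, 0.0, 0.5, -0.5]  # made-up, fixed per-candidate rewards
for _ in range(200):
    weights = multiplicative_update(weights, rewards)
```

Nothing here models the environment or plans ahead; the candidate with the highest reward simply compounds its share until it dominates.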
* He’d’ve probably been surprised to see people just… using it for stuff like DoTA2 on fully-differentiable BPTT RNNs. I wonder if he’s ever done any interviews on DL recently? AFAIK he’s still alive.
** Specifically, in the case of Transformers, it seems to be by self-attention doing gradient descent steps on an abstracted version of a problem; gradient descent itself isn’t a very smart algorithm, but if the abstract version is a model that encodes the correct sufficient statistics of the broader meta-problem, then it can be very easy to make Bayes-optimal predictions/choices for any specific problem.
† my paper-of-the-day website feature yesterday popped up “Learning few-shot imitation as cultural transmission”, Bhoopchand et al 2023 (excerpts), which is a nice example because they show clearly how history+diverse-environments+simple-priors-of-an-evolvable-sort elicit ‘inner’ model-like imitation learning starting from the initial ‘outer’ model-free RL algorithm (MPO, an actor-critic).
‘reward is not the optimization target!* *except when it is in these annoying exceptions like AlphaZero, but fortunately, we can ignore these, because after all, it’s not like humans or AGI or superintelligences would ever do crazy stuff like “plan” or “reason” or “search”’.
If you’re going to mock me, at least be correct when you do it!
I think that reward is still not the optimization target in AlphaZero (the way I’m using the term, at least). Learning a leaf node evaluator on a given reinforcement signal, and then bootstrapping the leaf node evaluator via MCTS on that leaf node evaluator, does not mean that the aggregate trained system
directly optimizes for the reinforcement signal, or
“cares” about that reinforcement signal,
or “does its best” to optimize the reinforcement signal (as opposed to some historical reinforcement correlate, like winning or capturing pieces or something stranger).
If most of the “optimization power” were coming from e.g. MCTS on direct reward signal, then yup, I’d agree that the reward signal is the primary optimization target of this system. That isn’t the case here.
You might use the phrase “reward as optimization target” differently than I do, but if we’re just using words differently, then it wouldn’t be appropriate to describe me as “ignoring planning.”
Learning a leaf node evaluator on a given reinforcement signal, and then bootstrapping the leaf node evaluator via MCTS on that leaf node evaluator, does not mean that the aggregate trained system
directly optimizes for the reinforcement signal, or
“cares” about that reinforcement signal,
or “does its best” to optimize the reinforcement signal (as opposed to some historical reinforcement correlate, like winning or capturing pieces or something stranger).
Yes, it does mean all of that, because MCTS is asymptotically optimal (unsurprisingly, given that it’s a tree search on the model), and will eg. happily optimize the reinforcement signal rather than proxies like capturing pieces as it learns through search that capturing pieces in particular states is not as useful as usual. If you expand out the search tree long enough (whether or not you use the AlphaZero NN to make that expansion more efficient by evaluating intermediate nodes and then back-propagating that through the current tree), then it converges on the complete, true, ground truth game tree, with all leafs evaluated with the true reward, with any imperfections in the leaf evaluator value estimate washed out. It directly optimizes the reinforcement signal, cares about nothing else, and is very pleased to lose if that results in a higher reward or not capture pieces if that results in a higher reward.*
All the NN is, is a cache or an amortization of the search algorithm. Caches are important and life would be miserable without them, but it would be absurd to say that adding a cache to a function means “that function doesn’t compute the function” or “the range is not the target of the function”.
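A toy illustration of the limiting claim only, using plain depth-limited negamax on a tiny Nim variant rather than MCTS or anything AlphaZero-like. The game, the deliberately wrong `overconfident` leaf evaluator, and all names are assumptions chosen for the sketch. It says nothing about what happens under realistic search budgets.

```python
def negamax(stones, depth, leaf_eval):
    """Value for the player to move: +1 win, -1 loss. Taking the last stone wins."""
    if stones == 0:
        return -1  # the previous player took the last stone, so the mover has lost
    if depth == 0:
        return leaf_eval(stones)  # search budget exhausted: fall back on the estimate
    return max(-negamax(stones - k, depth - 1, leaf_eval)
               for k in (1, 2, 3) if k <= stones)

def overconfident(stones):
    """Deliberately bad evaluator: always claims the player to move is winning."""
    return 1

shallow = negamax(8, 0, overconfident)  # pure evaluator, no search
deep = negamax(8, 8, overconfident)     # deep enough to reach every terminal state
```

With 8 stones the player to move is actually lost. The shallow call trusts the evaluator and reports a win, while the full-depth call never consults the evaluator and returns the true game value.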
I’m a little baffled by this argument that because the NN is not already omniscient and might mis-estimate the value of a leaf node, that apparently it’s not optimizing for the reward and that’s not the goal of the system and the system doesn’t care about reward, no matter how much it converges toward said reward as it plans/searches more, or gets better at acquiring said reward as it fixes those errors.
If most of the “optimization power” were coming from e.g. MCTS on direct reward signal, then yup, I’d agree that the reward signal is the primary optimization target of this system.
The reward signal is in fact the primary optimization target, because it is where the neural net’s value estimates derive from, and the ‘system’ corrects them eventually and converges. The dog wags the tail, sooner or later.
* I think I’ve noted this elsewhere, and mentioned my Kelly coinflip trajectories as a nice visualization of how model-based RL will behave, but to repeat: MCTS algorithms in Go/chess were noted for that sort of behavior, especially for sacrificing pieces or territory while they were ahead, in order to ‘lock down’ the game and maximize the probability of victory, rather than the margin of victory; and vice-versa, for taking big risks when they were behind. This is because the tree didn’t back-propagate any rewards on ‘margin’, just on 0/1 rewards from victory, and didn’t care about proxy heuristics like ‘pieces captured’ if the tree search found otherwise.
He’d’ve probably been surprised to see people just… using it for stuff like DoTA2 on fully-differentiable BPTT RNNs. I wonder if he’s ever done any interviews on DL recently? AFAIK he’s still alive.
Oh dang! RIP. I guess there’s a lesson there—probably more effort should be put into interviewing the pioneers of connectionism & related fields right now, while they have some perspective and before they all die off.
Well, if we’re going to get historical, PPO is a relatively small variation on Williams’s REINFORCE policy gradient model-free RL algorithm from 1992 (or earlier if you count conferences etc)
RLHF as understood currently (with humans directly rating neural network outputs, a la DPO) is very different from RL as understood historically (with the network interacting autonomously in the world and receiving reward from a function of the world).
This is actually pointing to the difference between online and offline learning algorithms, not RL versus non-RL learning algorithms. Online learning has long been known to be less stable than offline learning. That’s what’s primarily responsible for most “reward hacking”-esque results, such as the CoastRunners degenerate policy. In contrast, offline RL is surprisingly stable and robust to reward misspecification. I think it would have been better if the alignment community had been focused on the stability issues of online learning, rather than the supposed “agentness” of RL.
I was under the impression that PPO was a recently invented algorithm? Wikipedia says it was first published in 2017, which if true would mean that all pre-2017 talk about reinforcement learning was about other algorithms than PPO.
PPO may have been invented in 2017, but there are many prior RL algorithms for which Alex’s description of “reward as learning rate multiplier” is true. In fact, PPO is essentially a tweaked version of REINFORCE, for which a bit of searching brings up Simple statistical gradient-following algorithms for connectionist reinforcement learning as the earliest available reference I can find. It was published in 1992, a full 22 years before Bostrom’s book. In fact, “reward as learning rate multiplier” is even more clearly true of most of the update algorithms described in that paper. E.g., equation 11:
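(The general form of the REINFORCE update in that paper, reconstructed from my recollection of Williams's notation, so treat the exact symbols as an assumption:)

$$\Delta w_{ij} = \alpha_{ij}\,(r - b_{ij})\,e_{ij}, \qquad e_{ij} = \frac{\partial \ln g_i}{\partial w_{ij}}$$

where $\alpha_{ij}$ is a learning rate factor, $r$ is the reinforcement (reward), $b_{ij}$ is the reinforcement baseline, and $e_{ij}$ is the "characteristic eligibility" of weight $w_{ij}$, i.e. the gradient of the log-probability of the unit's sampled output.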
Here, the reward (adjusted by a “reinforcement baseline” bij) literally just multiplies the learning rate. Beyond PPO and REINFORCE, this “x as learning rate multiplier” pattern is actually extremely common in different RL formulations. From lecture 7 of David Silver’s RL course:
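(The forms in question, as I recall them from that lecture's summary slide; the notation is my reconstruction rather than a quotation:)

$$\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\; v_t\right] && \text{REINFORCE} \\
&= \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\; Q^w(s,a)\right] && \text{Q actor-critic} \\
&= \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\; A^w(s,a)\right] && \text{advantage actor-critic} \\
&= \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\; \delta\right] && \text{TD actor-critic}
\end{aligned}$$

In each line a scalar ($v_t$, $Q^w$, $A^w$, or $\delta$) multiplies the same score-function term $\nabla_\theta \log \pi_\theta(s,a)$.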
To be honest, it was a major blackpill for me to see the rationalist community, whose whole founding premise was that they were supposed to be good at making efficient use of the available evidence, so completely missing this very straightforward interpretation of RL (at least, I’d never heard of it from alignment literature until I myself came up with it when I realized that the mechanistic function of per-trajectory rewards in a given batched update was to provide the weights of a linear combination of the trajectory gradients. Update: Gwern’s description here is actually somewhat similar).
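A minimal sketch of that "rewards are the weights of a linear combination of per-sample gradients" reading, for a softmax policy over three actions with one-step trajectories. The sampled actions, the rewards, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, action):
    """Gradient of log pi(action) with respect to the logits of a softmax policy."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def reinforce_update(logits, actions, rewards, lr=0.1):
    """Batched REINFORCE step: average of reward_i * grad log pi(action_i)."""
    n = len(actions)
    step = [0.0] * len(logits)
    for action, reward in zip(actions, rewards):
        g = grad_log_prob(logits, action)
        # The reward enters only as a scalar weight on this sample's gradient.
        step = [s + lr * reward * gi / n for s, gi in zip(step, g)]
    return step

logits = [0.0, 0.0, 0.0]
actions = [0, 1, 2, 0]            # hypothetical sampled actions
rewards = [1.0, 0.0, -1.0, 2.0]   # hypothetical per-sample rewards
step = reinforce_update(logits, actions, rewards)
doubled = reinforce_update(logits, actions, [2 * r for r in rewards])
```

Doubling every reward doubles the step, exactly as doubling `lr` would, and a sample with zero reward contributes nothing to the update.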
implicitly assuming that all future AI architectures will be something like GPT+DPO is counterproductive.
When I bring up the “actual RL algorithms don’t seem very dangerous or agenty to me” point, people often respond with “Future algorithms will be different and more dangerous”.
I think this is a bad response for many reasons. In general, it serves as an unlimited excuse to never update on currently available evidence. It also has a bad track record in ML, as the core algorithmic structure of RL algorithms capable of delivering SOTA results has not changed that much in over 3 decades. In fact, just recently Cohere published Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs, which found that the classic REINFORCE algorithm actually outperforms PPO for LLM RLHF finetuning. Finally, this counterpoint seems irrelevant for Alex’s point in this post, which is about historical alignment arguments about historical RL algorithms. He even included disclaimers at the top about this not being an argument for optimism about future AI systems.
I wasn’t around in the community in 2010-2015, so I don’t know what the state of RL knowledge was at that time. However, I dispute the claim that rationalists “completely miss[ed] this [..] interpretation”:
To be honest, it was a major blackpill for me to see the rationalist community, whose whole founding premise was that they were supposed to be good at making efficient use of the available evidence, so completely missing this very straightforward interpretation of RL [..] the mechanistic function of per-trajectory rewards in a given batched update was to provide the weights of a linear combination of the trajectory gradients.
Ever since I entered the community, I’ve definitely heard of people talking about policy gradient as “upweighting trajectories with positive reward/downweighting trajectories with negative reward” since 2016, albeit in person. I remember being shown a picture sometime in 2016/17 that looks something like this when someone (maybe Paul?) was explaining REINFORCE to me: (I couldn’t find it, so reconstructing it from memory)
In addition, I would be surprised if any of the CHAI PhD students when I was at CHAI from 2017->2021, many of whom have taken deep RL classes at Berkeley, missed this “upweight trajectories in proportion to their reward” interpretation? Most of us at the time have also implemented various RL algorithms from scratch, and there the “weighting trajectory gradients” perspective pops out immediately.
As another data point, when I taught MLAB/WMLB in 2022/3, my slides also contained this interpretation of REINFORCE (after deriving it) in so many words:
Insofar as people are making mistakes about reward and RL, it’s not due to having never been exposed to this perspective.
That being said, I do agree that there’s been substantial confusion in this community, mainly of two kinds:
Confusing the objective function being optimized to train a policy with how the policy is mechanistically implemented: Just because the outer loop is modifying/selecting for a policy to score highly on some objective function, does not necessarily mean that the resulting policy will end up selecting actions based on said objective.
Confusing “this policy is optimized for X” with “this policy is optimal for X”: this is the actual mistake I think Bostrom is making in Alex’s example. It’s true that an agent that wireheads achieves higher reward than on the training distribution (and the optimal agent for the reward achieves reward at least as good as wireheading). And I think that Alex and you would also agree with me that it’s sometimes valuable to reason about the global optima in policy space. But it’s a mistake to identify the outputs of optimization with the optimal solution to an optimization problem, and many people were making this jump without noticing it.
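A toy numerical illustration of the "optimized for" versus "optimal for" gap: gradient ascent on a made-up two-peaked objective. The function `f`, the starting point, and the step size are arbitrary choices for the sketch, not a model of any RL system.

```python
def f(x):
    """Made-up objective with a lower peak near x = -1 and a higher peak near x = +1."""
    return -(x * x - 1.0) ** 2 + 0.3 * x

def df(x):
    """Derivative of f."""
    return -4.0 * x * (x * x - 1.0) + 0.3

def gradient_ascent(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x += lr * df(x)
    return x

optimized = gradient_ascent(-1.5)                 # what the optimizer actually outputs
grid = [i / 1000.0 for i in range(-2000, 2001)]
optimal = max(grid, key=f)                        # brute-force global optimum on a grid
```

The process really was optimizing `f` at every step, yet its output sits on the lower peak. Reasoning about `optimal` would tell you little about `optimized` here.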
Again, I contend these confusions were not due to a lack of exposure to the “rewards as weighting trajectories” perspective. Instead, the reasons I remember hearing back in 2017-2018 for why we should jump from “RL is optimizing agents for X” to “RL outputs agents that both optimize X and are optimal for X”:
We’d be really confused if we couldn’t reason about “optimal” agents, so we should solve that first. This is the main justification I heard from the MIRI people about why they studied idealized agents. Oftentimes globally optimal solutions are easier to reason about than local optima or saddle points, or are useful for operationalizing concepts. Because a lot of the community was focused on philosophical deconfusion (often w/ minimal knowledge of ML or RL), many people naturally came to jump the gap between “the thing we’re studying” and “the thing we care about”.
Reasoning about optima gives a better picture of powerful, future AGIs. Insofar as we’re far from transformative AI, you might expect that current AIs are a poor model for how transformative AI will look. In particular, you might expect that modeling transformative AI as optimal leads to clearer reasoning than analogizing them to current systems. This point has become increasingly tenuous since GPT-2, but
Some off-policy RL algorithms are well described as having a “reward” maximizing component: And, these were the approaches that people were using and thinking about at the time. For example, the most hyped results in deep learning in the mid 2010s were probably DQN and AlphaGo/GoZero/Zero. And many people believed that future AIs would be implemented via model-based RL. All of these approaches result in policies that contain an internal component which is searching for actions that maximize some learned objective. Given that ~everyone uses policy gradient variants for RL on SOTA LLMs, this does turn out to be incorrect ex post. But if the most impressive AIs seem to be implemented in ways that correspond to internal reward maximization, it does seem very understandable to think about AGIs as explicit reward optimizers.
This is how many RL pioneers reasoned about their algorithms. I agree with Alex that this is probably from the control theory roots, where a PID controller is well modeled as picking trajectories that minimize cost, in a way that early simple RL policies are not well modeled as internally picking trajectories that maximize reward.
Also, sometimes it is just the words being similar; it can be hard to keep track of the differences between “optimizing for”, “optimized for”, and “optimal for” in normal conversation.
I think if you want to prevent the community from repeating these confusions, this looks less like “here’s an alternative perspective through which you can view policy gradient” and more “here’s why reasoning about AGI as ‘optimal’ agents is misleading” and “here’s why reasoning about your 1 hidden layer neural network policy as if it were optimizing the reward is bad”.
An aside:
In general, I think that many ML-knowledgeable people (arguably myself included) correctly notice that the community is making many mistakes in reasoning, that they resolve internally using ML terminology or frames from the ML literature. But without reasoning carefully about the problem, the terminology or frames themselves are insufficient to resolve the confusion. (Notice how many Deep RL people make the same mistake!) And, as Alex and you have argued before, the standard ML frames and terminology introduce their own confusions (e.g. ‘attention’).
A shallow understanding of “policy gradient is just upweighting trajectories” may in fact lead to making the opposite mistake: assuming that it can never lead to intelligent, optimizer-y behavior. (Again, notice how many ML academics made exactly this mistake) Or, more broadly, thinking about ML algorithms purely from the low-level, mechanistic frame can lead to confusions along the lines of “next token prediction can only lead to statistical parrots without true intelligence”. Doubly so if you’ve only worked with policy gradient or language modeling with tiny models.
Ever since I entered the community, I’ve definitely heard of people talking about policy gradient as “upweighting trajectories with positive reward/downweighting trajectories with negative reward” since 2016, albeit in person. I remember being shown a picture sometime in 2016/17 that looks something like this when someone (maybe Paul?) was explaining REINFORCE to me: (I couldn’t find it, so reconstructing it from memory)
Knowing how to reason about “upweighting trajectories” when explicitly prompted or in narrow contexts of algorithmic implementation is not sufficient to conclude “people basically knew this perspective” (but it’s certainly evidence). See Outside the Laboratory:
Now suppose we discover that a Ph.D. economist buys a lottery ticket every week. We have to ask ourselves: Does this person really understand expected utility, on a gut level? Or have they just been trained to perform certain algebra tricks?
Knowing “vanilla PG upweights trajectories”, and being able to explain the math—this is not enough to save someone from the rampant reward confusions. Certainly Yoshua Bengio could explain vanilla PG, and yet he goes on about how RL (almost certainly, IIRC) trains reward maximizers.
I contend these confusions were not due to a lack of exposure to the “rewards as weighting trajectories” perspective.
I personally disagree—although I think your list of alternative explanations is reasonable. If alignment theorists had been using this (simple and obvious-in-retrospect) “reward chisels circuits into the network” perspective, if they had really been using it and felt it deep within their bones, I think they would not have been particularly tempted by this family of mistakes.
What’s the difference between “Alice is falling victim to confusions/reasoning mistakes about X” and “Alice disagrees with me about X”?
I feel like using the former puts undue social pressure on observers to conclude that you’re right, and makes it less likely they correctly adjudicate between the perspectives.
(Perhaps you can empathise with me here, since arguably certain people taking this sort of tone is one of the reasons AI x-risk arguments have not always been vetted as carefully as they should!)
What’s the difference between “Alice is falling victim to confusions/reasoning mistakes about X” and “Alice disagrees with me about X”?
I suspect that, for Alex Turner, writing the former instead of the latter is a signal that he thinks he has identified the specific confusion/reasoning mistake his interlocutor is engaged in, likely as a result of having seen closely analogous arguments in the past from other people who turned out (or even admitted) to be confused about these matters after conversations with him.
In fact, PPO is essentially a tweaked version of REINFORCE,
Valid point.
Beyond PPO and REINFORCE, this “x as learning rate multiplier” pattern is actually extremely common in different RL formulations. From lecture 7 of David Silver’s RL course:
Critically though, none of Q, A, or delta denotes reward. Rather, they are quantities which are supposed to estimate the effect of an action on the sum of future rewards; hence while pure REINFORCE doesn’t really maximize the sum of rewards, these other algorithms are attempts to do so more consistently, and the existence of such attempts shows that it’s likely we will see more and better attempts in the future.
It was published in 1992, a full 22 years before Bostrom’s book.
Bostrom’s book explicitly states what kinds of reinforcement learning algorithms he had in mind, and they are not REINFORCE:
Often, the learning algorithm involves the gradual construction of some kind of evaluation function, which assigns values to states, state–action pairs, or policies. (For instance, a program can learn to play backgammon by using reinforcement learning to incrementally improve its evaluation of possible board positions.) The evaluation function, which is continuously updated in light of experience, could be regarded as incorporating a form of learning about value. However, what is being learned is not new final values but increasingly accurate estimates of the instrumental values of reaching particular states (or of taking particular actions in particular states, or of following particular policies). Insofar as a reinforcement-learning agent can be described as having a final goal, that goal remains constant: to maximize future reward. And reward consists of specially designated percepts received from the environment. Therefore, the wireheading syndrome remains a likely outcome in any reinforcement agent that develops a world model sophisticated enough to suggest this alternative way of maximizing reward.
Similarly, before I even got involved with alignment or rationalism, the canonical reinforcement learning algorithm I had heard of was TD, not REINFORCE.
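For readers who have not seen it, a minimal sketch of tabular TD(0), the kind of incrementally improved evaluation function the Bostrom passage describes. The deterministic three-state chain, the reward placement, and the hyperparameters are assumptions picked to keep the example tiny.

```python
def td0_chain(num_states=3, episodes=500, alpha=0.1, gamma=0.9):
    """TD(0) value estimates on a deterministic chain 0 -> 1 -> ... -> terminal."""
    values = [0.0] * (num_states + 1)  # last entry is the terminal state, fixed at 0
    for _ in range(episodes):
        for s in range(num_states):
            reward = 1.0 if s == num_states - 1 else 0.0  # reward only on the final step
            # TD error: how much better or worse the outcome looked than predicted.
            td_error = reward + gamma * values[s + 1] - values[s]
            values[s] += alpha * td_error
    return values[:num_states]

values = td0_chain()
```

Here the learned quantity is an estimate of discounted future reward from each state, which is the sense in which TD-style methods differ from the pure "reward times eligibility" REINFORCE update.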
It also has a bad track record in ML, as the core algorithmic structure of RL algorithms capable of delivering SOTA results has not changed that much in over 3 decades.
Huh? DreamerV3 is clearly a step in the direction of utility maximization (away from “reward is not the optimization target”), and it claims to set SOTA on a bunch of problems. Are you saying there’s something wrong with their evaluation?
In fact, just recently Cohere published Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs, which found that the classic REINFORCE algorithm actually outperforms PPO for LLM RLHF finetuning.
LLM RLHF finetuning doesn’t build new capabilities, so it should be ignored for this discussion.
Finally, this counterpoint seems irrelevant for Alex’s point in this post, which is about historical alignment arguments about historical RL algorithms. He even included disclaimers at the top about this not being an argument for optimism about future AI systems.
It’s not irrelevant. The fact that Alex Turner explicitly replies to Nick Bostrom and calls his statement nonsense means that Alex Turner does not get to use a disclaimer to decide what the subject of discussion is. Rather, the subject of discussion is whatever Bostrom was talking about. The disclaimer rather serves as a way of turning our attention away from stuff like DreamerV3 and towards stuff like DPO. However DreamerV3 seems like a closer match for Bostrom’s discussion than DPO is, so the only way turning our attention away from it can be valid is if we assume DreamerV3 is a dead end and DPO is the only future.
This is actually pointing to the difference between online and offline learning algorithms, not RL versus non-RL learning algorithms.
I was kind of pointing to both at once.
In contrast, offline RL is surprisingly stable and robust to reward misspecification.
Seems to me that the linked paper makes the argument “If you don’t include attempts to try new stuff in your training data, you won’t know what happens if you do new stuff, which means you won’t see new stuff as a good opportunity”. Which seems true but also not very interesting, because we want to build capabilities to do new stuff, so this should instead make us update to assume that the offline RL setup used in this paper won’t be what builds capabilities in the limit. (Not to say that they couldn’t still use this sort of setup as some other component than what builds the capabilities, or that they couldn’t come up with an offline RL method that does want to try new stuff—merely that this particular argument for safety bears too heavy of an alignment tax to carry us on its own.)
“If you don’t include attempts to try new stuff in your training data, you won’t know what happens if you do new stuff, which means you won’t see new stuff as a good opportunity”. Which seems true but also not very interesting, because we want to build capabilities to do new stuff, so this should instead make us update to assume that the offline RL setup used in this paper won’t be what builds capabilities in the limit.
I’m sympathetic to this argument (and think the paper overall isn’t super object-level important), but also note that they train e.g. Hopper policies to hop continuously, even though lots of the demonstrations fall over. That’s something new.
I mean sure, it can probably do some very slight generalization beyond the boundary of its training data. But when I imagine the future of AI, I don’t imagine a very slight amount of new stuff at the margin; rather I imagine a tsunami of independently developed capabilities, at least similar to what we’ve seen in the industrial revolution. Don’t you? (Because again of course if I condition on “we’re not gonna see many new capabilities from AI”, the AI risk case mostly goes away.)
I think this is a key crux of disagreement on alignment:
When I bring up the “actual RL algorithms don’t seem very dangerous or agenty to me” point, people often respond with “Future algorithms will be different and more dangerous”.
I think this is a bad response for many reasons.
On the one hand, empiricism and assuming that the future will be much like the past have a great track record.
On the other, predicting the future is the name of the game in alignment. And while the future is reliably much like the past, it’s never been exactly like the past.
So opinions pull in both directions.
On the object level, I certainly agree that existing RL systems aren’t very agenty or dangerous. It seems like you’re predicting that people won’t make AI that’s particularly agentic any time soon. It seems to me that they’ll certainly want to. And I think it will be easy if non-agentic foundation models get good. Turning a smart foundation model into an agent is as simple as the prompt “make and execute a plan that accomplishes goal [x]. Use [these APIs] to gather information and take actions”.
I think this is what Alex was pointing to in the OP by saying
I’m worried about people turning AIs into agentic systems using scaffolding and other tricks, and then instructing the systems to complete large-scale projects.
I think this is the default future, so much so that I don’t think it matters if agency would emerge through RL. We’ll build it in. Humans are burdened with excessive curiosity, optimism, and ambition. Especially the type of humans that head AI/AGI projects.
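The "scaffolding" idea in the comment above can be sketched as a short loop. This is a minimal illustration only: `run_agent`, the `CALL`/`DONE` reply format, the `lookup` tool, and `scripted_model` (a stand-in for a real foundation model) are all hypothetical names invented for this example, not any real library's API.

```python
from typing import Callable, Dict, List

def run_agent(model: Callable[[str], str],
              tools: Dict[str, Callable[[str], str]],
              goal: str,
              max_steps: int = 5) -> List[str]:
    """Wrap a text-in/text-out model in a plan-act-observe loop (hypothetical protocol)."""
    transcript = [
        f"Make and execute a plan that accomplishes goal: {goal}. "
        f"Available tools: {', '.join(sorted(tools))}. "
        "Reply 'CALL <tool> <arg>' or 'DONE <answer>'."
    ]
    for _ in range(max_steps):
        reply = model("\n".join(transcript))
        transcript.append(reply)
        if reply.startswith("DONE"):
            break
        parts = reply.split(maxsplit=2)
        if len(parts) == 3 and parts[0] == "CALL" and parts[1] in tools:
            # Execute the requested tool and feed the observation back to the model.
            transcript.append("RESULT " + tools[parts[1]](parts[2]))
        else:
            transcript.append("RESULT error: unrecognized action")
    return transcript

def scripted_model(prompt: str) -> str:
    """Stand-in for a foundation model: one lookup, then an answer."""
    if "RESULT" not in prompt:
        return "CALL lookup capital of France"
    return "DONE Paris"

tools = {"lookup": lambda query: "Paris" if "France" in query else "unknown"}
log = run_agent(scripted_model, tools, "name the capital of France")
```

The agentic behavior here lives entirely in the wrapper loop, which is the point being made: no RL training step is involved in turning the model into something that plans and acts.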
offline RL is surprisingly stable and robust to reward misspecification
Wow, what a wild paper. The basic idea—that “pessimism” about off-distribution state/action pairs induces pessimistically-trained RL agents to learn policies that hang around in the training distribution for a long time, even if that goes against their reward function—is a fairly obvious one. But what’s not obvious is the wide variety of algorithms this applies to.
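A toy sketch of the "pessimism" idea described above, reduced to a one-step bandit. The count-based penalty, the function names, and the logged data are assumptions chosen for illustration; this is not the linked paper's method.

```python
import math
from collections import defaultdict
from typing import List, Tuple

def action_values(dataset: List[Tuple[int, float]], n_actions: int,
                  penalty: float) -> List[float]:
    """Mean logged reward per action, minus a penalty that grows as data gets scarce."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for action, reward in dataset:
        totals[action] += reward
        counts[action] += 1
    values = []
    for a in range(n_actions):
        if counts[a] == 0:
            values.append(-math.inf)  # never observed: maximally pessimistic
        else:
            values.append(totals[a] / counts[a] - penalty / math.sqrt(counts[a]))
    return values

def best_action(values: List[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)

# Logged data: actions 0 and 1 are well covered, action 2 was seen once
# with a large reward, and action 3 never appears.
dataset = [(0, 1.0)] * 50 + [(1, 0.2)] * 50 + [(2, 3.0)]

greedy_choice = best_action(action_values(dataset, 4, penalty=0.0))
pessimistic_choice = best_action(action_values(dataset, 4, penalty=3.0))
```

With no penalty the policy chases the single high-reward sample; with the penalty it stays with the well-covered action even though the logged reward signal points elsewhere.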
I genuinely don’t believe their decision transformer results. I.e. I think with p~0.8, if they (or the authors of the paper whose hyperparameters they copied) made better design choices, they would have gotten a decision transformer that was actually sensitive to reward. But on the flip side, with p~0.2 they just showed that decision transformers don’t work! (For these tasks.)
I was under the impression that PPO was a recently invented algorithm? Wikipedia says it was first published in 2017, which if true would mean that all pre-2017 talk about reinforcement learning was about other algorithms than PPO.
Wikipedia says:
PPO was developed by John Schulman in 2017, and had become the default reinforcement learning algorithm at American artificial intelligence company OpenAI.
I was under the impression that PPO was a recently invented algorithm? Wikipedia says it was first published in 2017, which if true would mean that all pre-2017 talk about reinforcement learning was about other algorithms than PPO.
Well, if we’re going to get historical, PPO is a relatively small variation on Williams’s REINFORCE policy gradient model-free RL algorithm from 1992 (or earlier if you count conferences etc), with a bunch of minor DL implementation tweaks that turn out to help a lot. I don’t offhand know of any ways in which PPO’s tweaks make it meaningfully different from REINFORCE from the perspective of safety, aside from the obvious ones of working better in practice. (Which is the main reason why PPO became OA’s workhorse in its model-free RL era to train small CNNs/RNNs, before they moved to model-based RL using Transformer LLMs. Policy gradient methods based on REINFORCE certainly were not novel, but they started scaling earlier.)
So, PPO is recent, yes, but that isn’t really important to anything here. TurnTrout could just as well have used REINFORCE as the example instead.
I don’t know how you (TurnTrout) can say that. It certainly seems to me that plenty of researchers in 1992 were talking about either model-based RL or using model-free approaches to ground model-based RL—indeed, it’s hard to see how anything else could work in connectionism, given that model-free methods are simpler, many animals or organisms do things that can be interpreted as model-free but not model-based (while all creatures who do model-based RL, like humans, clearly also do model-free), and so on. The model-based RL was the ‘cherry on the cake’, if I may put it that way… These arguments were admittedly handwavy: “if we can’t write AGI from scratch, then we can try to learn it from scratch starting with model-free approaches like Hebbian learning, and somewhere between roughly mouse-level and human/AGI, a miracle happens, and we get full model-based reasoning”. But hey, can’t argue with success there! We have loads of nice results from DeepMind and others with this sort of flavor†.
On the other hand, I’m not able to think of any dissenters who claim that you could have AGI purely using model-free RL with no model-based RL anywhere to be seen. Like, you can imagine it working (eg. in silico environments for everything), but it’s not very plausible since it would seem like the computational requirements go astronomical fast.
Back then, they had a richer conception of RL, heavier on the model-based RL half of the field, and one more relevant to the current era, than the impoverished 2017 era of ‘let’s just PPO/Impala everything we can’t MCTS and not talk about how this is supposed to reach AGI, exactly, even if it scales reasonably well’. If you want to critique what AI researchers could imagine back in 1992, you should be reading Schmidhuber, not Bostrom. (“Computing is a pop culture”, as Kay put it, and DL, and DRL, are especially pop culture right now. Which is not necessarily a bad thing if you’re just trying to get things to work, but if you are going to make historical arguments about what people were or were not thinking in 2014, or 1992, pop culture isn’t going to cut the mustard. People back then weren’t stupid, and often had very sophisticated well-thought-out ideas & paradigms; they just had a millionth of the compute/data/infrastructure they needed to make any of it work properly...)
If you look at that REINFORCE paper, Williams isn’t even all that concerned with direct use of it to train a model to solve RL tasks.* He’s more concerned with handling non-differentiable things in general, like stochastic rather than the usual deterministic neurons we use, so you could ‘backpropagate through the environment’ models like Schmidhuber & Huber 1990, which bootstrap from random initialization using the high-variance REINFORCE-like learning signal to a superior model. (Hm, why, that sounds like the sort of thing you might do if you analyze the inductive biases of model-free approaches which entrain larger systems which have their own internal reinforcement signals which they maximize...) As Schmidhuber has been saying for decades, it’s meta-learning all the way up/down. The species-level model-free RL algorithm (evolution) creates model-free within-lifetime learning algorithms (like REINFORCE), which creates model-based within-lifetime learning algorithms (like neural net models) which create learning over families (generalization) for cross-task within-lifetime learning which create learning algorithms (ICL/history-based meta-learners**) for within-episode learning which create...
It’s no surprise that the “multiply a set of candidate entities by a fixed small percentage based on each entity’s reward” algorithm pops up everywhere from evolution to free markets to DRL to ensemble machine learning over ‘experts’, because that model-free algorithm is always available as the fallback strategy when you can’t do anything smarter (yet). Model-free is just the first step, and in many ways, least interesting & important step. I’m always weirded out to read one of these posts where something like PPO or evolution strategies is treated as the only RL algorithm around and things like expert iteration an annoying nuisance to be relegated to a footnote - ‘reward is not the optimization target!* * except when it is in these annoying exceptions like AlphaZero, but fortunately, we can ignore these, because after all, it’s not like humans or AGI or superintelligences would ever do crazy stuff like “plan” or “reason” or “search”’.
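The "multiply a set of candidate entities by a fixed small percentage based on each entity’s reward" rule mentioned above can be written in a few lines. This is a sketch of a generic multiplicative-weights style update; the step size `eta`, the fixed per-candidate rewards, and the names are assumptions for illustration.

```python
from typing import List

def multiplicative_update(shares: List[float], rewards: List[float],
                          eta: float = 0.05) -> List[float]:
    """Scale each candidate's share by (1 + eta * reward), then renormalize.

    Assumes eta * |reward| < 1 so every share stays positive.
    """
    scaled = [s * (1.0 + eta * r) for s, r in zip(shares, rewards)]
    total = sum(scaled)
    return [s / total for s in scaled]

shares = [0.25, 0.25, 0.25, 0.25]   # four candidates, initially equal
rewards = [1.0, 0.5, 0.0, -0.5]     # fixed reward per candidate each round
for _ in range(200):
    shares = multiplicative_update(shares, rewards)
```

No model of why one candidate earns more reward is needed; the rule only reweights what already exists, which is why it is available as a fallback in so many settings.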
* He’d’ve probably been surprised to see people just… using it for stuff like DoTA2 on fully-differentiable BPTT RNNs. I wonder if he’s ever done any interviews on DL recently? AFAIK he’s still alive.
** Specifically, in the case of Transformers, it seems to be by self-attention doing gradient descent steps on an abstracted version of a problem; gradient descent itself isn’t a very smart algorithm, but if the abstract version is a model that encodes the correct sufficient statistics of the broader meta-problem, then it can be very easy to make Bayes-optimal predictions/choices for any specific problem.
† my paper-of-the-day website feature yesterday popped up “Learning few-shot imitation as cultural transmission”, Bhoopchand et al 2023 (excerpts) which is a nice example because they show clearly how history+diverse-environments+simple-priors-of-an-evolvable-sort elicit ‘inner’ model-like imitation learning starting from the initial ‘outer’ model-free RL algorithm (MPO, an actor-critic).
If you’re going to mock me, at least be correct when you do it!
I think that reward is still not the optimization target in AlphaZero (the way I’m using the term, at least). Learning a leaf node evaluator on a given reinforcement signal, and then bootstrapping the leaf node evaluator via MCTS on that leaf node evaluator, does not mean that the aggregate trained system
directly optimizes for the reinforcement signal, or
“cares” about that reinforcement signal,
or “does its best” to optimize the reinforcement signal (as opposed to some historical reinforcement correlate, like winning or capturing pieces or something stranger).
If most of the “optimization power” were coming from e.g. MCTS on direct reward signal, then yup, I’d agree that the reward signal is the primary optimization target of this system. That isn’t the case here.
You might use the phrase “reward as optimization target” differently than I do, but if we’re just using words differently, then it wouldn’t be appropriate to describe me as “ignoring planning.”
Yes, it does mean all of that, because MCTS is asymptotically optimal (unsurprisingly, given that it’s a tree search on the model), and will eg. happily optimize the reinforcement signal rather than proxies like capturing pieces as it learns through search that capturing pieces in particular states is not as useful as usual. If you expand out the search tree long enough (whether or not you use the AlphaZero NN to make that expansion more efficient by evaluating intermediate nodes and then back-propagating that through the current tree), then it converges on the complete, true, ground truth game tree, with all leaves evaluated with the true reward, with any imperfections in the leaf evaluator value estimate washed out. It directly optimizes the reinforcement signal, cares about nothing else, and is very pleased to lose if that results in a higher reward or not capture pieces if that results in a higher reward.*
All the NN is, is a cache or an amortization of the search algorithm. Caches are important and life would be miserable without them, but it would be absurd to say that adding a cache to a function means “that function doesn’t compute the function” or “the range is not the target of the function”.
I’m a little baffled by this argument that because the NN is not already omniscient and might mis-estimate the value of a leaf node, that apparently it’s not optimizing for the reward and that’s not the goal of the system and the system doesn’t care about reward, no matter how much it converges toward said reward as it plans/searches more, or gets better at acquiring said reward as it fixes those errors.
The reward signal is in fact the primary optimization target, because it is where the neural net’s value estimates derive from, and the ‘system’ corrects them eventually and converges. The dog wags the tail, sooner or later.
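The asymptotic part of this argument can be illustrated on a toy game. The sketch below uses plain depth-limited negamax rather than MCTS or AlphaZero, on a take-away game (remove 1 or 2 stones; whoever takes the last stone wins). The game, the function names, and the deliberately wrong leaf evaluator are all assumptions made for this example.

```python
from typing import Callable

def search(stones: int, depth: int, leaf_eval: Callable[[int], float]) -> float:
    """Value of the position for the player to move (+1 win, -1 loss)."""
    if stones == 0:
        return -1.0               # true reward: the previous player took the last stone
    if depth == 0:
        return leaf_eval(stones)  # search budget exhausted: trust the learned estimate
    return max(-search(stones - take, depth - 1, leaf_eval)
               for take in (1, 2) if take <= stones)

def true_value(stones: int) -> float:
    """Known solution of this game: multiples of 3 are lost for the player to move."""
    return -1.0 if stones % 3 == 0 else 1.0

def bad_eval(stones: int) -> float:
    """A maximally misleading evaluator: the exact opposite of the true value."""
    return -true_value(stones)

shallow = search(6, 1, bad_eval)  # limited search inherits the evaluator's error
deep = search(6, 6, bad_eval)     # full-depth search never consults the evaluator
```

At full depth the evaluator's errors wash out because only true terminal rewards are backed up. The sketch says nothing about what happens at realistic search budgets, where the result still depends on the evaluator, which is the part the two sides of this exchange weigh differently.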
* I think I’ve noted this elsewhere, and mentioned my Kelly coinflip trajectories as a nice visualization of how model-based RL will behave, but to repeat: MCTS algorithms in Go/chess were noted for that sort of behavior, especially for sacrificing pieces or territory while they were ahead, in order to ‘lock down’ the game and maximize the probability of victory, rather than the margin of victory; and vice-versa, for taking big risks when they were behind. Because the tree didn’t back-propagate any rewards on ‘margin’, just on 0/1 rewards from victory, and didn’t care about proxy heuristics like ‘pieces captured’ if the tree search found otherwise.
Sadly, Williams passed away this February: https://www.currentobituary.com/member/obit/282438
Oh dang! RIP. I guess there’s a lesson there—probably more effort should be put into interviewing the pioneers of connectionism & related fields right now, while they have some perspective and before they all die off.
Oops.
Well, I didn’t say it, TurnTrout did.
This is actually pointing to the difference between online and offline learning algorithms, not RL versus non-RL learning algorithms. Online learning has long been known to be less stable than offline learning. That’s what’s primarily responsible for most “reward hacking”-esque results, such as the CoastRunners degenerate policy. In contrast, offline RL is surprisingly stable and robust to reward misspecification. I think it would have been better if the alignment community had been focused on the stability issues of online learning, rather than the supposed “agentness” of RL.
PPO may have been invented in 2017, but there are many prior RL algorithms for which Alex’s description of “reward as learning rate multiplier” is true. In fact, PPO is essentially a tweaked version of REINFORCE, for which a bit of searching brings up Simple statistical gradient-following algorithms for connectionist reinforcement learning as the earliest available reference I can find. It was published in 1992, a full 22 years before Bostrom’s book. In fact, “reward as learning rate multiplier” is even more clearly true of most of the update algorithms described in that paper. E.g., equation 11:
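For reference, the general REINFORCE update in Williams's paper is usually stated in the following form. This is a reconstruction from the standard statement of the rule, so the exact symbols should be checked against the paper:

$$\Delta w_{ij} = \alpha_{ij}\,(r - b_{ij})\,e_{ij}, \qquad e_{ij} = \frac{\partial \ln g_i}{\partial w_{ij}}$$

where $\alpha_{ij}$ is the learning rate, $r$ is the reward, $b_{ij}$ is the reinforcement baseline, and $e_{ij}$ is the characteristic eligibility of weight $w_{ij}$.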
Here, the reward (adjusted by a “reinforcement baseline” b_ij) literally just multiplies the learning rate. Beyond PPO and REINFORCE, this “x as learning rate multiplier” pattern is actually extremely common in different RL formulations. From lecture 7 of David Silver’s RL course:
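The policy gradient forms usually summarized in that lecture are sketched below. They are reconstructed from the standard presentation rather than copied from the slide, so the notation is an assumption:

$$
\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\, v_t\right] && \text{(REINFORCE)} \\
&= \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\, Q^{w}(s,a)\right] && \text{(Q actor-critic)} \\
&= \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\, A^{w}(s,a)\right] && \text{(advantage actor-critic)} \\
&= \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\, \delta\right] && \text{(TD actor-critic)}
\end{aligned}
$$

In each line, the term multiplying $\nabla_\theta \log \pi_\theta(s,a)$ plays the role of the "x" in "x as learning rate multiplier".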
To be honest, it was a major blackpill for me to see the rationalist community, whose whole founding premise was that they were supposed to be good at making efficient use of the available evidence, so completely missing this very straightforward interpretation of RL (at least, I’d never heard of it from alignment literature until I myself came up with it when I realized that the mechanistic function of per-trajectory rewards in a given batched update was to provide the weights of a linear combination of the trajectory gradients. Update: Gwern’s description here is actually somewhat similar).
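The "rewards are the weights of a linear combination of trajectory gradients" reading can be checked directly on a tiny example. This is a minimal sketch assuming a softmax policy over three actions with one-step trajectories; the function names and the numbers are illustrative only.

```python
import math
from typing import List, Sequence

def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits: Sequence[float], action: int) -> List[float]:
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def reinforce_step(logits: Sequence[float], actions: Sequence[int],
                   rewards: Sequence[float], lr: float = 0.1,
                   baseline: float = 0.0) -> List[float]:
    """Batched REINFORCE step: a reward-weighted sum of per-trajectory gradients."""
    n = len(actions)
    step = [0.0] * len(logits)
    for action, reward in zip(actions, rewards):
        g = grad_log_prob(logits, action)
        weight = lr * (reward - baseline) / n  # reward scales this trajectory's effective learning rate
        step = [s + weight * gi for s, gi in zip(step, g)]
    return step

logits = [0.0, 0.0, 0.0]
actions = [0, 1, 2, 0]
rewards = [1.0, 0.0, -1.0, 2.0]
step = reinforce_step(logits, actions, rewards)
doubled = reinforce_step(logits, actions, [2.0 * r for r in rewards])
ignored = reinforce_step(logits, [1], [0.0])
```

Doubling every reward doubles the step, and a trajectory with zero (baseline-adjusted) reward contributes nothing, which is the sense in which reward acts as a per-trajectory learning rate.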
When I bring up the “actual RL algorithms don’t seem very dangerous or agenty to me” point, people often respond with “Future algorithms will be different and more dangerous”.
I think this is a bad response for many reasons. In general, it serves as an unlimited excuse to never update on currently available evidence. It also has a bad track record in ML, as the core algorithmic structure of RL algorithms capable of delivering SOTA results has not changed that much in over 3 decades. In fact, just recently Cohere published Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs, which found that the classic REINFORCE algorithm actually outperforms PPO for LLM RLHF finetuning. Finally, this counterpoint seems irrelevant for Alex’s point in this post, which is about historical alignment arguments about historical RL algorithms. He even included disclaimers at the top about this not being an argument for optimism about future AI systems.
I wasn’t around in the community in 2010-2015, so I don’t know what the state of RL knowledge was at that time. However, I dispute the claim that rationalists “completely miss[ed] this [..] interpretation”:
Ever since I entered the community, I’ve definitely heard of people talking about policy gradient as “upweighting trajectories with positive reward/downweighting trajectories with negative reward” since 2016, albeit in person. I remember being shown a picture sometime in 2016/17 that looks something like this when someone (maybe Paul?) was explaining REINFORCE to me: (I couldn’t find it, so reconstructing it from memory)
In addition, I would be surprised if any of the CHAI PhD students when I was at CHAI from 2017 to 2021, many of whom have taken deep RL classes at Berkeley, missed this “upweight trajectories in proportion to their reward” interpretation. Most of us at the time had also implemented various RL algorithms from scratch, and there the “weighting trajectory gradients” perspective pops out immediately.
As another data point, when I taught MLAB/WMLB in 2022/3, my slides also contained this interpretation of REINFORCE (after deriving it) in so many words:
Insofar as people are making mistakes about reward and RL, it’s not due to having never been exposed to this perspective.
That being said, I do agree that there’s been substantial confusion in this community, mainly of two kinds:
Confusing the objective function being optimized to train a policy with how the policy is mechanistically implemented: Just because the outer loop is modifying/selecting for a policy to score highly on some objective function, does not necessarily mean that the resulting policy will end up selecting actions based on said objective.
Confusing “this policy is optimized for X” with “this policy is optimal for X”: this is the actual mistake I think Bostrom is making in Alex’s example. It’s true that an agent that wireheads achieves higher reward than on the training distribution (and the optimal agent for the reward achieves reward at least as good as wireheading). And I think that Alex and you would also agree with me that it’s sometimes valuable to reason about the global optima in policy space. But it’s a mistake to identify the outputs of optimization with the optimal solution to an optimization problem, and many people were making this jump without noticing it.
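The gap between "optimized for X" and "optimal for X" shows up even in one dimension. The sketch below is a toy illustration only: the two-peaked objective, the starting point, and the step size are assumptions picked to make the point visible.

```python
import math

def objective(x: float) -> float:
    """Two peaks: a small one near x = -1 and a larger one near x = 2."""
    return math.exp(-(x + 1.0) ** 2) + 3.0 * math.exp(-(x - 2.0) ** 2)

def gradient(x: float) -> float:
    return (-2.0 * (x + 1.0) * math.exp(-(x + 1.0) ** 2)
            - 6.0 * (x - 2.0) * math.exp(-(x - 2.0) ** 2))

def gradient_ascent(x: float, lr: float = 0.1, steps: int = 500) -> float:
    """Local optimization: follow the gradient from wherever we start."""
    for _ in range(steps):
        x += lr * gradient(x)
    return x

# "Optimized for" the objective: the output of a local optimization process.
x_optimized = gradient_ascent(-1.5)

# "Optimal for" the objective: the best point found by a coarse global grid search.
x_optimal = max((i / 100.0 for i in range(-400, 401)), key=objective)
```

The optimizer's output settles on the nearby small peak, so reasoning about the global optimum would mispredict where this particular optimization process ends up.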
Again, I contend these confusions were not due to a lack of exposure to the “rewards as weighting trajectories” perspective. Instead, the reasons I remember hearing back in 2017-2018 for why we should jump from “RL is optimizing agents for X” to “RL outputs agents that both optimize X and are optimal for X”:
We’d be really confused if we couldn’t reason about “optimal” agents, so we should solve that first. This is the main justification I heard from the MIRI people about why they studied idealized agents. Oftentimes globally optimal solutions are easier to reason about than local optima or saddle points, or are useful for operationalizing concepts. Because a lot of the community was focused on philosophical deconfusion (often w/ minimal knowledge of ML or RL), many people naturally came to jump the gap between “the thing we’re studying” and “the thing we care about”.
Reasoning about optima gives a better picture of powerful, future AGIs. Insofar as we’re far from transformative AI, you might expect that current AIs are a poor model for how transformative AI will look. In particular, you might expect that modeling transformative AI as optimal leads to clearer reasoning than analogizing them to current systems. This point has become increasingly tenuous since GPT-2, but
Some off-policy RL algorithms are well described as having a “reward” maximizing component: And, these were the approaches that people were using and thinking about at the time. For example, the most hyped results in deep learning in the mid 2010s were probably DQN and AlphaGo/GoZero/Zero. And many people believed that future AIs would be implemented via model-based RL. All of these approaches result in policies that contain an internal component which is searching for actions that maximize some learned objective. Given that ~everyone uses policy gradient variants for RL on SOTA LLMs, this does turn out to be incorrect ex post. But if the most impressive AIs seem to be implemented in ways that correspond to internal reward maximization, it does seem very understandable to think about AGIs as explicit reward optimizers.
This is how many RL pioneers reasoned about their algorithms. I agree with Alex that this is probably from the control theory roots, where a PID controller is well modeled as picking trajectories that minimize cost, in a way that early simple RL policies are not well modeled as internally picking trajectories that maximize reward.
Also, sometimes it is just the words being similar; it can be hard to keep track of the differences between “optimizing for”, “optimized for”, and “optimal for” in normal conversation.
I think if you want to prevent the community from repeating these confusions, this looks less like “here’s an alternative perspective through which you can view policy gradient” and more “here’s why reasoning about AGI as ‘optimal’ agents is misleading” and “here’s why reasoning about your 1 hidden layer neural network policy as if it were optimizing the reward is bad”.
An aside:
In general, I think that many ML-knowledgeable people (arguably myself included) correctly notice that the community is making many mistakes in reasoning, that they resolve internally using ML terminology or frames from the ML literature. But without reasoning carefully about the problem, the terminology or frames themselves are insufficient to resolve the confusion. (Notice how many Deep RL people make the same mistake!) And, as Alex and you have argued before, the standard ML frames and terminology introduce their own confusions (e.g. ‘attention’).
A shallow understanding of “policy gradient is just upweighting trajectories” may in fact lead to making the opposite mistake: assuming that it can never lead to intelligent, optimizer-y behavior. (Again, notice how many ML academics made exactly this mistake.) Or, more broadly, thinking about ML algorithms purely from the low-level, mechanistic frame can lead to confusions along the lines of “next token prediction can only lead to statistical parrots without true intelligence”. Doubly so if you’ve only worked with policy gradient or language modeling with tiny models.
Knowing how to reason about “upweighting trajectories” when explicitly prompted or in narrow contexts of algorithmic implementation is not sufficient to conclude “people basically knew this perspective” (but it’s certainly evidence). See Outside the Laboratory:
Knowing “vanilla PG upweights trajectories”, and being able to explain the math—this is not enough to save someone from the rampant reward confusions. Certainly Yoshua Bengio could explain vanilla PG, and yet he goes on about how RL (almost certainly, IIRC) trains reward maximizers.
I personally disagree—although I think your list of alternative explanations is reasonable. If alignment theorists had been using this (simple and obvious-in-retrospect) “reward chisels circuits into the network” perspective, if they had really been using it and felt it deep within their bones, I think they would not have been particularly tempted by this family of mistakes.
What’s the difference between “Alice is falling victim to confusions/reasoning mistakes about X” and “Alice disagrees with me about X”?
I feel like using the former puts undue social pressure on observers to conclude that you’re right, and makes it less likely they correctly adjudicate between the perspectives.
(Perhaps you can empathise with me here, since arguably certain people taking this sort of tone is one of the reasons AI x-risk arguments have not always been vetted as carefully as they should!)
I suspect that, for Alex Turner, writing the former instead of the latter is a signal that he thinks he has identified the specific confusion/reasoning mistake his interlocutor is engaged in, likely as a result of having seen closely analogous arguments in the past from other people who turned out (or even admitted) to be confused about these matters after conversations with him.
Do you have a reference to the problematic argument that Yoshua Bengio makes?
Valid point.
Critically though, none of Q, A, or delta denotes reward. Rather, they are quantities which are supposed to estimate the effect of an action on the sum of future rewards; hence while pure REINFORCE doesn’t really maximize the sum of rewards, these other algorithms are attempts to more consistently do so, and the existence of such attempts shows that it’s likely we will see more and better attempts in the future.
Bostrom’s book explicitly states what kinds of reinforcement learning algorithms he had in mind, and they are not REINFORCE:
Similarly, before I even got involved with alignment or rationalism, the canonical reinforcement learning algorithm I had heard of was TD, not REINFORCE.
Huh? DreamerV3 is clearly a step in the direction of utility maximization (away from “reward is not the optimization target”), and it claims to set SOTA on a bunch of problems. Are you saying there’s something wrong with their evaluation?