I was under the impression that PPO was a recently invented algorithm
Well, if we’re going to get historical, PPO is a relatively small variation on Williams’s REINFORCE policy gradient model-free RL algorithm from 1992 (or earlier if you count conferences etc), with a bunch of minor DL implementation tweaks that turn out to help a lot. I don’t offhand know of any ways in which PPO’s tweaks make it meaningfully different from REINFORCE from the perspective of safety, aside from the obvious ones of working better in practice. (Which is the main reason why PPO became OA’s workhorse in its model-free RL era to train small CNNs/RNNs, before they moved to model-based RL using Transformer LLMs. Policy gradient methods based on REINFORCE certainly were not novel, but they started scaling earlier.)
So, PPO is recent, yes, but that isn’t really important to anything here. TurnTrout could just as well have used REINFORCE as the example instead.
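A minimal sketch of the "small variation" point, assuming the standard clipped-surrogate form of PPO. The softmax policy, the function names, and all the numbers are made up for illustration. It compares the REINFORCE surrogate (advantage times log-probability) with PPO's clipped probability-ratio surrogate: at the data-collecting policy the two should have the same gradient, and PPO's gradient should vanish once the ratio has already moved past the clip range in the direction the advantage favors.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_surrogate(logits, action, advantage):
    """REINFORCE: advantage-weighted log-probability of the sampled action."""
    return advantage * math.log(softmax(logits)[action])

def ppo_clip_surrogate(logits, old_logits, action, advantage, eps=0.2):
    """PPO: probability ratio against the data-collecting policy, clipped."""
    ratio = softmax(logits)[action] / softmax(old_logits)[action]
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

def numerical_grad(f, x, h=1e-6):
    """Central finite-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

# Hypothetical policy parameters and a single sampled (action, advantage).
old_logits = [0.3, -1.2, 0.8]
action, advantage = 1, 2.5

# Gradients evaluated at the old policy itself (ratio == 1).
g_reinforce = numerical_grad(
    lambda l: reinforce_surrogate(l, action, advantage), old_logits)
g_ppo = numerical_grad(
    lambda l: ppo_clip_surrogate(l, old_logits, action, advantage), old_logits)

# Gradient at a policy that has already raised the action's probability a lot.
shifted_logits = [0.3, 1.0, 0.8]
g_ppo_far = numerical_grad(
    lambda l: ppo_clip_surrogate(l, old_logits, action, advantage), shifted_logits)

print(g_reinforce)
print(g_ppo)
print(g_ppo_far)
```

On this reading, the clip only changes what happens after the first step away from the old policy. It limits how far a batch of data can be reused, while the underlying learning signal stays the REINFORCE one.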
Did RL researchers in the 1990’s sit down and carefully analyze the inductive biases of PPO on huge 2026-era LLMs, conclude that PPO probably entrains LLMs which make decisions on the basis of their own reinforcement signal, and then decide to say “RL trains agents to maximize reward”? Of course not.
I don’t know how you (TurnTrout) can say that. It certainly seems to me that plenty of researchers in 1992 were talking about either model-based RL or using model-free approaches to ground model-based RL—indeed, it’s hard to see how anything else could work in connectionism, given that model-free methods are simpler, many animals or organisms do things that can be interpreted as model-free but not model-based (while all creatures who do model-based RL, like humans, clearly also do model-free), and so on. The model-based RL was the ‘cherry on the cake’, if I may put it that way… These arguments were admittedly handwavy: “if we can’t write AGI from scratch, then we can try to learn it from scratch starting with model-free approaches like Hebbian learning, and somewhere between roughly mouse-level and human/AGI, a miracle happens, and we get full model-based reasoning”. But hey, can’t argue with success there! We have loads of nice results from DeepMind and others with this sort of flavor†.
On the other hand, I’m not able to think of any dissenters who claimed that you could have AGI purely using model-free RL with no model-based RL anywhere to be seen. Like, you can imagine it working (eg. in silico environments for everything), but it’s not very plausible, since the computational requirements would seem to go astronomical fast.
Back then, they had a richer conception of RL, heavier on the model-based RL half of the field, and one more relevant to the current era, than the impoverished 2017 era of ‘let’s just PPO/Impala everything we can’t MCTS and not talk about how this is supposed to reach AGI, exactly, even if it scales reasonably well’. If you want to critique what AI researchers could imagine back in 1992, you should be reading Schmidhuber, not Bostrom. (“Computing is a pop culture”, as Kay put it, and DL, and DRL, are especially pop culture right now. Which is not necessarily a bad thing if you’re just trying to get things to work, but if you are going to make historical arguments about what people were or were not thinking in 2014, or 1992, pop culture isn’t going to cut the mustard. People back then weren’t stupid, and often had very sophisticated well-thought-out ideas & paradigms; they just had a millionth of the compute/data/infrastructure they needed to make any of it work properly...)
If you look at that REINFORCE paper, Williams isn’t even all that concerned with direct use of it to train a model to solve RL tasks.* He’s more concerned with handling non-differentiable things in general, like stochastic neurons rather than the usual deterministic ones we use, so you could have ‘backpropagate through the environment’ models like Schmidhuber & Huber 1990, which bootstrap from random initialization using the high-variance REINFORCE-like learning signal to a superior model. (Hm, why, that sounds like the sort of thing you might do if you analyze the inductive biases of model-free approaches which entrain larger systems which have their own internal reinforcement signals which they maximize...) As Schmidhuber has been saying for decades, it’s meta-learning all the way up/down. The species-level model-free RL algorithm (evolution) creates model-free within-lifetime learning algorithms (like REINFORCE), which create model-based within-lifetime learning algorithms (like neural net models) which create learning over families (generalization) for cross-task within-lifetime learning which create learning algorithms (ICL/history-based meta-learners**) for within-episode learning which create...
It’s no surprise that the “multiply a set of candidate entities by a fixed small percentage based on each entity’s reward” algorithm pops up everywhere from evolution to free markets to DRL to ensemble machine learning over ‘experts’, because that model-free algorithm is always available as the fallback strategy when you can’t do anything smarter (yet). Model-free is just the first step, and in many ways the least interesting & important step. I’m always weirded out to read one of these posts where something like PPO or evolution strategies is treated as the only RL algorithm around and things like expert iteration are an annoying nuisance to be relegated to a footnote: ‘reward is not the optimization target!* * except when it is in these annoying exceptions like AlphaZero, but fortunately, we can ignore these, because after all, it’s not like humans or AGI or superintelligences would ever do crazy stuff like “plan” or “reason” or “search”’.
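The “multiply by a small percentage based on reward” rule can be sketched in a few lines. This is a toy multiplicative-weights update; the function name, the update rate, and the fixed per-round rewards are assumptions chosen for illustration and are not taken from any particular system.

```python
def multiplicative_update(weights, rewards, rate=0.05):
    """Grow each candidate's weight by a small percentage scaled by its reward."""
    grown = [w * (1.0 + rate * r) for w, r in zip(weights, rewards)]
    total = sum(grown)
    return [g / total for g in grown]  # renormalize so weights stay a distribution

rewards = [0.2, 0.5, 0.9]   # hypothetical fixed per-round reward of each candidate
weights = [1 / 3] * 3       # start with no preference

for _ in range(500):
    weights = multiplicative_update(weights, rewards)

print(weights)  # nearly all the weight ends up on the highest-reward candidate
```

Nothing here models the environment or plans ahead. The update only needs each candidate's reward, which is why it is always available as a fallback.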
* He’d’ve probably been surprised to see people just… using it for stuff like DoTA2 on fully-differentiable BPTT RNNs. I wonder if he’s ever done any interviews on DL recently? AFAIK he’s still alive.
** Specifically, in the case of Transformers, it seems to be by self-attention doing gradient descent steps on an abstracted version of a problem; gradient descent itself isn’t a very smart algorithm, but if the abstract version is a model that encodes the correct sufficient statistics of the broader meta-problem, then it can be very easy to make Bayes-optimal predictions/choices for any specific problem.
† my paper-of-the-day website feature yesterday popped up “Learning few-shot imitation as cultural transmission”, Bhoopchand et al 2023 (excerpts) which is a nice example because they show clearly how history+diverse-environments+simple-priors-of-an-evolvable-sort elicit ‘inner’ model-like imitation learning starting from the initial ‘outer’ model-free RL algorithm (MPO, an actor-critic).
‘reward is not the optimization target!* *except when it is in these annoying exceptions like AlphaZero, but fortunately, we can ignore these, because after all, it’s not like humans or AGI or superintelligences would ever do crazy stuff like “plan” or “reason” or “search”’.
If you’re going to mock me, at least be correct when you do it!
I think that reward is still not the optimization target in AlphaZero (the way I’m using the term, at least). Learning a leaf node evaluator on a given reinforcement signal, and then bootstrapping the leaf node evaluator via MCTS on that leaf node evaluator, does not mean that the aggregate trained system
directly optimizes for the reinforcement signal, or
“cares” about that reinforcement signal,
or “does its best” to optimize the reinforcement signal (as opposed to some historical reinforcement correlate, like winning or capturing pieces or something stranger).
If most of the “optimization power” were coming from e.g. MCTS on direct reward signal, then yup, I’d agree that the reward signal is the primary optimization target of this system. That isn’t the case here.
You might use the phrase “reward as optimization target” differently than I do, but if we’re just using words differently, then it wouldn’t be appropriate to describe me as “ignoring planning.”
Learning a leaf node evaluator on a given reinforcement signal, and then bootstrapping the leaf node evaluator via MCTS on that leaf node evaluator, does not mean that the aggregate trained system
directly optimizes for the reinforcement signal, or
“cares” about that reinforcement signal,
or “does its best” to optimize the reinforcement signal (as opposed to some historical reinforcement correlate, like winning or capturing pieces or something stranger).
Yes, it does mean all of that, because MCTS is asymptotically optimal (unsurprisingly, given that it’s a tree search on the model), and will eg. happily optimize the reinforcement signal rather than proxies like capturing pieces as it learns through search that capturing pieces in particular states is not as useful as usual. If you expand out the search tree long enough (whether or not you use the AlphaZero NN to make that expansion more efficient by evaluating intermediate nodes and then back-propagating that through the current tree), then it converges on the complete, true, ground truth game tree, with all leaves evaluated with the true reward, with any imperfections in the leaf evaluator value estimate washed out. It directly optimizes the reinforcement signal, cares about nothing else, and is very pleased to lose if that results in a higher reward, or to not capture pieces if that results in a higher reward.*
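A toy illustration of the “washed out” claim, with loud caveats: this is plain depth-limited negamax rather than MCTS or AlphaZero, the game is a made-up subtraction game (take 1 to 3 stones, whoever takes the last stone wins), and the leaf evaluator is deliberately wrong. The only point is that once the search reaches terminal states, the evaluator is never consulted and the root value is the true game value.

```python
def negamax(stones, depth, leaf_eval):
    """Value for the player to move: +1 win, -1 loss, searched `depth` plies deep."""
    if stones == 0:
        return -1  # the opponent just took the last stone, so the mover has lost
    if depth == 0:
        return leaf_eval(stones)  # out of search budget: fall back on the evaluator
    return max(-negamax(stones - take, depth - 1, leaf_eval)
               for take in (1, 2, 3) if take <= stones)

def wrong_eval(stones):
    """Deliberately bad evaluator: always claims the player to move is winning."""
    return 1

def true_value(stones):
    """Known solution of this game: the mover loses exactly on multiples of 4."""
    return -1 if stones % 4 == 0 else 1

shallow = negamax(8, 0, wrong_eval)  # no search: just the evaluator's opinion
deep = negamax(8, 8, wrong_eval)     # deep enough to reach every terminal state

print(shallow, deep, true_value(8))
```

At intermediate depths the answer can still be wrong, since the bad estimates are backed up through the tree. The errors disappear only as the search budget grows, which is the asymptotic sense of the claim.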
All the NN is, is a cache or an amortization of the search algorithm. Caches are important and life would be miserable without them, but it would be absurd to say that adding a cache to a function means “that function doesn’t compute the function” or “the range is not the target of the function”.
I’m a little baffled by this argument that, because the NN is not already omniscient and might mis-estimate the value of a leaf node, it’s apparently not optimizing for the reward, that’s not the goal of the system, and the system doesn’t care about reward, no matter how much it converges toward said reward as it plans/searches more, or gets better at acquiring said reward as it fixes those errors.
If most of the “optimization power” were coming from e.g. MCTS on direct reward signal, then yup, I’d agree that the reward signal is the primary optimization target of this system.
The reward signal is in fact the primary optimization target, because it is where the neural net’s value estimates derive from, and the ‘system’ corrects them eventually and converges. The dog wags the tail, sooner or later.
* I think I’ve noted this elsewhere, and mentioned my Kelly coinflip trajectories as a nice visualization of how model-based RL will behave, but to repeat: MCTS algorithms in Go/chess were noted for that sort of behavior, especially for sacrificing pieces or territory while they were ahead, in order to ‘lock down’ the game and maximize the probability of victory, rather than the margin of victory; and vice-versa, for taking big risks when they were behind. Because the tree didn’t back-propagate any rewards on ‘margin’, just on 0/1 rewards from victory, and didn’t care about proxy heuristics like ‘pieces captured’ if the tree search found otherwise.
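The probability-versus-margin distinction in that footnote can be made concrete with a two-move toy example. The move names, probabilities, and score margins below are invented for illustration and do not come from any actual Go or chess program.

```python
# Hypothetical outcomes for two candidate moves while ahead:
# each entry is (probability, final score margin for the mover).
moves = {
    "lock_down": [(0.95, 1), (0.05, -1)],   # small but very likely win
    "go_big": [(0.70, 20), (0.30, -10)],    # big win or a real chance of losing
}

def expected_win(outcomes):
    """Expected 0/1 reward: total probability of finishing with a positive margin."""
    return sum(p for p, margin in outcomes if margin > 0)

def expected_margin(outcomes):
    """Expected score margin, the proxy a 0/1-rewarded searcher does not optimize."""
    return sum(p * margin for p, margin in outcomes)

best_by_win = max(moves, key=lambda m: expected_win(moves[m]))
best_by_margin = max(moves, key=lambda m: expected_margin(moves[m]))

print(best_by_win, best_by_margin)
```

A search that backs up only win/loss picks the first move and gives up expected margin to do it, which matches the ‘lock down’ behavior described above.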
He’d’ve probably been surprised to see people just… using it for stuff like DoTA2 on fully-differentiable BPTT RNNs. I wonder if he’s ever done any interviews on DL recently? AFAIK he’s still alive.
Well, if we’re going to get historical, PPO is a relatively small variation on Williams’s REINFORCE policy gradient model-free RL algorithm from 1992 (or earlier if you count conferences etc)
Sadly, Williams passed away this February: https://www.currentobituary.com/member/obit/282438
Oh dang! RIP. I guess there’s a lesson there—probably more effort should be put into interviewing the pioneers of connectionism & related fields right now, while they have some perspective and before they all die off.
Oops.
Well, I didn’t say it, TurnTrout did.