The notion of ‘predictive coding’ in the result you cite here is inconsistent with the notion of ‘predictive processing’ you cite later in the Pong-playing result.
You’re right. That’s not even the only reason my two uses of “predictive processing” are inconsistent. Another one is that the pong result has a concept of time whereas the backpropagation-equivalent “predictive processing” has no concept of time.
In other words, it doesn’t learn to plan ahead to minimize overall prediction error. It only learns to adjust weights locally to minimize local error.
You’re right again. But one implication of “Predictive Coding has been Unified with Backpropagation” is that local error minimization converges to global error minimization (for well-behaved functions).
So if you apply noise to the whole network, it doesn’t learn to actively avoid such noise. Each neuron would incorrectly blame itself for its own error, effectively introducing noise into the learning function.
Some of the neurons would be to blame. Other neurons are not to blame. The neurons that are to blame would be nudged in the right direction. The neurons that are not to blame would be nudged in random directions. I agree that this is a crude, noisy way to train a neural network. But there is a signal and the signal does point in the right direction. It’s not just a random walk.
You are correct that there is a random walk effect too and that the random walk also pushes the network away from pain.
You’re right again. But one implication of “Predictive Coding has been Unified with Backpropagation” is that local error minimization converges to global error minimization (for well-behaved functions).
I worried you would respond this way at this point in my message. I should have been more careful about what I meant by local vs global there. By “global” I wanted to say something like, tracking credit assignment beyond the causal network of the computation graph itself. That is to say, tracking the possibility that the decision at output 1 might have influence on the loss at output 2, even though output 2 doesn’t depend on output 1 in the network. For example, policy-gradient algorithms do this, while gradient descent on predictive accuracy doesn’t.
Some of the neurons would be to blame. Other neurons are not to blame. The neurons that are to blame would be nudged in the right direction. The neurons that are not to blame would be nudged in random directions. I agree that this is a crude, noisy way to train a neural network. But there is a signal and the signal does point in the right direction. It’s not just a random walk.
You are correct that there is a random walk effect too and that the random walk also pushes the network away from pain.
I’m not quite sure what claim you’re making and I suspect we still disagree. It seems to me like there’s an empirical disagreement which could be tested: namely, can we get NNs to play Pong without any RL technique?
I’m not sure exactly how to flesh out the claim, but I predict that the random noise thing would be a much weaker Pong player than policy gradient.
(Note that policy gradient isn’t an extremely strong method, itself; it doesn’t have any world model. In situations where the reward is high variance, it still treats it as a learning signal, superstitiously boosting/negating behaviors. A learner with a world model can learn that rewards are high variance in specific situations, and “ignore” the unreliable training signal, absorbing it into an estimation of the mean reward in those circumstances.
So the shortcomings of policy gradient might actually be comparable to the shortcomings of the apply-noise learning method; both sort of random-walk in the presence of uncontrollable risks. So I think it’s sort of a fair comparison.)
I’m not familiar with policy-gradient algorithms. When you write “tracking credit assignment beyond the causal network of the computation graph itself”, I don’t understand what you mean either. What do you mean?
It seems to me like there’s an empirical disagreement which could be tested: namely, can we get NNs to play Pong without any RL technique?…I’m not sure exactly how to flesh out the claim, but I predict that the random noise thing would be a much weaker Pong player than policy gradient.
This sounds like a concrete scenario we can disagree about, but when you predict that the random noise thing would be weaker than policy gradient, I agree with you—and I don’t even know what a policy gradient is. The random noise thing is awful. I just claim that it works. I don’t claim that it works well.
To clarify, I consider the random noise thing to be so convoluted it ends up in the grey zone between “RL” and “not RL”. I think you disagree, but I’m not sure in what direction. Do you consider the weird random noise thing to be a kind of RL?
I think we can get NNs to play pong without RL (for my definition of RL—yours may differ) but it’s complicated, I haven’t fully fleshed it out yet, and it’s even weirder than the random noise thing. The posts you’ve been reading and commenting on are building up to that. But we’re not there yet.
By the way, what, precisely, do you mean when you write “reinforcement learning”?
So, as you may know, RL is divided into model-based and model-free. I think of policy gradient as the most extremely model-free. Basically you take the summed reward of an entire episode to be a black-box training signal. You’re better off if you can estimate a baseline reward so that you know whether to treat an episode as ‘good’ or ‘bad’ to the extent it falls above/below baseline, but you don’t even need to do this. I think this is basically because if all rewards are positive, you can gradient toward ‘what I did this round’, and larger gradients will pull you more toward the good, while smaller gradients will still pull you toward the bad, but less so. Similar reasoning applies if some or all rewards are negative.
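To make the “black-box training signal” idea concrete, here is a minimal sketch of a REINFORCE-style policy gradient on a two-armed bandit, where each “episode” is a single pull. The function name `reinforce_bandit`, the arm means, the learning rate, and the running-average baseline are all assumptions I picked for illustration; none of it comes from the Pong experiment.

```python
import math
import random


def softmax(logits):
    """Turn raw preferences into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_means, steps=20000, lr=0.05, use_baseline=True, seed=0):
    """Hypothetical toy setup: learn a softmax policy over bandit arms.

    The reward is treated as an opaque scalar; the learner never models
    why a reward was high or low.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(arm_means)
    baseline = 0.0  # running estimate of the average reward
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = arm_means[action] + rng.gauss(0.0, 0.1)
        # With a baseline, episodes count as good or bad relative to average.
        signal = reward - baseline if use_baseline else reward
        for k in range(len(logits)):
            # d/d logit_k of log pi(action) for a softmax policy
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * signal * grad_log_prob
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)


if __name__ == "__main__":
    print(reinforce_bandit([1.0, 2.0], use_baseline=False))
    print(reinforce_bandit([1.0, 2.0], use_baseline=True))
```

With `use_baseline=False` and all-positive rewards, every pull raises the probability of the action just taken, but the better arm gets the larger push, so the policy should still drift toward it. The baseline does not change the expected update; it only reduces the variance.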
(Policy-gradient doesn’t actually need to be episodic, however. You can also blur out credit assignment based on temporal discounting, rather than discrete episodes.)
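As a rough illustration of blurring credit with temporal discounting, the sketch below computes a discounted return for every time step, so each action is credited with the rewards that came after it, weighted less the further away they are. The function name `discounted_returns` and the discount factor `gamma` are my own placeholder choices.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * G_{t+1} for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # A single reward at the end is smeared back over earlier steps.
    print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

In a policy-gradient update, these per-step returns would stand in for the single summed episode reward as the training signal for each action.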
When you write “tracking credit assignment beyond the causal network of the computation graph itself”, I don’t understand what you mean either. What do you mean?
This is absolutely central to my point, so I’ll spend a while trying to make this clear.
Imagine a feedforward neural network doing supervised learning. For each output generated, you get a loss, which you can attribute 100% to the output. You can then work back—you know how to causally attribute the output value to its inputs one layer back. You know how to causally attribute those to the next layer back. And so on.
This is like an assembly-line factory where you are trying to optimize some feature of the output, based on near-complete understanding & control of the causal system of the factory. You can attribute features of the output to one stage back in the assembly line. You can attribute what happens one stage back to what happened one stage before that. And so on. If you aren’t satisfied with the output, you can work your way back and fiddle with whatever contributed to the output, because the entire causal system is under your control.
(The same idea applies to recurrent NNs; we just need to track the activations back in time, rather than only through a single pass through the network. Factories are, of course, more like recurrent NNs than feedforward NNs, since variables like supply of raw materials, & state of repair of machines, will vary over time based on how we run the factory.)
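The layer-by-layer attribution in the supervised case can be sketched with a deliberately tiny network: one input, one hidden unit, one output. Everything here (the `tanh` hidden unit, the squared-error loss, the names `forward` and `backward`) is an assumption made to keep the example small. The point is only that each step of blame assignment follows a known causal link inside the network.

```python
import math


def forward(w1, w2, x):
    """Two-stage 'assembly line': x -> hidden h -> output y."""
    h = math.tanh(w1 * x)
    y = w2 * h
    return h, y


def loss(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return (y - target) ** 2


def backward(w1, w2, x, target):
    """Attribute the loss to each weight by working back one stage at a time."""
    h, y = forward(w1, w2, x)
    dloss_dy = 2.0 * (y - target)           # loss is fully attributed to the output
    dloss_dw2 = dloss_dy * h                # output attributed to the last weight
    dloss_dh = dloss_dy * w2                # ...and to the hidden activation
    dloss_dw1 = dloss_dh * (1.0 - h * h) * x  # hidden activation attributed to w1
    return dloss_dw1, dloss_dw2


if __name__ == "__main__":
    print(backward(0.7, -1.3, 0.5, 0.2))
```

Nothing outside the network enters the calculation, which is exactly what stops being true in the reinforcement learning case.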
Now imagine a neural network doing reinforcement learning, in a POMDP environment. You can’t attribute the reward this round to the action this round, so you can’t work your way backwards to fix the root cause of a problem in the same way—you don’t have the full causal model, because an important part of the causality leading to this specific reward is outside, in the environment. In general, the next reward might depend on any or all of your past actions.
This is more like a company trying to maximize profits. You can still tweak the assembly-line as much as you want, but your quarterly profits will depend on what you do in a mysterious way. Bad quarterly profits might be due to a bad reputation which you got based on your products from two years ago. Although you know your factory, you can’t firmly attribute profits in a given quarter to factory performance in a specific quarter.
Model-based RL solves this by making a causal model of the environment, so that we can do credit assignment via our current estimated causal model. We don’t know that our current slump in sales is due to the bad products we shipped two years ago, but it might be our best guess. In which case, we think we’ve already taken appropriate actions to correct our assembly-line performance, so all we can do for now is spend some more money on advertising, or what-have-you.
Model-free RL solves this problem without forming a specific causal model of the environment. Instead, credit has to be assigned broadly. This is metaphorically like giving everyone stock options, so that everyone in the company gets punished/rewarded together. (Although this metaphor isn’t great, because it commits the homunculus fallacy by ascribing agency to all the little pieces—makes sense for companies, not so much for NNs. Really it’s more like we’re adjusting all the pieces of the assembly line all the time, based on the details of the model-free alg.)
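One way to see “credit assigned broadly” in code is weight perturbation: jiggle every parameter at once, observe a single scalar reward, and push all parameters in proportion to that one number. To be clear, this is a swapped-in illustration. It is neither policy gradient nor the noise-based method from the Pong discussion, and the reward function, target values, and hyperparameters below are made up.

```python
import random

# Hypothetical optimum, hidden from the learner.
TARGET = [1.0, -2.0, 0.5]


def black_box_reward(params):
    """The learner only ever sees this scalar, never its internals."""
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))


def perturbation_learning(steps=2000, sigma=0.1, lr=0.05, seed=0):
    rng = random.Random(seed)
    params = [0.0, 0.0, 0.0]
    for _ in range(steps):
        noise = [rng.gauss(0.0, sigma) for _ in params]
        plus = black_box_reward([p + n for p, n in zip(params, noise)])
        minus = black_box_reward([p - n for p, n in zip(params, noise)])
        # One shared scalar: every parameter is rewarded or punished together.
        scale = (plus - minus) / (2.0 * sigma ** 2)
        params = [p + lr * scale * n for p, n in zip(params, noise)]
    return params


if __name__ == "__main__":
    print(perturbation_learning())
```

Any individual nudge is mostly noise for any individual parameter, but averaged over many steps the shared signal points uphill, which is the stock-options picture without the agency.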
This sounds like a concrete scenario we can disagree about, but when you predict that the random noise thing would be weaker than policy gradient, I agree with you—and I don’t even know what a policy gradient is. The random noise thing is awful. I just claim that it works. I don’t claim that it works well.
It sounds like we won’t be able to get much empirical traction this way, then.
My question to you is: what’s so interesting about the PP analysis of the Pong experiment, then, if you agree that the random-noise-RL thing doesn’t work very well compared to alternatives? IE, why derive excitement about a particular deep theory about intelligence (PP) based on a really dumb learning algorithm (noise-based RL)?
I’m not saying this is dumb. I have expressed excitement about generalizations of the Pavlov learning strategy despite it being a really dumb, terrible RL algorithm. (Indeed, it has some similarity to the noise-RL idea!) That’s because this learning algorithm, though dumb, accomplishes something which others don’t (namely, coordinating on Pareto-optimal outcomes in multi-agent situations, while being selfishly rational in single-agent situations, all without using a world-model that distinguishes “agents” from “non-agents” in any way). The success of this dumb learning method gives me some hope that smarter methods accomplishing the same thing might exist.
So when I say “what’s so interesting here, if you agree the algorithm is pretty dumb”, it’s not entirely rhetorical.
Thank you for the explanations. They were crystal-clear.
[W]hat’s so interesting about the PP analysis of the Pong experiment, then, if you agree that the random-noise-RL thing doesn’t work very well compared to alternatives? IE, why derive excitement about a particular deep theory about intelligence (PP) based on a really dumb learning algorithm (noise-based RL)?
What alternatives? Do you mean like flooding the network with neurotransmitters? Or do you mean like the stuff we use in ML? There are lots of better-performing algorithms that you can implement on an electronic computer, but many of them just won’t run on evolved biological neuron cells.
The Pong experiment caught my attention because it is relatively simple on the hardware side, which means it could have evolved very early in the evolutionary chain.
Well, I guess another alternative (also very simple on the hardware side) would be the generalization of the Pavlov strategy which I mentioned earlier. This also has the nice feature that lots of little pieces with their own simple ‘goals’ can coalesce into one agent-like strategy, and it furthermore works with as much or little communication as you give it (so there’s not automatically a communication overhead to achieve the coordination).
However, I won’t try to argue that it’s plausible that biological brains are using something like that.
I guess the basic answer to my question is that you’re quite motivated by biological plausibility. There are many reasons why this might be, so I shouldn’t guess at the specific motives.
For myself, I tend to be uninterested in biologically plausible algorithms if it’s easy to point at other algorithms which do better with similar efficiency on computers. (Although results like the equivalence between predictive coding and gradient descent can be interesting for other reasons.) I think bounded computation has to do with important secrets of intelligence, but for example, I find logical induction to be a deeper theory of bounded rationality than (my understanding of) predictive processing. Predictive processing seems closer to “getting excited about some specific approximation methods” whereas logical induction seems closer to “principled understanding of what good bounded reasoning even means” (and in particular, obsoletes the idea that bounded rationality is about approximating, in my mind).
I guess the basic answer to my question is that you’re quite motivated by biological plausibility. There are many reasons why this might be, so I shouldn’t guess at the specific motives.
You’re right. I want to know how my own brain works.
But if you’re more interested in a broader mathematical understanding of how intelligence, in general, works, then that could explain some of our motivational disconnect.
An important question is whether PP contains some secrets of intelligence which are critical for AI alignment. I think some intelligent people think the answer is yes. But the biological motivation doesn’t especially point to this (I think). If you have any arguments for such a conclusion I would be curious to hear them.
Maybe? It depends a lot on how I interpret your question. I’m trying to keep these posts contained and so I’d rather not answer that question in this thread.