I’m not familiar with policy-gradient algorithms. When you write “tracking credit assignment beyond the causal network of the computation graph itself”, I don’t understand what you mean either. What do you mean?
It seems to me like there’s an empirical disagreement which could be tested: namely, can we get NNs to play Pong without any RL technique?…I’m not sure exactly how to flesh out the claim, but I predict that the random noise thing would be a much weaker Pong player than policy gradient.
This sounds like a concrete scenario we can disagree about, but when you predict that the random noise thing would be weaker than policy gradient, I agree with you—and I don’t even know what a policy gradient is. The random noise thing is awful. I just claim that it works. I don’t claim that it works well.
To clarify, I consider the random noise thing to be so convoluted it ends up in the grey zone between “RL” and “not RL”. I think you disagree, but I’m not sure in what direction. Do you consider the weird random noise thing to be a kind of RL?
I think we can get NNs to play Pong without RL (for my definition of RL—yours may differ), but it’s complicated, I haven’t fully fleshed it out yet, and it’s even weirder than the random noise thing. The posts you’ve been reading and commenting on are building up to that. But we’re not there yet.
By the way, what, precisely, do you mean when you write “reinforcement learning”?
So, as you may know, RL is divided into model-based and model-free. I think of policy gradient as the most extreme form of model-free. Basically, you take the summed reward of an entire episode to be a black-box training signal. You’re better off if you can estimate a baseline reward, so that you treat an episode as ‘good’ or ‘bad’ to the extent it falls above/below that baseline, but you don’t even need to do this. I think this is basically because, if all rewards are positive, you can still take a gradient step toward ‘what I did this round’: larger rewards give larger gradients that pull you more toward the good, while smaller rewards give smaller gradients that still pull you toward the bad, but less so. Similar reasoning applies if some or all rewards are negative.
(Policy-gradient doesn’t actually need to be episodic, however. You can also blur out credit assignment based on temporal discounting, rather than discrete episodes.)
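To make that concrete, here is a minimal sketch of the episodic policy-gradient idea (REINFORCE with an optional baseline), using a tiny tabular softmax policy. The environment hooks, sizes, and hyperparameters below are illustrative placeholders, not taken from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))   # logits of the softmax policy

def policy(state):
    """Action probabilities for one state."""
    logits = theta[state]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def run_episode(env_reset, env_step, horizon=50):
    """Collect one episode; return the (state, action) pairs and summed reward."""
    s, trajectory, total_reward = env_reset(), [], 0.0
    for _ in range(horizon):
        a = rng.choice(n_actions, p=policy(s))
        trajectory.append((s, a))
        s, r, done = env_step(s, a)
        total_reward += r
        if done:
            break
    return trajectory, total_reward

def reinforce_update(trajectory, total_reward, baseline=0.0, lr=0.1):
    """Treat the episode's summed reward as one black-box signal: every
    action taken is nudged up or down by (total_reward - baseline)."""
    advantage = total_reward - baseline
    for s, a in trajectory:
        grad = -policy(s)      # d(log pi)/d(logits) = one_hot(a) - pi
        grad[a] += 1.0
        theta[s] += lr * advantage * grad
```

Note that the update never asks which action in the episode was responsible; every action taken gets pushed by the same scalar advantage.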
When you write “tracking credit assignment beyond the causal network of the computation graph itself”, I don’t understand what you mean either. What do you mean?
This is absolutely central to my point, so I’ll spend a while trying to make this clear.
Imagine a feedforward neural network doing supervised learning. For each output generated, you get a loss, which you can attribute 100% to the output. You can then work back—you know how to causally attribute the output value to its inputs one layer back. You know how to causally attribute those to the next layer back. And so on.
This is like an assembly-line factory where you are trying to optimize some feature of the output, based on near-complete understanding & control of the causal system of the factory. You can attribute features of the output to one stage back in the assembly line. You can attribute what happens one stage back to what happened one stage before that. And so on. If you aren’t satisfied with the output, you can work your way back and fiddle with whatever contributed to the output, because the entire causal system is under your control.
(The same idea applies to recurrent NNs; we just need to track the activations back in time, rather than only through a single pass through the network. Factories are, of course, more like recurrent NNs than feedforward NNs, since variables like supply of raw materials, & state of repair of machines, will vary over time based on how we run the factory.)
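To make the feedforward picture concrete, here is a tiny numpy sketch of that backward pass: the loss is attributed entirely to the output and then pushed back one layer at a time via the chain rule. The shapes and data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))           # one input example
y = rng.normal(size=(1, 1))           # its supervised target
W1 = rng.normal(size=(3, 4)) * 0.1
W2 = rng.normal(size=(4, 1)) * 0.1

# Forward pass: every intermediate value is known and recorded.
h_pre = x @ W1
h = np.tanh(h_pre)
out = h @ W2
loss = 0.5 * np.sum((out - y) ** 2)

# Backward pass: credit flows from the loss to the output...
d_out = out - y                       # dL/d_out
# ...from the output back to the last layer's weights and activations...
d_W2 = h.T @ d_out
d_h = d_out @ W2.T
# ...and from those activations back through the previous layer.
d_hpre = d_h * (1 - h ** 2)           # derivative of tanh
d_W1 = x.T @ d_hpre

# Each parameter's share of the blame is exactly determined, because the
# whole causal chain from input to loss lives inside the computation graph.
W1 -= 0.1 * d_W1
W2 -= 0.1 * d_W2
```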
Now imagine a neural network doing reinforcement learning, in a POMDP environment. You can’t attribute the reward this round to the action this round, so you can’t work your way backwards to fix the root cause of a problem in the same way—you don’t have the full causal model, because an important part of the causality leading to this specific reward is outside, in the environment. In general, the next reward might depend on any or all of your past actions.
This is more like a company trying to maximize profits. You can still tweak the assembly-line as much as you want, but your quarterly profits will depend on what you do in a mysterious way. Bad quarterly profits might be due to a bad reputation which you got based on your products from two years ago. Although you know your factory, you can’t firmly attribute profits in a given quarter to factory performance in a specific quarter.
Model-based RL solves this by making a causal model of the environment, so that we can do credit assignment via our current estimated causal model. We don’t know that our current slump in sales is due to the bad products we shipped two years ago, but it might be our best guess. In which case, we think we’ve already taken appropriate actions to correct our assembly-line performance, so all we can do for now is spend some more money on advertising, or what-have-you.
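One way to make “credit assignment via an estimated causal model” concrete: once you have a model (hard-coded here for brevity, rather than learned), you can re-simulate counterfactuals for past actions even though the payoff arrives much later. The toy dynamics and reward below are invented purely for illustration.

```python
def model_step(state, action):
    """Estimated dynamics: predicts the next state from the current one."""
    return state + (1 if action == 1 else -1)

def model_reward(state):
    """Estimated reward: only states near the goal pay off."""
    return 1.0 if abs(state - 5) <= 1 else 0.0

def rollout_return(start_state, actions):
    """Simulate a whole action sequence inside the model."""
    s, total = start_state, 0.0
    for a in actions:
        s = model_step(s, a)
        total += model_reward(s)
    return total

def credit_by_counterfactual(start_state, actions):
    """Blame/credit each past action by flipping it and re-simulating:
    the drop in modeled return is that action's estimated contribution."""
    base = rollout_return(start_state, actions)
    credits = []
    for t in range(len(actions)):
        flipped = list(actions)
        flipped[t] = 1 - flipped[t]
        credits.append(base - rollout_return(start_state, flipped))
    return credits

print(credit_by_counterfactual(0, [1, 1, 1, 0, 1, 1]))
```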
Model-free RL solves this problem without forming a specific causal model of the environment. Instead, credit has to be assigned broadly. This is metaphorically like giving everyone stock options, so that everyone in the company gets punished/rewarded together. (Although this metaphor isn’t great, because it commits the homunculus fallacy by ascribing agency to all the little pieces—makes sense for companies, not so much for NNs. Really it’s more like we’re adjusting all the pieces of the assembly line all the time, based on the details of the model-free alg.)
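A toy illustration of that “adjust every piece at once based on one number” flavor, using a generic random-perturbation scheme (chosen for brevity, and not necessarily the exact “random noise thing” under discussion):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=16) * 0.1     # the whole "assembly line" at once

def evaluate(w):
    """Black-box episode return for a parameter vector (placeholder task)."""
    target = np.linspace(-1, 1, w.size)
    return -np.sum((w - target) ** 2)

best_reward = evaluate(weights)
for _ in range(1000):
    noise = rng.normal(size=weights.size) * 0.05
    candidate = weights + noise          # jiggle every piece together
    reward = evaluate(candidate)
    if reward > best_reward:             # keep the jiggle iff the outcome improved
        weights, best_reward = candidate, reward
```

No part of the system is singled out as responsible for a good or bad outcome; the scalar reward adjusts everything together.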
This sounds like a concrete scenario we can disagree about, but when you predict that the random noise thing would be weaker than policy gradient, I agree with you—and I don’t even know what a policy gradient is. The random noise thing is awful. I just claim that it works. I don’t claim that it works well.
It sounds like we won’t be able to get much empirical traction this way, then.
My question to you is: what’s so interesting about the PP analysis of the Pong experiment, then, if you agree that the random-noise-RL thing doesn’t work very well compared to alternatives? IE, why derive excitement about a particular deep theory about intelligence (PP) based on a really dumb learning algorithm (noise-based RL)?
I’m not saying this is dumb. I have expressed excitement about generalizations of the Pavlov learning strategy despite it being a really dumb, terrible RL algorithm. (Indeed, it has some similarity to the noise-RL idea!) That’s because this learning algorithm, though dumb, accomplishes something which others don’t (namely, coordinating on pareto-optimal outcomes in multi-agent situations, while being selfishly rational in single-agent situations, all without using a world-model that distinguishes “agents” from “non-agents” in any way). The success of this dumb learning method gives me some hope that smarter methods accomplishing the same thing might exist.
So when I say “what’s so interesting here, if you agree the algorithm is pretty dumb”, it’s not entirely rhetorical.
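For concreteness, here is the textbook two-player version of the Pavlov (“win-stay, lose-shift”) rule in an iterated prisoner’s dilemma; the generalization gestured at above is not shown, and the payoff table and satisficing threshold are the standard illustrative ones.

```python
# Standard prisoner's dilemma payoffs: (my payoff, opponent payoff).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}
SATISFICING = 3            # payoffs at or above this count as a "win"

def pavlov_next(my_move, my_payoff):
    """Win-stay, lose-shift: repeat the last move if it paid off, else switch."""
    if my_payoff >= SATISFICING:
        return my_move
    return "D" if my_move == "C" else "C"

a, b = "D", "C"            # start uncoordinated
history = []
for _ in range(10):
    pa, pb = PAYOFF[(a, b)]
    history.append((a, b, pa, pb))
    a, b = pavlov_next(a, pa), pavlov_next(b, pb)

# Two Pavlov players quickly lock into mutual cooperation (C, C), a
# pareto-optimal outcome, without either modeling the other as an agent.
print(history)
```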
Thank you for the explanations. They were crystal-clear.
[W]hat’s so interesting about the PP analysis of the Pong experiment, then, if you agree that the random-noise-RL thing doesn’t work very well compared to alternatives? IE, why derive excitement about a particular deep theory about intelligence (PP) based on a really dumb learning algorithm (noise-based RL)?
What alternatives? Do you mean like flooding the network with neurotransmitters? Or do you mean like the stuff we use in ML? There’s lots of better-performing algorithms that you can implement on an electronic computer, but many of them just won’t run on evolved biological neuron cells.
The Pong experiment caught my attention because it is relatively simple on the hardware side, which means it could have evolved very early in the evolutionary chain.
Well, I guess another alternative (also very simple on the hardware side) would be the generalization of the Pavlov strategy which I mentioned earlier. This also has the nice feature that lots of little pieces with their own simple ‘goals’ can coalesce into one agent-like strategy, and it furthermore works with as much or as little communication as you give it (so there’s not automatically a communication overhead to achieve the coordination).
However, I won’t try to argue that it’s plausible that biological brains are using something like that.
I guess the basic answer to my question is that you’re quite motivated by biological plausibility. There are many reasons why this might be, so I shouldn’t guess at the specific motives.
For myself, I tend to be uninterested in biologically plausible algorithms if it’s easy to point at other algorithms which do better with similar efficiency on computers. (Although results like the equivalence between predictive coding and gradient descent can be interesting for other reasons.) I think bounded computation has to do with important secrets of intelligence, but, for example, I find logical induction to be a deeper theory of bounded rationality than (my understanding of) predictive processing—predictive processing seems closer to “getting excited about some specific approximation methods” whereas logical induction seems closer to “principled understanding of what good bounded reasoning even means” (and in particular, obsoletes the idea that bounded rationality is about approximating, in my mind).
I guess the basic answer to my question is that you’re quite motivated by biological plausibility. There are many reasons why this might be, so I shouldn’t guess at the specific motives.
You’re right. I want to know how my own brain works.
But if you’re more interested in a broader mathematical understanding of how intelligence, in general, works, then that could explain some of our motivational disconnect.
An important question is whether PP contains some secrets of intelligence which are critical for AI alignment. I think some intelligent people think the answer is yes. But the biological motivation doesn’t especially point to this (I think). If you have any arguments for such a conclusion, I would be curious to hear them.
Maybe? It depends a lot on how I interpret your question. I’m trying to keep these posts contained and so I’d rather not answer that question in this thread.