I’m guessing you are just significantly more skeptical of both predictive coding and the [predictive coding → backprop] link than I am… perhaps because the other hypotheses on my list are less plausible to you?
Thanks for the great back-and-forth! Did you guys see the first author’s comment? What are the main updates you’ve had re this debate now that it’s been a couple years?
I have not thought about these issues too much in the intervening time. Re-reading the discussion, it sounds plausible to me that the evidence is compatible with roughly brain-sized NNs being roughly as data-efficient as humans. Daniel claims:
If we assume for humans it’s something like 1 second on average (because our brains are evaluating-and-updating weights etc. on about that timescale) then we have a mere 10^9 data points, which is something like 4 OOMs less than the scaling laws would predict. If instead we think it’s longer, then the gap in data-efficiency grows.
I think the human observation-reaction loop is closer to ten times that fast, which results in a 3 OOM difference. This sounds like a gap which is big, but could potentially be explained by architectural differences or other factors, thus preserving a possibility like “human learning is more-or-less gradient descent”. Without articulating the various hypotheses in more detail, this doesn’t seem like strong evidence in any direction.
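To make the arithmetic behind the 4 OOM and 3 OOM figures easy to check, here is a minimal sketch. The 30-year span is my assumption (it is roughly what 10^9 seconds works out to), and the 10^13 "predicted" figure is just 10^9 times the 4 OOMs mentioned in the quote, not a number taken from any scaling-law paper. The function names are made up for illustration.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def human_data_points(years, updates_per_second):
    """Rough count of 'data points' if the brain updates at a fixed rate."""
    return years * SECONDS_PER_YEAR * updates_per_second


def oom_gap(predicted, actual):
    """Gap between two counts, in orders of magnitude."""
    return math.log10(predicted / actual)


# Assumption: about 30 years of experience, which is close to 10^9 seconds.
slow = human_data_points(years=30, updates_per_second=1)
fast = human_data_points(years=30, updates_per_second=10)

# Assumption: the requirement implied by the quote (10^9 points, 4 OOMs short).
predicted = 1e13

print(f"1 update/s:   {slow:.1e} points, gap {oom_gap(predicted, slow):.1f} OOM")
print(f"10 updates/s: {fast:.1e} points, gap {oom_gap(predicted, fast):.1f} OOM")
```

Every factor of ten in the assumed update rate moves the gap by exactly one OOM, which is why the conclusion is so sensitive to what counts as a "data point".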
Not before now. I think the comment had a relatively high probability in my world, where we still have a poor idea of what algorithm the brain is running, and a low probability in Daniel’s world, where evidence is zooming in on predictive coding as the correct hypothesis. Some quotes which I think support my hypothesis better than Daniel’s:
If we (speculatively) associate alpha/beta waves with iterations in predictive coding,
This illustrates how we haven’t pinned down the mechanical parts of algorithms. What this means is that speculation about the algorithm of the brain isn’t yet causally grounded—it’s not as if we’ve been looking at what’s going on and can build up a firm abstract picture of the algorithm from there, the way you might successfully infer rules of traffic by watching a bunch of cars. Instead, we have a bunch of different kinds of information at different resolutions, which we are still trying to stitch together into a coherent picture.
While it’s often claimed that predictive coding is biologically plausible and the best explanation for cortical function, this isn’t really all that clear cut. Firstly, predictive coding itself actually has a bunch of implausibilities. Predictive coding suffers from the same weight transport problem as backprop, and secondly it requires that the prediction and prediction error neurons are 1-1 (i.e. one prediction error neuron for every prediction neuron) which is way too precise connectivity to actually happen in the brain. I’ve been working on ways to adapt predictive coding around these problems as in this paper (https://arxiv.org/pdf/2010.01047.pdf), but this work is currently very preliminary and it’s unclear if the remedies proposed here will scale to larger architectures.
This directly addresses the question of how clear-cut things are right now, while also pointing to many concrete problems the predictive coding hypothesis faces. The comment continues on that subject for several more paragraphs.
The brain being able to do backprop does not mean that the brain is just doing gradient descent like we do to train ANNs. It is still very possible (in my opinion likely) that the brain could be using a more powerful algorithm for inference and learning—just one that has backprop as a subroutine. Personally (and speculatively) I think it’s likely that the brain performs some highly parallelized advanced MCMC algorithm like Hamiltonian MCMC where each neuron or small group of neurons represents a single ‘particle’ following its own MCMC path. This approach naturally uses the stochastic nature of neural computation to its advantage, and allows neural populations to represent the full posterior distribution rather than just a point prediction as in ANNs.
This paragraph supports my picture that hypotheses about what the brain is doing are still largely being pulled from ML, which speaks against the hypothesis of a growing consensus about what the brain is doing, and also illustrates the lack of direct looking-at-the-brain-and-reporting-what-we-see.
On the other hand, it seems quite plausible that this particular person is especially enthusiastic about analogizing ML algorithms and the brain, since that is what they work on; in which case, this might not be so much evidence about the state of neuroscience as a whole. Some neuroscientist could come in and tell us why all of this stuff is bunk, or perhaps why Predictive Coding is right and all of the other ideas are wrong, or perhaps why the MCMC thing is right and everything else is wrong, etc etc.
But I take it that Daniel isn’t trying to claim that there is a consensus in the field of neuroscience; rather, he’s probably trying to claim that the actual evidence is piling up in favor of predictive coding. I don’t know. Maybe it is. But this particular domain expert doesn’t seem to think so, based on the SSC comment.
To give my position somewhat more detail:
I think the methods of neuroscience are mostly not up to the task. This is based on the paper which applied neuroscience methods to try to reverse-engineer the CPU.
I think what we have are essentially a bunch of guesses about functionality based on correlations and fairly blunt interventional methods (lesioning), combined with the ideas we’ve come up with about what kinds of algorithms the brain might be running (largely pulling from artificial intelligence for ideas).
It makes a lot of sense to me that the brain does something resembling belief propagation on bayes nets. (I take this to be the core idea of predictive coding.) However:
There are a lot of different algorithms resembling belief prop. Sticking within the big tent of “variational methods”, there are a lot of different variational objectives, which result in different algorithms. The brain could be using a variation which we’re unfamiliar with. This could result in significant differences from backprop. (I’m still fond of Hinton’s analogy between contrastive divergence and dreaming, for example. It’s a bit like saying that dreams are GAN-generated adversarial examples, and the brain trains to anti-learn these examples during the night, which results in improved memory consolidation and conceptual clarity during the day. Isn’t that a nice story?)
There are a lot of graphical models besides Bayesian networks. Many of them are “basically the same”, but for example SPNs (sum-product networks) are very different. There’s a sense in which Bayesian networks assume everything is neatly organized into variables already, while SPNs don’t. Also, SPNs are fundamentally faster, so the convergence step in the paper (the step which makes predictive coding 100x slower than belief prop) becomes fast. So SPNs could be a very reasonable alternative, which might not amount to backprop as we know it.
I think it could easily be that the neocortex is explained by some version of predictive coding, but other important elements of the brain are not. In particular, I think the numerical logic of reinforcement learning isn’t easily and efficiently captured via graphical models. I could be ignorant here, but what I know of attempts to fit RL into a predictive-processing paradigm ended up using multiplicative rewards rather than additive (so, you multiply in the new reward rather than adding), simply because adding up a bunch of stuff isn’t natural in graphical models. I think that’s a sign that it’s not the right paradigm.
Radical Probabilism / Logical Uncertainty / Logical Induction makes it generally seem pretty probable, almost necessary, that there’s also some “non-Bayesian” stuff going on in the brain (ie generalized-bayesian, ie non-bayesian updates). This doesn’t seem well-described by predictive coding. This could easily be enough to ruin the analogy between the brain and backprop.
And finally, reiterating the earlier point: there are other algorithms which are more data-efficient than backprop. If humans appear to be more efficient than backprop, then it seems plausible that humans are using a more data-efficient algorithm.
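To illustrate the SPN point above, here is a minimal sketch of a toy sum-product network over two binary variables. The structure and all the weights are made up for illustration. What it shows is only the standard property that an SPN answers joint and marginal queries with a single bottom-up pass, with no iterative convergence step; it is not a claim about how the brain or the paper's algorithm works.

```python
from itertools import product

# Hypothetical mixture components: (weight, P(X1=1), P(X2=1)).
COMPONENTS = [(0.6, 0.9, 0.2), (0.4, 0.1, 0.7)]


def leaf(p_one, value):
    """Bernoulli leaf. value=None marginalizes the variable (leaf evaluates to 1)."""
    if value is None:
        return 1.0
    return p_one if value == 1 else 1.0 - p_one


def spn(x1, x2):
    """One bottom-up pass: product nodes over the leaves, weighted sum at the root."""
    return sum(w * leaf(p1, x1) * leaf(p2, x2) for w, p1, p2 in COMPONENTS)


# The joint distribution is normalized.
joint_total = sum(spn(a, b) for a, b in product([0, 1], repeat=2))

# A marginal costs the same single pass as a joint query.
marginal_x1 = spn(1, None)

print(joint_total, marginal_x1, spn(1, 0) + spn(1, 1))
```

The marginal query does not enumerate the values of X2; it just evaluates the same network with that leaf set to 1, so the cost stays linear in the size of the network.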
As for the [predictive coding → backprop] link, well, that’s not a crux for me right now, because I was mainly curious why you think such a link, if true, would be evidence against “the brain uses something other than backprop”. I think I now understand why you would think that, though not yet what the mounting evidence is.
I think my main crux is the question: (for some appropriate architecture, ie, not necessarily transformers) do human-brain-sized networks, with human-like opportunities for transfer learning, achieve human-level data-efficiency? If so, I have no objection to the hypothesis that the brain uses something more-or-less equivalent to gradient descent.
Thanks for this reply!
--I thought the paper about the methods of neuroscience applied to computers was cute, and valuable, but I don’t think it’s fair to conclude “methods are not up to the task.” But you later said that “It makes a lot of sense to me that the brain does something resembling belief propagation on bayes nets. (I take this to be the core idea of predictive coding.)” So you aren’t a radical skeptic about what we can know about the brain, and maybe we don’t disagree after all.
1–3: OK, I think I’ll defer to your expertise on these points.
4, 5: Whoa whoa, just because we humans do some non-bayesian stuff and some better-than-backprop stuff doesn’t mean that the brain isn’t running pure bayes nets or backprop-approximation or whatever at the low level! That extra fancy cool stuff we do could be happening at a higher level of abstraction. Networks in the brain learned via backprop-approximation could themselves be doing the logical induction stuff and the super-efficient-learning stuff. In which case we should expect that big NN’s trained via backprop might also stumble across similar networks which would then do similarly cool stuff.
Indeed. Your crux is my question and my crux is your question. (My crux was: Does the brain, at the low level, use something more or less equivalent to the stuff modern NN’s do at a low level? From this I hoped to decide whether human-brain-sized networks could have human-level efficiency)
How could this address point #5? If GD is slow, then GD would be slow to learn faster learning methods.
All of the following are intended as concrete examples against the pure-bayes-brain hypothesis, not as evidence against the brain doing some form of GD:
One thing the brain could be doing under the hood is some form of RL using value-prop. This is difficult to represent in Bayes nets. The attempts I’ve seen end up making reward multiplicative rather than additive across time, which makes sense because Bayes nets are great at multiplying things but not so great at representing additive structures. I think this is OK (we could regard it as an exponential transform of usual reward) until we want to represent temporal discounting. Another problem: representing this via graphical models means representing the full distribution over reward values rather than a point estimate, which is inefficient compared with regular tabular RL.
Another thing the brain could be doing under the hood is “memory-network” style reasoning which learns a policy for utilizing various forms of memory (visual working memory, auditory working memory, episodic memory, semantic memory...) for reasoning. Because this is fundamentally about logical uncertainty (being unsure about the outcome at the end of some mental work), it’s not very well-represented by Bayesian models. It probably makes more sense to use (model-free) RL to learn how to use WM.
Of course both of those objections could be overcome with a specific sort of work, showing how to represent the desired algorithm in bayes nets.
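The "exponential transform" remark in the RL example can be made concrete with a minimal sketch. The reward values and function names are made up. The discounting part is one way of seeing where things might get awkward, under my reading: the per-step factors need a time-dependent exponent rather than a simple extra multiplier. It is not necessarily the exact difficulty meant above.

```python
import math


def additive_return(rewards, gamma=1.0):
    """Usual discounted sum of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def multiplicative_return(rewards, gamma=1.0):
    """Product of per-step factors exp(r_t) ** (gamma ** t)."""
    result = 1.0
    for t, r in enumerate(rewards):
        result *= math.exp(r) ** (gamma ** t)
    return result


def naive_discounted_product(rewards, gamma):
    """Tempting but different: multiply the discount in as another factor."""
    result = 1.0
    for t, r in enumerate(rewards):
        result *= (gamma ** t) * math.exp(r)
    return result


rewards = [1.0, -0.5, 2.0]  # hypothetical rewards
gamma = 0.9

# Undiscounted: a product of exp(r_t) is exactly exp of the summed reward.
undiscounted_ok = math.isclose(
    multiplicative_return(rewards), math.exp(additive_return(rewards))
)

# Discounted: the identity survives only with exponents that depend on t.
discounted_ok = math.isclose(
    multiplicative_return(rewards, gamma), math.exp(additive_return(rewards, gamma))
)

# Multiplying gamma ** t in as a plain factor gives a different quantity.
naive_matches = math.isclose(
    naive_discounted_product(rewards, gamma), math.exp(additive_return(rewards, gamma))
)

print(undiscounted_ok, discounted_ok, naive_matches)
```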
As for GD:
My back of the envelope calculation suggests that GPT-3 has trained on 7 orders of magnitude more data than a 10yo has experienced in their lifetime. Of course a different NN architecture (+ different task, different loss functions, etc) could just be that much more efficient than transformers; but overall, this doesn’t look good for the human-GD hypothesis.
Maybe your intention is to argue that we use GD with a really good prior, though! This seems much harder to dismiss.
Where does the gradient come from? Providing a gradient is a difficult problem which requires intelligence.
Even if the raw loss(/reward) function is simple and fixed, it’s difficult to turn that into a gradient for learning, because you don’t know how to attribute punishment/loss to specific outputs (actions or cognitive acts). The dumb method, policy-gradient, is highly data-inefficient because it attributes reward/punishment to all recent actions (frequently providing spurious gradients which adjust weights up/down noisily).
But, quite possibly, the raw loss/reward function is not simple/fixed, but rather, requires significant inference itself. An example of this is imprinting.
The last two sub-points only argue “against GD” in so far as you mean to suggest that the brain “just uses GD” (where “just” is doing a lot of work). My claim there is that more learning principles are needed (for example, model-based learning) to understand what’s going on.
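The credit-assignment complaint about policy gradient can be seen in a toy setting. This is a minimal sketch with a made-up environment: five independent binary actions per episode, where only the first action earns reward. The score-function (REINFORCE) estimate still hands a nonzero, noisy gradient to the four irrelevant actions on every rewarded episode; it only averages out to zero over many episodes.

```python
import random
import statistics


def sample_episode(rng, probs):
    """Sample independent binary actions; only the first one earns reward."""
    actions = [1 if rng.random() < p else 0 for p in probs]
    reward = float(actions[0])
    return actions, reward


def reinforce_gradients(actions, reward, probs):
    """Score-function estimate for Bernoulli logits: reward * (a_t - p_t)."""
    return [reward * (a - p) for a, p in zip(actions, probs)]


rng = random.Random(0)  # fixed seed so the run is reproducible
probs = [0.5] * 5       # hypothetical policy: every action is a coin flip

samples = []
for _ in range(20000):
    actions, reward = sample_episode(rng, probs)
    samples.append(reinforce_gradients(actions, reward, probs))

per_action = list(zip(*samples))
means = [statistics.fmean(g) for g in per_action]
spreads = [statistics.pstdev(g) for g in per_action]

for t, (m, s) in enumerate(zip(means, spreads)):
    print(f"action {t}: mean gradient {m:+.3f}, per-episode spread {s:.3f}")
```

The irrelevant actions have a true gradient of zero, yet their per-episode estimates are spread more widely than those of the one action that matters, so many episodes are needed before the signal stands out from the spurious updates.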
Since the brain is difficult to pin down but ML experiments are not, I would think the more natural direction of inference would be to check the scaling laws and see whether it’s plausible that the brain is within the same regime.