It does provide a small amount of evidence against it, because it shows one specific algorithm is “basically backprop”. Maybe you’re saying this is significant evidence, because we have some evidence that predictive coding is also the algorithm the brain actually uses.
But we also know there are algorithms which are way more data-efficient than NNs (while being more processing-power intensive). So wouldn’t the obvious conclusion from our observations be: humans don’t use backprop, but rather, use more data-efficient algorithms?
I’ll grant, I’m now quite curious how the scaling argument works out. Is it plausible that human-brain-sized NNs are as data-efficient as humans?
I guess I was thinking: Brains use predictive coding, and predictive coding is basically backprop, so brains can’t be using something dramatically better than backprop. You are objecting to the “brains use predictive coding” step? Or are you objecting that only one particular version of predictive coding is basically backprop?
But we also know there are algorithms which are way more data-efficient than NNs (while being more processing-power intensive). So wouldn’t the obvious conclusion from our observations be: humans don’t use backprop, but rather, use more data-efficient algorithms?
Are you referring to Solomonoff Induction and the like? I think the “brains use more data-efficient algorithms” is an obvious hypothesis but not an obvious conclusion—there are several competing hypotheses, outlined above. (And I think the evidence against it is mounting, this being one of the key pieces.)
I’ll grant, I’m now quite curious how the scaling argument works out. Is it plausible that human-brain-sized NNs are as data-efficient as humans?
In terms of bits/pixels/etc., humans see plenty of data in their lifetime, a bit more than the scaling laws would predict IIRC. But the scaling laws (as interpreted by Ajeya, Rohin, etc.) are about the amount of subjective time the model needs to run before you can evaluate the result. If we assume for humans it’s something like 1 second on average (because our brains are evaluating-and-updating weights etc. on about that timescale) then we have a mere 10^9 data points, which is something like 4 OOMs less than the scaling laws would predict. If instead we think it’s longer, then the gap in data-efficiency grows.
Some issues though. One, the scaling laws might not be the same for all architectures. Maybe if your context window is bigger, or you use recurrence, or whatever, the laws are different. Too early to tell, at least for me (maybe others have more confident opinions, I’d love to hear them!). Two, some data is higher-quality than other data, and plausibly human data is higher-quality than the stuff GPT-3 was fed; e.g. humans deliberately seek out data that teaches them stuff they want to know, instead of just dully staring at a firehose of random stuff. Three, it’s not clear how to apply this to humans anyway. Maybe our neurons are updating a hundred times a second or something.
I’d be pretty surprised if a human-brain-sized Transformer was able to get as good as a human at most important human tasks simply by seeing a firehose of 10^9 images or context windows of internet data. But I’d also be pretty surprised (10%) if the scaling laws turn out to be so universal that we can’t get around them; if it turns out that transformative tasks really do require a NN at least the size of a human brain trained for at least 10^14 steps or so where each step involves running the NN for at least a subjective week. (Subjective second, I’d find more plausible. Or subjective week (or longer) but with fewer than 10^14 steps.)
You are objecting to the “brains use predictive coding” step? Or are you objecting that only one particular version of predictive coding is basically backprop?
Yeah, somewhere along that spectrum. Generally speaking, I’m skeptical of claims that we know a lot about the brain.
Are you referring to Solomonoff Induction and the like?
I was more thinking of genetic programming.
I think the “brains use more data-efficient algorithms” is an obvious hypothesis but not an obvious conclusion—there are several competing hypotheses, outlined above.
I agree with this.
(And I think the evidence against it is mounting, this being one of the key pieces.)
(I still don’t see why.)
--I wouldn’t characterize my own position as “we know a lot about the brain.” I think we should taboo “a lot.”
--We are at an impasse here I guess—I think there’s mounting evidence that brains use predictive coding and mounting evidence that predictive coding is like backprop. I agree it’s not conclusive but this paper seems to be pushing in that direction and there are others like it IIRC. I’m guessing you just are significantly more skeptical of both predictive coding and the predictive coding --> backprop link than I am… perhaps because the other hypotheses on my list are less plausible to you?
I think what we have are essentially a bunch of guesses about functionality based on correlations and fairly blunt interventional methods (lesioning), combined with the ideas we’ve come up with about what kinds of algorithms the brain might be running (largely pulling from artificial intelligence for ideas).
I’m guessing you just are significantly more skeptical of both predictive coding and the predictive coding --> backprop link than I am… perhaps because the other hypotheses on my list are less plausible to you?
It makes a lot of sense to me that the brain does something resembling belief propagation on bayes nets. (I take this to be the core idea of predictive coding.) However:
There are a lot of different algorithms resembling belief prop. Sticking within the big tent of “variational methods”, there are a lot of different variational objectives, which result in different algorithms. The brain could be using a variation which we’re unfamiliar with. This could result in significant differences from backprop. (I’m still fond of Hinton’s analogy between contrastive divergence and dreaming, for example. It’s a bit like saying that dreams are GAN-generated adversarial examples, and the brain trains to anti-learn these examples during the night, which results in improved memory consolidation and conceptual clarity during the day. Isn’t that a nice story?)
There are a lot of graphical models besides Bayesian networks. Many of them are “basically the same”, but for example SPNs (sum-product networks) are very different. There’s a sense in which Bayesian networks assume everything is neatly organized into variables already, while SPNs don’t. Also, SPNs are fundamentally faster, so the convergence step in the paper (the step which makes predictive coding 100x slower than belief prop) becomes fast. So SPNs could be a very reasonable alternative, which might not amount to backprop as we know it.
I think it could easily be that the neocortex is explained by some version of predictive coding, but other important elements of the brain are not. In particular, I think the numerical logic of reinforcement learning isn’t easily and efficiently captured via graphical models. I could be ignorant here, but what I know of attempts to fit RL into a predictive-processing paradigm ended up using multiplicative rewards rather than additive (so, you multiply in the new reward rather than adding), simply because adding up a bunch of stuff isn’t natural in graphical models. I think that’s a sign that it’s not the right paradigm.
Radical Probabilism / Logical Uncertainty / Logical Induction makes it generally seem pretty probable, almost necessary, that there’s also some “non-Bayesian” stuff going on in the brain (ie generalized-bayesian, ie non-bayesian updates). This doesn’t seem well-described by predictive coding. This could easily be enough to ruin the analogy between the brain and backprop.
And finally, reiterating the earlier point: there are other algorithms which are more data-efficient than backprop. If humans appear to be more efficient than backprop, then it seems plausible that humans are using a more data-efficient algorithm.
As for the [predictive coding → backprop] link, well, that’s not a crux for me right now, because I was mainly curious why you think such a link, if true, would be evidence against “the brain uses something other than backprop”. I think I understand why you would think that, now, sans what the mounting evidence is.
I think my main crux is the question: (for some appropriate architecture, ie, not necessarily transformers) do human-brain-sized networks, with human-like opportunities for transfer learning, achieve human-level data-efficiency? If so, I have no objection to the hypothesis that the brain uses something more-or-less equivalent to gradient descent.
--I thought the paper about the methods of neuroscience applied to computers was cute, and valuable, but I don’t think it’s fair to conclude “methods are not up to the task.” But you later said that “It makes a lot of sense to me that the brain does something resembling belief propagation on bayes nets. (I take this to be the core idea of predictive coding.)” so you aren’t a radical skeptic about what we can know about the brain so maybe we don’t disagree after all.
1 − 3: OK, I think I’ll defer to your expertise on these points.
4, 5: Whoa whoa, just because we humans do some non-bayesian stuff and some better-than-backprop stuff doesn’t mean that the brain isn’t running pure bayes nets or backprop-approximation or whatever at the low level! That extra fancy cool stuff we do could be happening at a higher level of abstraction. Networks in the brain learned via backprop-approximation could themselves be doing the logical induction stuff and the super-efficient-learning stuff. In which case we should expect that big NN’s trained via backprop might also stumble across similar networks which would then do similarly cool stuff.
I think my main crux is the question: (for some appropriate architecture, ie, not necessarily transformers) do human-brain-sized networks, with human-like opportunities for transfer learning, achieve human-level data-efficiency?
Indeed. Your crux is my question and my crux is your question. (My crux was: Does the brain, at the low level, use something more or less equivalent to the stuff modern NN’s do at a low level? From this I hoped to decide whether human-brain-sized networks could have human-level efficiency)
4, 5: Whoa whoa, just because we humans do some non-bayesian stuff and some better-than-backprop stuff doesn’t mean that the brain isn’t running pure bayes nets or backprop-approximation or whatever at the low level! That extra fancy cool stuff we do could be happening at a higher level of abstraction. Networks in the brain learned via backprop-approximation could themselves be doing the logical induction stuff and the super-efficient-learning stuff. In which case we should expect that big NN’s trained via backprop might also stumble across similar networks which would then do similarly cool stuff.
How could this address point #5? If GD is slow, then GD would be slow to learn faster learning methods.
All of the following is intended as concrete examples against the pure-bayes-brain hypothesis, not as evidence against the brain doing some form of GD:
One thing the brain could be doing under the hood is some form of RL using value-prop. This is difficult to represent in Bayes nets. The attempts I’ve seen end up making reward multiplicative rather than additive across time, which makes sense because Bayes nets are great at multiplying things but not so great at representing additive structures. I think this is OK (we could regard it as an exponential transform of usual reward) until we want to represent temporal discounting. Another problem with this is: representing via graphical models means representing the full distribution over reward values, rather than a point estimate. But this is inefficient compared with regular tabular RL.
Another thing the brain could be doing under the hood is “memory-network” style reasoning which learns a policy for utilizing various forms of memory (visual working memory, auditory working memory, episodic memory, semantic memory...) for reasoning. Because this is fundamentally about logical uncertainty (being unsure about the outcome at the end of some mental work), it’s not very well-represented by Bayesian models. It probably makes more sense to use (model-free) RL to learn how to use WM.
Of course both of those objections could be overcome with a specific sort of work, showing how to represent the desired algorithm in bayes nets.
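To make the multiplicative-vs-additive point above concrete, here is a minimal sketch of the "exponential transform" reading. The reward values and the discount factor are made up for illustration. It checks that multiplying per-step factors exp(r_t) is the same as exponentiating the summed reward, and that once a discount γ is added, each factor has to be raised to a time-dependent power γ^t. That may be one way to see why discounting is the point where the multiplicative encoding stops being tidy, though this is only my illustration, not a claim about any particular predictive-processing model.

```python
import math

# Hypothetical per-step rewards, chosen only for illustration.
rewards = [1.0, -0.5, 2.0, 0.25]

# Additive return, as in ordinary RL.
additive_return = sum(rewards)

# Multiplicative encoding: one factor exp(r_t) per time step.
factors = [math.exp(r) for r in rewards]
multiplicative_return = math.prod(factors)

# The product of the factors is the exponential of the additive return.
assert math.isclose(multiplicative_return, math.exp(additive_return))

# With temporal discounting, the additive return weights step t by gamma**t.
gamma = 0.9  # assumed discount factor
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))

# In the multiplicative encoding, each factor must now be raised to gamma**t,
# so the factor for a given reward depends on when it occurs.
discounted_factors = [f ** (gamma**t) for t, f in enumerate(factors)]
assert math.isclose(math.prod(discounted_factors), math.exp(discounted_return))

print(additive_return, discounted_return)
```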
As for GD:
My back of the envelope calculation suggests that GPT-3 has trained on 7 orders of magnitude more data than a 10yo has experienced in their lifetime. Of course a different NN architecture (+ different task, different loss functions, etc) could just be that much more efficient than transformers; but overall, this doesn’t look good for the human-GD hypothesis.
Maybe your intention is to argue that we use GD with a really good prior, though! This seems much harder to dismiss.
Even if the raw loss(/reward) function is simple and fixed, it’s difficult to turn that into a gradient for learning, because you don’t know how to attribute punishment/loss to specific outputs (actions or cognitive acts). The dumb method, policy-gradient, is highly data inefficient due to attributing reward/punishment to all recent actions (frequently providing spurious gradients which adjust weights up/down noisily).
But, quite possibly, the raw loss/reward function is not simple/fixed, but rather, requires significant inference itself. An example of this is imprinting.
The last two sub-points only argue “against GD” in so far as you mean to suggest that the brain “just uses GD” (where “just” is doing a lot of work). My claim there is that more learning principles are needed (for example, model-based learning) to understand what’s going on.
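The credit-assignment complaint about policy-gradient can be illustrated with a toy simulation. This is a sketch under assumptions I am making up for the example: five independent coin-flip actions per episode, a reward that depends only on the first action, and the plain score-function (REINFORCE-style) estimator with no baseline. All names and constants are hypothetical.

```python
import random

random.seed(0)

NUM_ACTIONS = 5        # actions per episode (assumed)
NUM_EPISODES = 20_000  # number of simulated episodes (assumed)
P = 0.5                # each action is 1 with probability 0.5

def episode_gradient():
    """Return one episode's score-function gradient estimate per action."""
    actions = [1 if random.random() < P else 0 for _ in range(NUM_ACTIONS)]
    # Only action 0 affects the reward; actions 1..4 are irrelevant.
    reward = 1.0 if actions[0] == 1 else 0.0
    # For a Bernoulli policy with a logit parameter theta_t,
    # d/d(theta_t) log pi(a_t) = a_t - P. The same episode reward
    # scales every action's term, relevant or not.
    return [reward * (a - P) for a in actions]

grads = [episode_gradient() for _ in range(NUM_EPISODES)]

# Average gradient per action over all episodes.
mean_grad = [sum(g[t] for g in grads) / NUM_EPISODES for t in range(NUM_ACTIONS)]

# Fraction of single episodes in which each action received a nonzero update.
nonzero_fraction = [
    sum(1 for g in grads if g[t] != 0.0) / NUM_EPISODES
    for t in range(NUM_ACTIONS)
]

print("mean gradient per action:", mean_grad)
print("fraction of episodes with a nonzero update:", nonzero_fraction)
```

In this toy setup the averaged gradient for the irrelevant actions should come out near zero, while the one for action 0 should be near 0.25. The point is that individual episodes still push the irrelevant parameters up or down about half the time, so many samples are spent averaging that noise away.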
Indeed. Your crux is my question and my crux is your question. (My crux was: Does the brain, at the low level, use something more or less equivalent to the stuff modern NN’s do at a low level? From this I hoped to decide whether human-brain-sized networks could have human-level efficiency)
Since the brain is difficult to pin down but ML experiments are not, I would think the more natural direction of inference would be to check the scaling laws and see whether it’s plausible that the brain is within the same regime.
Thanks for the great back-and-forth! Did you guys see the first author’s comment? What are the main updates you’ve had re this debate now that it’s been a couple years?
I have not thought about these issues too much in the intervening time. Re-reading the discussion, it sounds plausible to me that the evidence is compatible with roughly brain-sized NNs being roughly as data-efficient as humans. Daniel claims:
If we assume for humans it’s something like 1 second on average (because our brains are evaluating-and-updating weights etc. on about that timescale) then we have a mere 10^9 data points, which is something like 4 OOMs less than the scaling laws would predict. If instead we think it’s longer, then the gap in data-efficiency grows.
I think the human observation-reaction loop is closer to ten times that fast, which results in a 3 OOM difference. This sounds like a gap which is big, but could potentially be explained by architectural differences or other factors, thus preserving a possibility like “human learning is more-or-less gradient descent”. Without articulating the various hypotheses in more detail, this doesn’t seem like strong evidence in any direction.
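The arithmetic behind the 4 OOM and 3 OOM figures can be sketched as follows. The 30-year lifetime is my assumption, picked because it gives roughly 10^9 seconds. The 10^13 scaling-law figure is not quoted anywhere in the thread; it is just what "10^9 data points, 4 OOMs less than predicted" implies.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # roughly 3.16e7

def lifetime_data_points(years, updates_per_second):
    """Data points seen if one 'data point' is one evaluate-and-update cycle."""
    return years * SECONDS_PER_YEAR * updates_per_second

def oom_gap(predicted, observed):
    """Orders of magnitude by which `observed` falls short of `predicted`."""
    return math.log10(predicted) - math.log10(observed)

YEARS = 30            # assumed lifetime, not stated in the thread
PREDICTED = 1e13      # implied by "10^9 ... 4 OOMs less than predicted"

one_per_second = lifetime_data_points(YEARS, 1)    # Daniel's 1-second assumption
ten_per_second = lifetime_data_points(YEARS, 10)   # the "ten times that fast" loop

print(f"1 Hz:  {one_per_second:.2e} points, gap {oom_gap(PREDICTED, one_per_second):.1f} OOMs")
print(f"10 Hz: {ten_per_second:.2e} points, gap {oom_gap(PREDICTED, ten_per_second):.1f} OOMs")
```

Changing `YEARS` or the update rate shifts the gap only logarithmically, so the conclusion is fairly insensitive to the exact lifetime but quite sensitive to what counts as one data point.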
Not before now. I think the comment had a relatively high probability in my world, where we still have a poor idea of what algorithm the brain is running, and a low probability in Daniel’s world, where evidence is zooming in on predictive coding as the correct hypothesis. Some quotes which I think support my hypothesis better than Daniel’s:
If we (speculatively) associate alpha/beta waves with iterations in predictive coding,
This illustrates how we haven’t pinned down the mechanical parts of algorithms. What this means is that speculation about the algorithm of the brain isn’t yet causally grounded—it’s not as if we’ve been looking at what’s going on and can build up a firm abstract picture of the algorithm from there, the way you might successfully infer rules of traffic by watching a bunch of cars. Instead, we have a bunch of different kinds of information at different resolutions, which we are still trying to stitch together into a coherent picture.
While it’s often claimed that predictive coding is biologically plausible and the best explanation for cortical function, this isn’t really all that clear cut. Firstly, predictive coding itself actually has a bunch of implausibilities. Predictive coding suffers from the same weight transport problem as backprop, and secondly it requires that the prediction and prediction error neurons are 1-1 (i.e. one prediction error neuron for every prediction neuron) which is way too precise connectivity to actually happen in the brain. I’ve been working on ways to adapt predictive coding around these problems as in this paper (https://arxiv.org/pdf/2010.01047.pdf), but this work is currently very preliminary and it’s unclear if the remedies proposed here will scale to larger architectures.
This directly addresses the question of how clear-cut things are right now, while also pointing to many concrete problems the predictive coding hypothesis faces. The comment continues on that subject for several more paragraphs.
The brain being able to do backprop does not mean that the brain is just doing gradient descent like we do to train ANNs. It is still very possible (in my opinion likely) that the brain could be using a more powerful algorithm for inference and learning—just one that has backprop as a subroutine. Personally (and speculatively) I think it’s likely that the brain performs some highly parallelized advanced MCMC algorithm like Hamiltonian MCMC where each neuron or small group of neurons represents a single ‘particle’ following its own MCMC path. This approach naturally uses the stochastic nature of neural computation to its advantage, and allows neural populations to represent the full posterior distribution rather than just a point prediction as in ANNs.
This paragraph supports my picture that hypotheses about what the brain is doing are still largely being pulled from ML, which speaks against the hypothesis of a growing consensus about what the brain is doing, and also illustrates the lack of direct looking-at-the-brain-and-reporting-what-we-see.
On the other hand, it seems quite plausible that this particular person is especially enthusiastic about analogizing ML algorithms and the brain, since that is what they work on; in which case, this might not be so much evidence about the state of neuroscience as a whole. Some neuroscientist could come in and tell us why all of this stuff is bunk, or perhaps why Predictive Coding is right and all of the other ideas are wrong, or perhaps why the MCMC thing is right and everything else is wrong, etc etc.
But I take it that Daniel isn’t trying to claim that there is a consensus in the field of neuroscience; rather, he’s probably trying to claim that the actual evidence is piling up in favor of predictive coding. I don’t know. Maybe it is. But this particular domain expert doesn’t seem to think so, based on the SSC comment.
--Human brains have special architectures, various modules that interact in various ways (priors?)
--Human brains don’t use Backprop; maybe they have some sort of even-better algorithm
This is a funny distinction to me. These things seem like two ends of a spectrum (something like, the physical scale of “one unit of structure”; predictive coding is few-neuron-scale, modules are big-brain-chunk scale; in between, there’s micro-columns, columns, lamina, feedback circuits, relays, fiber bundles; and below predictive coding there’s the rules for dendrite and synapse change).
I wouldn’t characterize my own position as “we know a lot about the brain.” I think we should taboo “a lot.”
I think there’s mounting evidence that brains use predictive coding
Are you saying, there’s mounting evidence that predictive coding screens off all lower levels from all higher levels? Like all high-level phenomena are the result of predictive coding, plus an architecture that hooks up bits of predictive coding together?
How does it “rule out” the last one??
It does provide a small amount of evidence against it, because it shows one specific algorithm is “basically backprop”. Maybe you’re saying this is significant evidence, because we have some evidence that predictive coding is also the algorithm the brain actually uses.
But we also know there are algorithms which are way more data-efficient than NNs (while being more processing-power intensive). So wouldn’t the obvious conclusion from our observations be: humans don’t use backprop, but rather, use more data-efficient algorithms?
I’ll grant, I’m now quite curious how the scaling argument works out. Is it plausible that human-brain-sized NNs are as data-efficient as humans?
I guess I was thinking: Brains use predictive coding, and predictive coding is basically backprop, so brains can’t be using something dramatically better than backprop. You are objecting to the “brains use predictive coding” step? Or are you objecting that only one particular version of predictive coding is basically backprop?
Are you referring to Solomonoff Induction and the like? I think the “brains use more data-efficient algorithms” is an obvious hypothesis but not an obvious conclusion—there are several competing hypotheses, outlined above. (And I think the evidence against it is mounting, this being one of the key pieces.)
In terms of bits/pixels/etc., humans see plenty of data in their lifetime, a bit more than the scaling laws would predict IIRC. But the scaling laws (as interpreted by Ajeya, Rohin, etc.) are about the amount of subjective time the model needs to run before you can evaluate the result. If we assume for humans it’s something like 1 second on average (because our brains are evaluating-and-updating weights etc. on about that timescale) then we have a mere 10^9 data points, which is something like 4 OOMs less than the scaling laws would predict. If instead we think it’s longer, then the gap in data-efficiency grows.
Some issues though. One, the scaling laws might not be the same for all architectures. Maybe if your context window is bigger, or you use recurrence, or whatever, the laws are different. Too early to tell, at least for me (maybe others have more confident opinions, I’d love to hear them!). Two, some data is higher-quality than other data, and plausibly human data is higher-quality than the stuff GPT-3 was fed; e.g. humans deliberately seek out data that teaches them stuff they want to know, instead of just dully staring at a firehose of random stuff. Three, it’s not clear how to apply this to humans anyway. Maybe our neurons are updating a hundred times a second or something.
I’d be pretty surprised if a human-brain-sized Transformer was able to get as good as a human at most important human tasks simply by seeing a firehose of 10^9 images or context windows of internet data. But I’d also be pretty surprised (10%) if the scaling laws turn out to be so universal that we can’t get around them; if it turns out that transformative tasks really do require a NN at least the size of a human brain trained for at least 10^14 steps or so where each step involves running the NN for at least a subjective week. (Subjective second, I’d find more plausible. Or subjective week (or longer) but with fewer than 10^14 steps.)
Yeah, somewhere along that spectrum. Generally speaking, I’m skeptical of claims that we know a lot about the brain.
I was more thinking of genetic programming.
I agree with this.
(I still don’t see why.)
--I wouldn’t characterize my own position as “we know a lot about the brain.” I think we should taboo “a lot.”
--We are at an impasse here I guess—I think there’s mounting evidence that brains use predictive coding and mounting evidence that predictive coding is like backprop. I agree it’s not conclusive but this paper seems to be pushing in that direction and there are others like it IIRC. I’m guessing you just are significantly more skeptical of both predictive coding and the predictive coding --> backprop link than I am… perhaps because the other hypotheses on my list are less plausible to you?
To give my position somewhat more detail:
I think the methods of neuroscience are mostly not up to the task. This is based on the paper which applied neuroscience methods to try to reverse-engineer the CPU.
I think what we have are essentially a bunch of guesses about functionality based on correlations and fairly blunt interventional methods (lesioning), combined with the ideas we’ve come up with about what kinds of algorithms the brain might be running (largely pulling from artificial intelligence for ideas).
It makes a lot of sense to me that the brain does something resembling belief propagation on bayes nets. (I take this to be the core idea of predictive coding.) However:
There are a lot of different algorithms resembling belief prop. Sticking within the big tent of “variational methods”, there are a lot of different variational objectives, which result in different algorithms. The brain could be using a variation which we’re unfamiliar with. This could result in significant differences from backprop. (I’m still fond of Hinton’s analogy between contrastive divergence and dreaming, for example. It’s a bit like saying that dreams are GAN-generated adversarial examples, and the brain trains to anti-learn these examples during the night, which results in improved memory consolidation and conceptual clarity during the day. Isn’t that a nice story?)
There are a lot of graphical models besides Bayesian networks. Many of them are “basically the same”, but for example SPNs (sum-product networks) are very different. There’s a sense in which Bayesian networks assume everything is neatly organized into variables already, while SPNs don’t. Also, SPNs are fundamentally faster, so the convergence step in the paper (the step which makes predictive coding 100x slower than belief prop) becomes fast. So SPNs could be a very reasonable alternative, which might not amount to backprop as we know it.
I think it could easily be that the neocortex is explained by some version of predictive coding, but other important elements of the brain are not. In particular, I think the numerical logic of reinforcement learning isn’t easily and efficiently captured via graphical models. I could be ignorant here, but what I know of attempts to fit RL into a predictive-processing paradigm ended up using multiplicative rewards rather than additive (so, you multiply in the new reward rather than adding), simply because adding up a bunch of stuff isn’t natural in graphical models. I think that’s a sign that it’s not the right paradigm.
Radical Probabilism / Logical Uncertainty / Logical Induction makes it generally seem pretty probable, almost necessary, that there’s also some “non-Bayesian” stuff going on in the brain (ie generalized-bayesian, ie non-bayesian updates). This doesn’t seem well-described by predictive coding. This could easily be enough to ruin the analogy between the brain and backprop.
And finally, reiterating the earlier point: there are other algorithms which are more data-efficient than backprop. If humans appear to be more efficient than backprop, then it seems plausible that humans are using a more data-efficient algorithm.
As for the [predictive coding → backprop] link, well, that’s not a crux for me right now, because I was mainly curious why you think such a link, if true, would be evidence against “the brain uses something other than backprop”. I think I understand why you would think that, now, sans what the mounting evidence is.
I think my main crux is the question: (for some appropriate architecture, ie, not necessarily transformers) do human-brain-sized networks, with human-like opportunities for transfer learning, achieve human-level data-efficiency? If so, I have no objection to the hypothesis that the brain uses something more-or-less equivalent to gradient descent.
Thanks for this reply!
--I thought the paper about the methods of neuroscience applied to computers was cute, and valuable, but I don’t think it’s fair to conclude “methods are not up to the task.” But you later said that “It makes a lot of sense to me that the brain does something resembling belief propagation on bayes nets. (I take this to be the core idea of predictive coding.)” so you aren’t a radical skeptic about what we can know about the brain so maybe we don’t disagree after all.
1 − 3: OK, I think I’ll defer to your expertise on these points.
4, 5: Whoa whoa, just because we humans do some non-bayesian stuff and some better-than-backprop stuff doesn’t mean that the brain isn’t running pure bayes nets or backprop-approximation or whatever at the low level! That extra fancy cool stuff we do could be happening at a higher level of abstraction. Networks in the brain learned via backprop-approximation could themselves be doing the logical induction stuff and the super-efficient-learning stuff. In which case we should expect that big NN’s trained via backprop might also stumble across similar networks which would then do similarly cool stuff.
Indeed. Your crux is my question and my crux is your question. (My crux was: Does the brain, at the low level, use something more or less equivalent to the stuff modern NN’s do at a low level? From this I hoped to decide whether human-brain-sized networks could have human-level efficiency)
How could this address point #5? If GD is slow, then GD would be slow to learn faster learning methods.
All of the following is intended as concrete examples against the pure-bayes-brain hypothesis, not as evidence against the brain doing some form of GD:
One thing the brain could be doing under the hood is some form of RL using value-prop. This is difficult to represent in Bayes nets. The attempts I’ve seen end up making reward multiplicative rather than additive across time, which makes sense because Bayes nets are great at multiplying things but not so great at representing additive structures. I think this is OK (we could regard it as an exponential transform of usual reward) until we want to represent temporal discounting. Another problem with this is: representing via graphical models means representing the full distribution over reward values, rather than a point estimate. But this is inefficient compared with regular tabular RL.
Another thing the brain could be doing under the hood is “memory-network” style reasoning which learns a policy for utilizing various forms of memory (visual working memory, auditory working memory, episodic memory, semantic memory...) for reasoning. Because this is fundamentally about logical uncertainty (being unsure about the outcome at the end of some mental work), it’s not very well-represented by Bayesian models. It probably makes more sense to use (model-free) RL to learn how to use WM.
Of course both of those objections could be overcome with a specific sort of work, showing how to represent the desired algorithm in bayes nets.
As for GD:
My back of the envelope calculation suggests that GPT-3 has trained on 7 orders of magnitude more data than a 10yo has experienced in their lifetime. Of course a different NN architecture (+ different task, different loss functions, etc) could just be that much more efficient than transformers; but overall, this doesn’t look good for the human-GD hypothesis.
Maybe your intention is to argue that we use GD with a really good prior, though! This seems much harder to dismiss.
Where does the gradient come from? Providing a gradient is a difficult problem which requires intelligence.
Even if the raw loss(/reward) function is simple and fixed, it’s difficult to turn that into a gradient for learning, because you don’t know how to attribute punishment/loss to specific outputs (actions or cognitive acts). The dumb method, policy gradient, is highly data-inefficient because it attributes reward/punishment to all recent actions (frequently providing spurious gradients which adjust weights up/down noisily).
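The credit-assignment problem with policy gradient can be seen in a toy setting. The sketch below is purely illustrative and not from the thread: the environment (five binary actions, reward determined only by the first one), the function name reinforce_gradients, and all parameters are hypothetical. It uses the plain REINFORCE estimator with no baseline.

```python
import random

def reinforce_gradients(num_episodes=20000, horizon=5, seed=0):
    """Estimate per-action REINFORCE gradients in a toy episodic task.

    Hypothetical setup: each step picks a binary action with probability
    p = 0.5 (logit 0), and the episode's reward is 1 only if the FIRST
    action was 1. The other actions are irrelevant to reward.
    """
    rng = random.Random(seed)
    p = 0.5  # probability of choosing action 1 under the current policy
    mean_grad = [0.0] * horizon      # average gradient per time step
    mean_abs_grad = [0.0] * horizon  # average size of a single-episode update
    for _ in range(num_episodes):
        actions = [1 if rng.random() < p else 0 for _ in range(horizon)]
        ret = float(actions[0])  # only the first action affects the return
        for t, a in enumerate(actions):
            # REINFORCE: return times d/d(logit) of log pi(a) = (a - p)
            g = ret * (a - p)
            mean_grad[t] += g / num_episodes
            mean_abs_grad[t] += abs(g) / num_episodes
    return mean_grad, mean_abs_grad

mean_grad, mean_abs_grad = reinforce_gradients()
print("average gradient per step:      ", [round(g, 3) for g in mean_grad])
print("average |single-episode update|:", [round(g, 3) for g in mean_abs_grad])
```

Averaged over many episodes, only the first action gets a consistently nonzero gradient, but in any single episode the irrelevant actions receive updates just as large as the relevant one. That per-episode noise is the "spurious gradient" effect, and averaging it away is what costs data.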
But, quite possibly, the raw loss/reward function is not simple/fixed, but rather, requires significant inference itself. An example of this is imprinting.
The last two sub-points only argue “against GD” insofar as you mean to suggest that the brain “just uses GD” (where “just” is doing a lot of work). My claim there is that more learning principles are needed (for example, model-based learning) to understand what’s going on.
Since the brain is difficult to pin down but ML experiments are not, I would think the more natural direction of inference would be to check the scaling laws and see whether it’s plausible that the brain is within the same regime.
Thanks for the great back-and-forth! Did you guys see the first author’s comment? What are the main updates you’ve had re this debate now that it’s been a couple years?
I have not thought about these issues too much in the intervening time. Re-reading the discussion, it sounds plausible to me that the evidence is compatible with roughly brain-sized NNs being roughly as data-efficient as humans. Daniel claims:
I think the human observation-reaction loop is closer to ten times that fast, which results in a 3 OOM difference. This sounds like a gap which is big, but could potentially be explained by architectural differences or other factors, thus preserving a possibility like “human learning is more-or-less gradient descent”. Without articulating the various hypotheses in more detail, this doesn’t seem like strong evidence in any direction.
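The arithmetic behind the 4 OOM versus 3 OOM figures can be sketched roughly as follows. The 10^9 data points and the 4 OOM gap are the figures quoted earlier in the thread; the 30-year lifetime and the implied 10^13 scaling-law prediction are assumptions chosen here to reproduce those figures, not numbers anyone in the thread stated.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def lifetime_data_points(years, loop_seconds):
    """Number of evaluate-and-update steps in a lifetime of the given length."""
    return years * SECONDS_PER_YEAR / loop_seconds

def oom_gap(predicted, observed):
    """Orders of magnitude by which the prediction exceeds the observation."""
    return math.log10(predicted / observed)

YEARS = 30        # assumption: roughly 10^9 seconds of experience
PREDICTED = 1e13  # assumption: implied by "10^9 data points, 4 OOMs short"

points_1s = lifetime_data_points(YEARS, 1.0)    # one-second loop
points_fast = lifetime_data_points(YEARS, 0.1)  # loop ten times faster

gap_1s = oom_gap(PREDICTED, points_1s)
gap_fast = oom_gap(PREDICTED, points_fast)
print(f"1 s loop:   {points_1s:.2e} points, gap of {gap_1s:.1f} OOM")
print(f"0.1 s loop: {points_fast:.2e} points, gap of {gap_fast:.1f} OOM")
```

Under these assumptions a tenfold faster loop shrinks the gap by exactly one order of magnitude, whatever the lifetime length; only the absolute size of the gap depends on the assumed lifetime and predicted data requirement.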
Not before now. I think the comment had a relatively high probability in my world, where we still have a poor idea of what algorithm the brain is running, and a low probability in Daniel’s world, where evidence is zooming in on predictive coding as the correct hypothesis. Some quotes which I think support my hypothesis better than Daniel’s:
This illustrates how we haven’t pinned down the mechanical parts of algorithms. What this means is that speculation about the algorithm of the brain isn’t yet causally grounded—it’s not as if we’ve been looking at what’s going on and can build up a firm abstract picture of the algorithm from there, the way you might successfully infer rules of traffic by watching a bunch of cars. Instead, we have a bunch of different kinds of information at different resolutions, which we are still trying to stitch together into a coherent picture.
This directly addresses the question of how clear-cut things are right now, while also pointing to many concrete problems the predictive coding hypothesis faces. The comment continues on that subject for several more paragraphs.
This paragraph supports my picture that hypotheses about what the brain is doing are still largely being pulled from ML, which speaks against the hypothesis of a growing consensus about what the brain is doing, and also illustrates the lack of direct looking-at-the-brain-and-reporting-what-we-see.
On the other hand, it seems quite plausible that this particular person is especially enthusiastic about analogizing ML algorithms and the brain, since that is what they work on; in which case, this might not be so much evidence about the state of neuroscience as a whole. Some neuroscientist could come in and tell us why all of this stuff is bunk, or perhaps why Predictive Coding is right and all of the other ideas are wrong, or perhaps why the MCMC thing is right and everything else is wrong, etc etc.
But I take it that Daniel isn’t trying to claim that there is a consensus in the field of neuroscience; rather, he’s probably trying to claim that the actual evidence is piling up in favor of predictive coding. I don’t know. Maybe it is. But this particular domain expert doesn’t seem to think so, based on the SSC comment.
This is a funny distinction to me. These things seem like two ends of a spectrum (something like, the physical scale of “one unit of structure”; predictive coding is few-neuron-scale, modules are big-brain-chunk scale; in between, there’s micro-columns, columns, lamina, feedback circuits, relays, fiber bundles; and below predictive coding there’s the rules for dendrite and synapse change).
Are you saying, there’s mounting evidence that predictive coding screens off all lower levels from all higher levels? Like all high-level phenomena are the result of predictive coding, plus an architecture that hooks up bits of predictive coding together?