Interestingly, I think you put a lot more importance on this “synthetic data” than I do. I want to call the “synthetic data” thing “a neat trick that maybe speeds up development by a few days or something”. If retinal waves are more important than that, I’d be inclined to think that they have some other role instead of (or in addition to) “synthetic data”.
It seems to me that retinal waves just don’t carry all that many bits of information, not enough for them to be really spiritually similar to the ML notion of “pretraining a model”, or to explain why current ML models need so much more data. I mean, compare the information content of retinal waves versus the information content of an adult human brain, or even an adult rat brain. It has to be a teeny tiny fraction of a percent.
If I were talking about the role of evolution, I would talk instead about how evolution designed the learning algorithm, and inference algorithm, and neural architecture, and hyperparameter settings (some of which vary dynamically with where-we-are-in-development and with moment-by-moment arousal level, etc.), and especially the reward function. And then I would throw “oh and evolution also designed synthetic data for pretraining” into a footnote or something. :-P
I’m not sure exactly why we disagree on this, or how to resolve it. You’re obviously more knowledgeable about the details here, I dunno.
Yeah, it’s a tricky situation for me. The thesis that spontaneous activity is important is very central to my research, so I have a lot of incentives to believe in it. And I’m also exposed to a lot of evidence in its favor. We should probably swap roles (I should argue against the importance and you for it) to debias. In case you’re ever interested in trying that out (or in having an adversarial collaboration about this topic), let me know :)
But to sketch out my beliefs a bit further:
I believe that spontaneous activity is quite rich in information. Direct evidence for that comes from this study from 2011, where they find that the statistics of spontaneous activity and stimulus-evoked activity are quite similar and get more similar over development. Indirect evidence comes from modeling studies from our lab showing that cortical maps and the fine-scale organization of synapses can be set up through spontaneous activity/retinal waves alone. Other labs have shown that retinal waves can set up long-range connectivity within the visual cortex, and that they can produce Gabor receptive fields and even receptive fields with more complex invariant properties. And beyond the visual cortex, I’m currently working on a project where we set up the circuitry for multisensory integration with only spontaneous activity.
I believe that the cortex essentially just does some form of gradient descent/backpropagation in canonical neural circuits that updates internal models. (The subcortex might be different.) I define “gradient descent” generously as “any procedure that uses or approximates the gradient of a loss function as the central component to reduce loss”. All the complications stem from the fact that a biological neural net is not great at accurately propagating the error signal backward, so evolution came up with a ton of tricks & hacks to make it work anyhow (see this paper from UCL & DeepMind for some ideas on how exactly; I’ll also put a toy illustration of one such trick below, after my two reasons). I have two main reasons to believe this:
Gradient descent is pretty easy to implement with neurons, and at the same time so general that, just on a complexity prior, it’s a strong candidate for any solution that a meta-optimizer like evolution might come up with. Anything more complicated would not work as robustly across all relevant domains.
In conjunction with what I believe about spontaneous activity inducing very strong & informative priors, I don’t think there is any need for anything more complicated than gradient descent. At least I don’t intuitively see the necessity of more optimized learning algorithms (except to maybe squeeze out a few more percentage points of performance).
I notice that there are a lot fewer green links in the second point, which also nicely indicates my relative level of certainty about that compared to the first point.
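To make “gradient descent, generously defined” a little more concrete, here is a minimal numpy sketch of one trick from that line of work, feedback alignment: the backward pass sends the error through a fixed random matrix instead of the transposed forward weights, so the update only approximates the true gradient, but the loss still goes down. Everything here (network size, task, learning rate) is an arbitrary toy choice, just to give a flavor of “uses or approximates the gradient”.

```python
# Toy two-layer regression network trained with feedback alignment:
# the backward pass sends the error through a fixed random matrix B
# instead of W2.T, so the W1 update only approximates the true gradient.
# All sizes, the task, and the learning rate are arbitrary toy choices.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 20, 64, 5

W1 = rng.normal(scale=0.1, size=(n_hid, n_in))
W2 = rng.normal(scale=0.1, size=(n_out, n_hid))
B = rng.normal(scale=0.1, size=(n_hid, n_out))  # fixed random feedback weights

W_true = rng.normal(size=(n_out, n_in))  # the mapping we want to learn
lr = 0.01

for step in range(2000):
    x = rng.normal(size=(n_in, 32))   # a batch of random "inputs"
    y = W_true @ x                    # corresponding "sensory" targets
    h = np.tanh(W1 @ x)               # forward pass
    y_hat = W2 @ h
    e = y_hat - y                     # output error

    dW2 = e @ h.T / x.shape[1]            # exact local gradient for W2
    delta_h = (B @ e) * (1.0 - h**2)      # B stands in for W2.T
    dW1 = delta_h @ x.T / x.shape[1]      # approximate gradient for W1

    W2 -= lr * dW2
    W1 -= lr * dW1
    if step % 500 == 0:
        print(step, float(np.mean(e**2)))  # mean squared error keeps dropping
```

Obviously real cortex would be doing something far messier than this, but it illustrates how a circuit can reduce a loss without propagating the exact gradient backward.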
Thanks! Oh it’s fine, we can just have a normal discussion. :) Just let me know if I’m insulting your work or stressing you out. :-)
I believe that spontaneous activity is quite rich in information.
Sure. But real-world sensory data is quite rich in information too. I guess my question is: What’s the evidence that the spontaneous activity / “synthetic data” (e.g. retinal waves) is doing things that stimulated activity / “actual data” (e.g. naturalistic visual scenes) can’t do by itself? E.g. “the statistics of spontaneous activity and stimulus-evoked activity are quite similar and get more similar over development” seems to be evidence against the importance of the data being synthetic, because it suggests that actual data would also work equally well. So that would be the “shave a few days off development” story.
In conjunction with what I believe about spontaneous activity inducing very strong & informative priors …
The brain (well, cortex & cerebellum, not so much the brainstem or hypothalamus) does “online learning”, so my “prior” keeps getting better and better. Right now I’m ~1,200,000,000 seconds old, and if I see some visual stimulus right now, the “prior” that I use to process that visual stimulus is informed by everything that I’ve learned in the previous 1,199,999,999 seconds of life, oh plus the previous 21,000,000 seconds in the womb (including retinal waves), plus whatever “prior” you think was hardcoded by the genome (e.g. cortico-cortical connections between certain pairs of regions are more likely to form than between other pairs, just because they’re close together and/or heavily seeded at birth with random connections).
Anyway, the point is, I’m not sure if you’re personally doing this, but I do sometimes see a tendency to conflate “prior” with “genetically-hardcoded information”, especially within the predictive processing literature, and I’m trying to push back on that. I agree with the generic idea that “priors are very important” but that doesn’t immediately imply that the things your cortex learns in 10 days (or whatever) of retinal waves are fundamentally different from and more important than the things your cortex learns in the subsequent 10 days of open-eye naturalistic visual stimulation. I think it’s just always true that the first 10 days of data are the prior for the 11th day, and the first 11 days of data are the prior for the 12th day, and the first 12 days of data … etc. etc. And in any particular case, that prior data may be composed of exogenous data vs synthetic data vs some combination of both, but whatever, it’s all going into the same prior either way.
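Just to make the “it’s all going into the same prior either way” point concrete, here’s a toy sketch (entirely made up by me, not from any literature): a little Beta-Bernoulli belief updated one observation at a time, where yesterday’s posterior is literally today’s prior, and the update rule is identical whether we label a given observation “synthetic” or “real”.

```python
# Toy illustration of "yesterday's posterior is today's prior":
# a Beta-Bernoulli belief about how often some feature appears,
# updated one observation at a time. The update does not care whether
# an observation was generated internally ("synthetic") or externally
# ("real"); both just move the same running pseudo-counts.
from dataclasses import dataclass

@dataclass
class RunningPrior:
    alpha: float = 1.0  # pseudo-count for "feature present"
    beta: float = 1.0   # pseudo-count for "feature absent"

    def update(self, observation: int) -> None:
        # observation is 1 or 0; its source is irrelevant to the math
        if observation:
            self.alpha += 1
        else:
            self.beta += 1

    def mean(self) -> float:
        # current best guess, i.e. the "prior" applied to the next observation
        return self.alpha / (self.alpha + self.beta)

prior = RunningPrior()
synthetic_phase = [1, 1, 0, 1]   # e.g. retinal-wave-like input before eye opening
real_phase = [1, 0, 1, 1, 1, 0]  # e.g. naturalistic input after eye opening

for obs in synthetic_phase + real_phase:
    prior.update(obs)

print(prior.mean())  # the prior the 11th observation would be judged against
```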
What’s the evidence that the spontaneous activity / “synthetic data” (e.g. retinal waves) is doing things that stimulated activity / “actual data” (e.g. naturalistic visual scenes) can’t do by itself?
I don’t think direct evidence for this exists. Tbf, this would be a very difficult experiment to run (you’d have to replace retinal waves with real data and the retina really wants to generate retinal waves).
But the principled argument that sways me the most is that “real” input is external: its statistics don’t really care about the developmental state of the animal. Spontaneous activity, on the other hand, changes with development and can (presumably) provide the most “useful” type of input for refining the circuit at each stage (as in something like progressive learning). This last step is conjecture and could be investigated with computational models (train the first layer with very coarse retinal waves, the second layer with more refined retinal waves, etc., and see how well the final model performs compared with one trained on an equal number of natural images). I might run that experiment at some point in the future. Any predictions?
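To be concrete, the skeleton I have in mind is roughly the following (a placeholder sketch rather than a worked-out protocol; the wave generator, the simple Hebbian layers, and all the numbers are stand-ins I made up for illustration):

```python
# A rough skeleton of the proposed comparison (my framing, nothing standard):
# (a) train a stack of simple Hebbian layers one layer at a time on
#     progressively finer "retinal-wave-like" input, versus
# (b) train the same stack on an equal number of "natural-like" samples,
# then compare how useful the learned features are downstream.
# The wave generator, layer sizes, and learning rule are all placeholders.
import numpy as np
from numpy.fft import fft2, ifft2

rng = np.random.default_rng(1)
SIZE = 16  # 16x16 input patches

def wave_like(n, smoothness):
    """Smoothed random fields as a crude stand-in for retinal waves.
    Larger `smoothness` means coarser spatial structure."""
    freqs = np.sqrt(np.add.outer(np.fft.fftfreq(SIZE)**2, np.fft.fftfreq(SIZE)**2))
    lowpass = np.exp(-(freqs * SIZE * smoothness)**2)
    noise = rng.normal(size=(n, SIZE, SIZE))
    imgs = np.real(ifft2(fft2(noise, axes=(1, 2)) * lowpass, axes=(1, 2)))
    return imgs.reshape(n, -1)

def hebbian_layer(X, n_units, lr=1e-3, epochs=5):
    """One layer of Hebbian units with Oja-style weight normalization."""
    W = rng.normal(scale=0.1, size=(n_units, X.shape[1]))
    for _ in range(epochs):
        for x in X:
            y = W @ x
            W += lr * (np.outer(y, x) - (y**2)[:, None] * W)
    return W

# (a) progressive schedule: coarse waves for layer 1, finer waves for layer 2
W1 = hebbian_layer(wave_like(500, smoothness=2.0), n_units=32)
W2 = hebbian_layer(np.tanh(wave_like(500, smoothness=0.5) @ W1.T), n_units=16)

# (b) control: same training budget, but fine-grained input throughout
#     (fine noise here is only a placeholder for real natural image patches)
X_nat = wave_like(1000, smoothness=0.1)
W1_c = hebbian_layer(X_nat[:500], n_units=32)
W2_c = hebbian_layer(np.tanh(X_nat[500:] @ W1_c.T), n_units=16)

# A real version would now compare the two feature stacks on a held-out task,
# e.g. linear decoding of object identity from natural images.
print(W2.shape, W2_c.shape)
```

The interesting part would of course be the downstream comparison at the end, which I’ve only gestured at here.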
a tendency to conflate “prior” with “genetically-hardcoded information”, especially within the predictive processing literature, and I’m trying to push back on that
Hmm, so I agree with the general point that you’re making that “priors are not set in stone” and the whole point is to update on them with sensory data and everything. But I think it’s not fair to treat all seconds of life as equally influential/important for learning. There is a lot of literature demonstrating that the cortex is less plastic during adulthood compared to development. There is also the big difference that during development the location & shape of dendrites and axons change depending on activity, while in adulthood things are a lot more rigid. Any input provided early on will have a disproportionate impact. The classic theory that there are critical periods of plasticity during development is probably too strong (given the right conditions/pharmacological interventions, the adult brain can become very plastic again), but still, there is something special about development.
I’m not sure if that’s the point that people in predictive coding are making or if they are just ignorant that lifelong plasticity is a thing.
it’s not fair to treat all seconds of life as equally influential/important for learning
I agree and didn’t mean to imply otherwise.
In terms of what we’re discussing here, I think it’s worth noting that there’s a big overlap between “sensitive windows in such-and-such part of the cortex” and “the time period when the data is external not synthetic”.
Any predictions?
I dunno….
O’Reilly (1,2) simulated visual cortex development, and found that their learning algorithm flailed around and didn’t learn anything, unless they set it up to learn the “where” pathway first (with the “what” pathway disconnected), and only connected up the “what” pathway after the “where” pathway training had converged to a good model. (And they say there’s biological evidence for this.) (They didn’t have any retinal waves, just “real” data.)
As that example illustrates, there’s always a risk that a randomly-initialized model won’t converge to a good model upon training, thanks to a bad draw of the random seed. I imagine that there are various “tricks” that reduce the odds of this problem occurring, e.g. by making the loss landscape less bumpy, or something vaguely analogous to that. O’Reilly’s “carefully choreographed (and region-dependent) learning rates” is one such trick. I’m very open-minded to the possibility that “carefully choreographed synthetic data” is another such trick.
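To spell out the kind of “trick” I have in mind, here’s a schematic that I invented for illustration (not O’Reilly’s actual setup): a staged training schedule in which the “what” pathway stays frozen until the “where” pathway has converged, and each region gets its own stage-dependent learning rate. The train_step function is just a stub standing in for whatever the real learning update would be.

```python
# Schematic of a "choreographed" training schedule (invented for illustration,
# not O'Reilly's actual model): the "what" pathway is frozen (lr = 0) until the
# "where" pathway's loss has converged; learning rates depend on region + stage.
import random

LR_SCHEDULE = {
    # (region, stage) -> learning rate; the values are arbitrary placeholders
    ("where", "early"): 1e-2,
    ("where", "late"): 1e-3,
    ("what", "early"): 0.0,   # frozen, i.e. effectively disconnected
    ("what", "late"): 1e-2,
}

def train_step(region: str, lr: float) -> float:
    """Stub standing in for one real learning update; returns a fake loss."""
    return random.random() * (1.0 if lr > 0 else 10.0)

def converged(losses, window=50, tol=0.05):
    """Crude check: has the windowed average loss stopped changing?"""
    if len(losses) < 2 * window:
        return False
    recent = sum(losses[-window:]) / window
    earlier = sum(losses[-2 * window:-window]) / window
    return abs(recent - earlier) < tol

stage = "early"
where_losses = []
for step in range(10_000):
    where_losses.append(train_step("where", LR_SCHEDULE[("where", stage)]))
    train_step("what", LR_SCHEDULE[("what", stage)])  # does nothing useful while frozen
    if stage == "early" and converged(where_losses):
        stage = "late"  # "connect up" the what pathway only now
print("finished in stage:", stage)
```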
Anyway, I don’t particularly object to the idea “synthetic data is useful, and plausibly if you take an existing organism and remove its synthetic data it would get messed up”. I was objecting instead to the idea “synthetic data is a major difference between the performance of brains and deep RL, and thus maybe with the right synthetic data pre-training, deep RL would perform as well as brains”. I think the overwhelming majority of the training of human brains involves real data—newborns don’t have object permanence or language or conceptual reasoning or anything like that, and presumably they build all those things out of a diet of actual, not synthetic, data. And even if you think that brains and deep RL both use gradient descent as their learning algorithm, the inference algorithm is clearly different (e.g. brains use analysis-by-synthesis), and the architectures are clearly different (e.g. brains are full of pairs of neurons where each projects to the other, whereas deep neural nets almost never have that). These are two fundamental differences that persist for the entire lifetime / duration of training, unlike synthetic data, which only appears near the start. Also, the ML community has explored things like deep neural net weight initialization and curriculum learning plenty; I would just be very surprised if massive transformative performance improvements (like a big fraction of the difference between where we are and AGI) could come out of those kinds of investigation, as opposed to coming out of different architectures and learning algorithms and training data.
Thanks!