There are endless fractal-like depths of complexity in rocks, and there are endless fractal-like depths of complexity in ants, and there are endless fractal-like depths of complexity in birdsong, and there are endless fractal-like depths of complexity in the shape of trees, etc. So “follow the gradient where you’re learning new things” by itself seems wildly under-constrained.
There are not ‘endless fractal-like depths of complexity’ in the retinal images of rocks or ants or trees, which is what is actually relevant here. For any model, a flat uniform-color wall has near-zero compressible complexity, as does a picture of noise (max entropy, but it’s not learnable). Real-world images have learnable complexity which crucially varies based on both the image and the model’s current knowledge. But it’s never “endless”: generally it’s going to be on the order of, or less than, the image entropy the cortex gets from the retina, which is comparable to what modern codecs achieve.
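To make the “bounded, codec-comparable” point concrete, here is a toy sketch that uses PNG-compressed size as a crude upper bound on image entropy (a sketch only; it assumes numpy and PIL are available, and note that compressed size bounds entropy but says nothing about learnability, which is exactly the point about noise):

```python
import io
import numpy as np
from PIL import Image

def compressed_kb(arr: np.ndarray) -> float:
    """PNG-compressed size in kB: a crude upper bound on the image's entropy."""
    buf = io.BytesIO()
    Image.fromarray(arr).save(buf, format="PNG")
    return len(buf.getvalue()) / 1024

h, w = 256, 256
flat  = np.full((h, w, 3), 128, dtype=np.uint8)               # uniform wall: ~zero complexity
noise = np.random.randint(0, 256, (h, w, 3), dtype=np.uint8)   # max entropy, nothing learnable

print(f"flat wall: {compressed_kb(flat):6.1f} kB")
print(f"noise:     {compressed_kb(noise):6.1f} kB")
# A natural photo of rocks/ants/trees lands in between: structured, but finitely compressible.
```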
You cite video-game ML papers,
Actually in this thread I cited neurosci papers: first, that curiosity/info-value is a reward processed like hunger[1], and then a review article from 2020[2], which is an update of a similar 2015 paper[3].
So “follow the gradient where you’re learning new things” by itself seems wildly under-constrained.
Sure—but curiosity/info-gain obviously isn’t all of reward, so the various other components can also steer behavior paths in fitness-relevant directions, which can then indirectly bias the trajectory of the curiosity-driven learning, since info-gain is always relative to the model’s current knowledge and thus to the experience trajectory.
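To pin down the structure of that claim, a minimal sketch (the component names and weighting are purely illustrative, not a claim about the brain’s actual reward breakdown):

```python
def total_reward(innate_components: dict[str, float],
                 info_gain: float,
                 curiosity_weight: float = 0.1) -> float:
    """Curiosity/info-gain as just one additive term alongside hunger, pain, social reward, etc.

    The innate terms bias which states/trajectories get visited at all; how much info_gain
    any observation then yields depends on the model's current knowledge along that trajectory.
    """
    return sum(innate_components.values()) + curiosity_weight * info_gain

# e.g. a hungry infant near a caregiver: the food/social terms dominate the path taken,
# and curiosity only differentiates among the observations encountered along that path.
r = total_reward({"food": 1.0, "pain": -0.2, "social": 0.3}, info_gain=0.5)
```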
Therefore, I am skeptical of any curiosity / novelty drive completely divorced from “hardcoded” drives that induce disproportionate curiosity / interest in certain specific types of things (e.g. human speech sounds) over other things (e.g. the detailed coloration of pebbles).
Recall that the infant is mostly driven by subcortical structures and innate patterns for the early years, and during all this time it is absolutely bombarded with human speech as the primary consistent complex audio signal. There may be some attentional bias towards human speech, but it may not be necessary, as there aren’t many other audio streams that could come close to competing. Birdsong is both less interesting and far less pervasive/frequent in most children’s audio experience. Also, ‘visually interesting pebbles’ don’t seem that different from other early children’s toys: it seems children would find them interesting (although their shape is typically boring).
I strongly disagree with the idea that SC and cortex are doing similar things.
I didn’t say they did—I said I’m aware of two proposals for an innate learning bias for faces: CPGs pretraining the viscortex in the womb, and innate attentional bias circuits in the SC. These are effectively doing a similar thing.
We need ground truth somehow, and my claim is that SC provides it. So my mainline expectation is that SC gets visual information in a way that bypasses the cortex altogether.
For attentional bias/shaping the SC likely can only support very simple pattern biases close to a linear readout. So a simple bias to attend to faces seems possible, but I was actually talking about sexual attraction when I said:
But clearly there must be some innate visual shaping for at least sexual attraction—so evidence that SC drives that would also be good evidence it drives some landscape biasing, for example. But it seems like those reward shapers would need to be looking primarily at higher visual cortex rather than V1.
For sexual attraction the patterns are just too complex, so they are represented in IT or similar higher visual cortex. Any innate circuit that references human body shape images and computes their sexual attraction must get that input from higher viscortex—which then leads to the whole symbol grounding problem—as you point out, and I naturally agree.
But regardless of the specific solution to the symbol grounding problem, a consequence of that solution is that the putative brain region computing attraction value of images of humanoid shapes would need to compute that primarily from higher viscortex/IT input.
I think your model may be something like: the SC computes an attentional bias which encodes all the innate sexiness geometry, and thus guides us to spend more time saccading to sexy images rather than others, and possibly also outputs some reward info for this.
But that could not work as stated, simply because the sexiness concept is very complex and requires a deepnet to compute (symmetry, fat content, various feature ratios, etc).
Also this must be true, because otherwise we wouldn’t see the failures of sexual imprinting in birds that we do in fact observe.
So how can the genome best specify a complex concept innately using the least number of bits? Just indexing neurons in the learned cortex directly would be bit-minimal, but as you point out that isn’t robust.
However the topographic organization of cortex can help, as it naturally clusters neurons semantically.
Another way to ‘locate’ specific learned neurons more robustly is through proxy matching, where you have dumb, simple humanoid-shape detectors, symmetry detectors, etc. encoding a simple visual sexiness concept—which could potentially be in the SC. Then, during some critical window, the firing patterns of those proxy circuits are used to locate the matching visual concept in visual cortex and connect to that. In other words, you can use the simple innate proxy circuit to indirectly locate a cluster of neurons in cortex, simply based on firing-pattern correlation.
This allows the genome to link to a high-complexity concept by specifying a low-complexity proxy match for that concept during the organism’s earlier, low-complexity ‘larval’ stage.
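As a toy sketch of that matching step, taking the firing-pattern-correlation idea literally (all names and sizes are illustrative, not a claim about the actual circuit):

```python
import numpy as np

def proxy_match(proxy_trace: np.ndarray, cortex_traces: np.ndarray, k: int = 10) -> np.ndarray:
    """Locate the learned cortical units whose firing best matches a simple innate proxy.

    proxy_trace:   shape (T,), proxy-circuit output over T stimuli in the critical window
    cortex_traces: shape (T, N), activations of N learned cortical units on the same stimuli
    Returns indices of the k units most correlated with the proxy -- the cluster the genome
    can now wire downstream circuits to, without ever having specified it directly.
    """
    z = lambda x: (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)    # z-score over stimuli
    corr = z(cortex_traces).T @ z(proxy_trace) / len(proxy_trace)  # (N,) correlations
    return np.argsort(-corr)[:k]

# During the critical window: record both traces, pick the best-matching units, then hardwire
# downstream consumers (reward, attraction, etc.) to read from those units.
```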
Proxy matching implies that after critical-period training, whatever neurons represent innate sexiness must then shift to get their input from higher viscortex: IT rather than just V1, and certainly not LGN.
Another related possibility is that the SC is just used to create the initial bootstrapping signal, and then some other brain region actually establishes the connection between the innate downstream dependencies of sexiness and the learned sexiness concept—so the sexiness proxy is separated out from the sexiness concept that actually gets used.
Anyway my point was more that innate sexual attraction must be encoded somewhere, and any evidence that the SC is crucially involved with that is evidence it is crucially involved with other innate visual bias/shaping.
Thanks again for taking the time to chat, I am finding this super helpful in understanding where you’re coming from.
Your description of proxy matching is a close match to what I’m thinking. (Sorry if I’ve been describing it poorly!)
I think I got confused because I’m mapping it to neuroanatomy differently than you. I think SC is the “proxy” part, but the “matching” part is somewhere else, not SC. For example, it might look something like this:
1. SC calculates a proxy, based on LGN→SC inputs. The output of this calculation is a signal, which I’ll call “Fits proxy?”
2. The “Fits proxy?” signal then gets sent (indirectly) to a certain part of the amygdala, where it’s used as a “ground truth” for supervised learning.
3. This part of the amygdala builds a trained model. The input to the trained model is (mostly-high-level) visual information, especially from IT. The output of the trained model is some signal, which I’ll call “Fits model?”
4. The SC’s “Fits proxy?” signal and the amygdala’s “Fits model?” signal both go down to the hypothalamus and brainstem, possibly hitting the very same neurons that trigger specific innate reactions.
5. Optionally, the trained model in the amygdala could stop updating itself after a critical period early in life.
6. Also optionally, as the animal gets older, the “Fits proxy?” signal could have less and less influence on those innate reaction neurons in the hypothalamus & brainstem, while the “Fits model?” signal would have more and more influence.
(This is one example; there are a bunch of other variations on this theme, including ones where you replace “part of the amygdala” with other parts of the forebrain like nucleus accumbens shell or lateral septum, and also where the proxy is coming from other places besides SC.)
(This is a generalization of “calculating correlations”. If the amygdala trained model is only one “layer” in the deep learning sense, then it would be just calculating linear correlations between IT signals and the proxy, I think. My best guess is that the amygdala is learning a two-layer feedforward model (more or less), so a bit more complicated than linear correlations, although low confidence on that.)
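To pin down the moving parts, here is a toy sketch of the whole loop—the fixed proxy, a roughly two-layer supervised model trained against it, and the step-6 hand-off—where the sizes, the stand-in detector, and the age-mixing schedule are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed "Fits proxy?": a hardcoded function of low-level (LGN-like) input, zero learned parameters.
def fits_proxy(lgn_input: np.ndarray) -> float:
    return float(lgn_input[:16].mean() > 0.5)          # stand-in for a crude innate detector

# Trained "Fits model?": two-layer net over high-level (IT-like) features, supervised by the proxy.
W1 = rng.normal(0, 0.1, (64, 32)); b1 = np.zeros(32)
W2 = rng.normal(0, 0.1, 32);       b2 = 0.0

def fits_model(it_features: np.ndarray) -> float:
    h = np.tanh(it_features @ W1 + b1)
    return float(1 / (1 + np.exp(-(h @ W2 + b2))))

def train_step(it_features: np.ndarray, proxy_label: float, lr: float = 0.05) -> None:
    """One supervised update: nudge the model's output toward the proxy's verdict."""
    global W1, b1, W2, b2
    h = np.tanh(it_features @ W1 + b1)
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))
    err = p - proxy_label                               # d(cross-entropy)/d(logit)
    dh = err * W2 * (1 - h ** 2)
    W2 = W2 - lr * err * h;  b2 = b2 - lr * err
    W1 = W1 - lr * np.outer(it_features, dh); b1 = b1 - lr * dh

def reaction_drive(lgn_input: np.ndarray, it_features: np.ndarray, age: float) -> float:
    """Item 6: the innate reaction is driven by a mix that shifts from proxy to model with age."""
    w = min(age, 1.0)
    return (1 - w) * fits_proxy(lgn_input) + w * fits_model(it_features)
```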
Again, since the trained model is in the amygdala, not SC, there’s no need to “shift” the SC’s inputs to IT. That’s why I was confused by what you wrote. :)
Hey thanks for explaining this—makes sense to me and I think we are mostly in agreement. Using the proxy signal as a supervised learning target to recognize the learned target pattern in IT is a straightforward way to implement the matching, but probably not quite complete in practice. I suspect you also need to combine that with some strong priors to correctly carve out the target concept.
Consider the equivalent example of trying to train a highly accurate cat image detector given a dataset containing, say, 20% cats, plus a crappy low-complexity proxy cat detector to provide the labels. Can you really bootstrap a better discriminative model that way, with non-trivial proxy label noise? I suspect that the key to making this work is using the powerful generative model of the cortex as a regularizer: you train the detector to recognize images that the proxy labels as cats and that are also close to the generative model’s data manifold. If you then reoptimize (in evolutionary time) the proxy detector to leverage that, I think it makes the problem much more tractable. The generative model allows you to make the learned model far more selective around the actual data manifold, increasing robustness. In very simple, vague terms, the learned model would then be computing the combination of high proxy probability and low distance to the data manifold of examples from the critical training set.
Later, if you test OoD on vague non-cats (dogs, stuffed animals) not encountered in training that would confuse the simple proxy, the learned model can reject them—even though it never saw them during critical training—simply because they are far from the generative manifold, and the learned model is ‘shrunk’ to fit that manifold.
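A minimal sketch of that combination, standing in for “distance to the data manifold” with an autoencoder’s reconstruction error (the encode/decode functions, the temperature, and the gating form are all assumptions made for illustration):

```python
import numpy as np

def learned_detector(x: np.ndarray, proxy_prob: float, encode, decode, tau: float = 0.1) -> float:
    """Combine the noisy innate proxy with closeness to the generative model's manifold.

    proxy_prob:    the crude proxy detector's probability that x is a 'cat'
    encode/decode: the cortex-like generative model; reconstruction error stands in for
                   distance from the data manifold seen during the critical period.
    Inputs far from the manifold (stuffed animals, OoD oddities) get rejected even when
    the proxy fires, because the learned concept is 'shrunk' onto the manifold.
    """
    recon_err = float(np.mean((x - decode(encode(x))) ** 2))
    on_manifold = np.exp(-recon_err / tau)               # ~1 near the manifold, ~0 far away
    return proxy_prob * on_manifold
```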
I do agree the amygdala seems like a good fit for the location of the learned symbol circuit, although at that point it raises the question: why not also just have the proxy in the amygdala? If the amygdala has the required inputs from LGN and/or V1, my guess is that it could also just colocate the innate proxy circuit. (I haven’t looked in the lit to see whether those connections exist.)
Also, step 6 seems required for the system to work as well in adulthood as it typically does, and yet also to explain the out-of-distribution failures for imprinting etc. (Once the IT representation is learned you want to use it exclusively, as it should be strictly superior to the proxy circuit. This seems a little weird at first.)
The hope is that this same mechanism, which seems well suited for handling imprinting, also works for grounding sexual attraction (as an elaboration of imprinting), and then more complex concepts like representations of others’ emotions from facial-expression, vocal-tone, etc. proxies, and then combining that with empathic simulation to ground a model of others’ values/utility for social game theory, altruism, etc.
The hope is that this same mechanism, which seems well suited for handling imprinting, also works for grounding sexual attraction (as an elaboration of imprinting), and then more complex concepts like representations of others’ emotions from facial-expression, vocal-tone, etc. proxies, and then combining that with empathic simulation to ground a model of others’ values/utility for social game theory, altruism, etc.
Yes, that is my hope too! And the main thing I’m working on most days is trying to flesh out the details.
I do agree the amygdala seems like a good fit for the location of the learned symbol circuit, although at that point it raises the question: why not also just have the proxy in the amygdala? If the amygdala has the required inputs from LGN and/or V1, my guess is that it could also just colocate the innate proxy circuit. (I haven’t looked in the lit to see whether those connections exist.)
For example, I claim that all the vision-related inputs to the amygdala have at some point passed through at least one locally-random filter stage (cf. “pattern separation” in the neuro literature or “compressed sensing” in the DSP literature). That’s perfectly fine if the amygdala is just going to use those inputs as feedstock for an SL model. SL models don’t need to know a priori which input neuron is representing which object-level pattern, because they’re going to learn the connections, so if there’s some randomness involved, it’s fine. But the randomness would be a very big problem if the amygdala needs to use those input signals to calculate a ground-truth proxy.
As another example, a ground-truth proxy requires zero adjustable parameters (because how would you adjust them?), whereas a learning algorithm does well with as many adjustable parameters as possible, more or less.
So I see these as very different algorithmic tasks—so different that I would expect them to wind up in different parts of the brain, just on general principles.
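A quick illustration of that asymmetry (a sketch with made-up numbers: the “random wiring” is a column permutation, the “proxy” is a deliberately silly fixed detector, and the learner is just least squares):

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 100
R = rng.permutation(np.eye(dim))          # random wiring stage: scrambles which line carries which feature

def fixed_proxy(x: np.ndarray) -> float:
    """Hardcoded detector: 'fires if the first 10 input lines are active.'
    Meaningful on raw input, but it has zero adjustable parameters, so it
    cannot adapt to scrambled wiring."""
    return float(x[:10].mean())

X = rng.normal(size=(500, dim))                       # raw inputs
y = (X[:, :10].mean(axis=1) > 0).astype(float)        # ground truth defined on the raw input
Xr = X @ R                                            # what the downstream area actually receives

# A supervised learner doesn't care which wire is which: it fits weights to whatever it gets.
w, *_ = np.linalg.lstsq(Xr, y - y.mean(), rcond=None)

proxy_scores = np.array([fixed_proxy(x) for x in Xr])
print("fixed proxy vs label (scrambled input):", round(float(np.corrcoef(proxy_scores, y)[0, 1]), 2))  # near chance
print("learned readout vs label:              ", round(float(np.corrcoef(Xr @ w, y)[0, 1]), 2))        # still high
```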
The amygdala is a hodgepodge grouping of nuclei, some of which are “really” (embryologically & evolutionarily) part of the cortex, and the rest of which are “really” part of the striatum (ref). So if we’re going to say that the cortex and striatum are dedicated to running within-lifetime learning algorithms (which I do say), then we should expect the amygdala to be in that same category too.
By contrast, SC is in the brainstem, and if you go far enough back, SC is supposedly a cousin of the part of the pre-vertebrate (e.g. amphioxus) nervous system that implements a simple “escape circuit” by triggering swimming when it detects a shadow—in other words, a part of the brain that triggers an innate reaction based on a “hardcoded” type of pattern in visual input. So it would make sense to say that the SC is still more-or-less doing those same types of calculations.
[1] Shared striatal activity in decisions to satisfy curiosity and hunger at the risk of electric shocks
[2] Systems neuroscience of curiosity
[3] The psychology and neuroscience of curiosity