Hmm, I still might not be following, but I’ll write something anyway. :)
Take some “concept” in your world-model, operationalized as a particular cluster C of neurons in some part of your cortex that tend to activate together.
How might we figure out what C “means”?
One part of the answer is entirely within the cortex world-model: C has particular relationships to other things in the cortex world-model, which in turn have relationships to still other things, etc. Clusters of neurons related to “bird” have some connection to clusters of neurons related to “flying”. That by itself might already be enough to pin down the “meanings” of different things, just because there’s so much structure there, and we can try to match it up with structures in the world, by analogy with unsupervised machine translation. But if not…
The other part of the answer is about how the cortex world-model relates to the real world. Maybe C directly predicts some particular pattern in low-level sensory inputs. Maybe C directly activates some particular pattern in motor output. Or maybe the connection is less direct—a certain abstract pattern in the space of abstract patterns in the space of abstract patterns in the space of low-level sensory inputs, or whatever. If we look at naturalistic visual inputs that directly or indirectly trigger C, and they’re disproportionately pictures of clocks, then that’s some evidence that C “means” clock.
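(To make the “disproportionately pictures of clocks” test concrete, here’s a minimal toy sketch in Python. The cluster_activation function and the labelled stimuli are hypothetical stand-ins for things we can’t actually read off a cortex; the point is only the shape of the inference.)

from collections import Counter

def guess_meaning(cluster_activation, labelled_stimuli, top_k=100):
    """labelled_stimuli: list of (stimulus, label) pairs.
    Rank stimuli by how strongly they trigger cluster C, then see
    which labels dominate among the top activators."""
    ranked = sorted(labelled_stimuli,
                    key=lambda pair: cluster_activation(pair[0]),
                    reverse=True)
    label_counts = Counter(label for _, label in ranked[:top_k])
    return label_counts.most_common(5)  # e.g. [("clock", 61), ("watch", 12), ...]

If the top activators are overwhelmingly clocks, that’s the “some evidence that C ‘means’ clock” part.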
So, how about “cold”? Our body has a couple relevant sensors: peripheral nerves that express TRPM8 (“cold and menthol receptor 1”), hypothalamus neurons that detect blood temperature via TRPV1, etc. (I’m not an expert on the details.) As usual, these sensory signals are processed in two areas in parallel. In the hypothalamus & brainstem (“Steering Subsystem”), they trigger innate reactions like shivering, unpleasant feelings / desire to warm up, and so on. And in the cortex, they’re treated as just so many more channels of unlabeled input data that the world-model needs to predict.
In the course of predicting them well, the world-model invents some slightly-higher-level concept (or family of closely-interlinked concepts) that we call “cold”. And it notices and memorizes predictively-useful relationships between this new “cold” concept and other things in the world-model, e.g. shivering and ice.
I don’t think there’s more to the concept “cold” than the sum total of its associations with every other concept, with sensory input, and with motor output. And we can explain those latter associations via the structure of the world and body in conjunction with a learning algorithm running throughout your life experience.
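(A toy sketch of that last sentence, assuming a crude Hebbian-style co-occurrence rule; all the names and constants below are made up for illustration, not claims about the actual brain.)

import itertools
from collections import defaultdict

weights = defaultdict(float)   # (concept_a, concept_b) -> association strength
LEARNING_RATE = 0.01           # made-up constant

def observe_moment(active_concepts):
    """One 'moment' of life experience: strengthen links between co-active concepts."""
    for a, b in itertools.combinations(sorted(active_concepts), 2):
        weights[(a, b)] += LEARNING_RATE

for _ in range(100):           # many cold moments in which you also shiver
    observe_moment({"cold", "shivering", "snow"})

print(weights[("cold", "shivering")])   # ~1.0: a strong learned association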
You can sorta write code for a relevant part of what’s happening in the mind when e.g. the freezing emotion/sensation is triggered.
I like to draw the distinction between understanding learning algorithms and understanding trained models. The former is kinda like what you learn in an ML course (gradient descent, training data, etc.); the latter is kinda like what you learn in a mechanistic interpretability paper. I don’t think it’s realistic to “write code” for the “cold” concept, because I think it (like all concepts) emerges at the trained model level. It emerges from a learning algorithm, training environment, loss function, etc.
Of course, we can chat about the trained model level to some extent. Why is “cold” associated with shivering? Because in the training environment of life experience, those two things have tended to go together, such that each provides nonzero Bayesian evidence that the other should be active, or will be soon. Ditto with the connection between cold and ice cream, and everything else. So we can chat about it, but it would take forever to directly write code for all those things. Hence the learning algorithm. Does that help?
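(Postscript, to put a rough number on “nonzero Bayesian evidence”: if two concepts co-occur in experience more often than chance, observing one raises the odds of the other. A toy calculation, with made-up counts:)

from math import log2

def evidence_bits(count_both, count_a, count_b, n_moments):
    """Pointwise mutual information: log2 of P(a,b) / (P(a) * P(b))."""
    p_a, p_b = count_a / n_moments, count_b / n_moments
    p_both = count_both / n_moments
    return log2(p_both / (p_a * p_b))

# Say "cold" and "shivering" were each active in 1,000 of 10,000 moments,
# and active together in 800 of them:
print(evidence_bits(800, 1000, 1000, 10_000))   # ≈ 3 bits of evidence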
Thanks for communicating your model well again!
I think we might mostly agree, but let’s clarify.
I agree with all of:
In the course of predicting them well, the world-model invents some slightly-higher-level concept (or family of closely-interlinked concepts) that we call “cold”. And it notices and memorizes predictively-useful relationships between this new “cold” concept and other things in the world-model, e.g. shivering and ice.
I don’t think there’s more to the concept “cold” than the sum total of its associations with every other concept, with sensory input, and with motor output.
I also basically agree with:
I like to draw the distinction between understanding learning algorithms and understanding trained models. The former is kinda like what you learn in an ML course (gradient descent, training data, etc.); the latter is kinda like what you learn in a mechanistic interpretability paper. I don’t think it’s realistic to “write code” for the “cold” concept, because I think it (like all concepts) emerges at the trained model level. It emerges from a learning algorithm, training environment, loss function, etc.
I agree that fully writing code would be quite a daunting task. I think my phrasing of “write code” was not great. But it’s already some reductionist progress if you have something like:
if coldness concept gets more activated:
    increase activation of shivering anticipation
    weakly increase activation of snow concept
    ...
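A slightly more concrete (but still very toy) runnable version of that, in Python; the concepts and weights are made up, and the point is only the shape of the reductionist story:

associations = {
    "cold": {"shivering_anticipation": 0.8, "snow": 0.2, "ice": 0.5},
    "snow": {"cold": 0.6, "white": 0.7},
}

def activate(concept, strength, activations):
    """One step of spreading activation from `concept` to its associates."""
    activations[concept] = activations.get(concept, 0.0) + strength
    for neighbour, weight in associations.get(concept, {}).items():
        activations[neighbour] = activations.get(neighbour, 0.0) + strength * weight
    return activations

print(activate("cold", 1.0, {}))
# {'cold': 1.0, 'shivering_anticipation': 0.8, 'snow': 0.2, 'ice': 0.5}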
I don’t think it’s a worthwhile exercise to get very precise.
An important point I wanted to make here is just that the meaning of “cold” comes from its interactions with other concepts, and there’s no such thing as an inherent, independent meaning of the word “cold”. (So when I hear ‘If we look at naturalistic visual inputs that directly or indirectly trigger C, and they’re disproportionately pictures of clocks, then that’s some evidence that C “means” clock.’ this seems a bit off to me, though not too bad.)
I guess I’d best try to explain why I felt some unease with your initial description of the cold example:
Suppose somebody said:
There’s a certain kind of interoceptive sensory input, consisting of such-and-such signal coming from blah type of thermoreceptor in the peripheral nervous system. Your brain does its usual thing of transforming that sensation into its own “color” of “metaphysical paint” (as in §3.3.2) that forms a concept / property in your conscious awareness and world-model, and you know it by the everyday term “cold”.
On the one hand, I would defend this passage as basically true.
Basically, I think that some people (though a priori not you) would think that something like “I feel cold because the cold thermoreceptors activate the corresponding cold concept” explains their sense of cold. However, if you just take this hypothesis, which is basically “some sensors activate some concept”, without anything else, then the concept would be completely shapeless and uninterpretable, unrelated to anything known.
I now think you probably didn’t mean it in nearly that bad a way, but I’m not sure.
(But some parts of what you write make it seem to me like you have slightly weaker sensors for “how does a hypothesis actually constrain my anticipations / concentrate probability mass?” or “what would this hypothesis predict if I didn’t already know how I perceive it?”, and I do think those sensors are useful.)
(I also think there is some business logic in the hypothalamus or thereabouts for which responses (e.g. shivering) to trigger from significant cold input signals, which would need to be figured out if you want a good model of freezing / feeling uncomfortably cold; but that’s about freezing in particular, not about temperature as a property we model on objects.)