Thanks for the thought-provoking post! Let me try...
We have a visual system, and it (like everything in the neocortex) comes with an interface for “querying” it. Like, Dileep George gives the example “I’m hammering a nail into a wall. Is the nail horizontal or vertical?” You answer that question by constructing a visual model and then querying it. Or more simply, if I ask you a question about what you’re looking at, you attend to something in the visual field and give an answer.
Dileep writes: “An advantage of generative PGMs is that we can train the model once on all the data and then, at test time, decide which variables should act as evidence and which variables should act as targets, obtaining valid answers without retraining the model. Furthermore, we can also decide at test time that some variables fall in neither of the previous two categories (unobserved variables), and the model will use the rules of probability to marginalize them out.” (ref) (I’m quoting that verbatim because I’m not an expert on this stuff and I’m worried I’ll say something wrong. :-P )
Anyway, I would say that the word “I” is generally referring to the goings-on in the global workspace circuits in the brain, which we can think of as hierarchically above the visual system. The workspace can query the visual system, basically by sending a suite of top-down constraints into the visual system PGM (“there’s definitely a vertical line here!” or whatever), allowing the visual system to do its probabilistic inference, and then branching based on the status of some other visual system variable(s).
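To make that querying picture concrete, here's a toy sketch (my own made-up example in Python, not Dileep's actual model): a tiny discrete joint distribution, “trained” once, that answers different test-time queries just by choosing which variables count as evidence and which as targets, with everything else marginalized out.

```python
# Toy sketch of "train once, choose evidence/targets at query time".
# All variables, domains, and probabilities here are made up for illustration.
from itertools import product
from collections import defaultdict

VARS = ["object", "vertical_edge", "shadow"]
DOMAINS = {"object": ["nail", "pencil"], "vertical_edge": [True, False], "shadow": [True, False]}

def joint_prob(assign):
    """Made-up factorization: P(object) * P(vertical_edge | object) * P(shadow | object)."""
    p_object = {"nail": 0.3, "pencil": 0.7}[assign["object"]]
    p_edge = {("nail", True): 0.9, ("nail", False): 0.1,
              ("pencil", True): 0.6, ("pencil", False): 0.4}[(assign["object"], assign["vertical_edge"])]
    p_shadow = {("nail", True): 0.2, ("nail", False): 0.8,
                ("pencil", True): 0.5, ("pencil", False): 0.5}[(assign["object"], assign["shadow"])]
    return p_object * p_edge * p_shadow

def query(target, evidence):
    """P(target | evidence), marginalizing every other variable by brute-force enumeration."""
    scores = defaultdict(float)
    for values in product(*(DOMAINS[v] for v in VARS)):
        assign = dict(zip(VARS, values))
        if all(assign[k] == v for k, v in evidence.items()):
            scores[assign[target]] += joint_prob(assign)
    z = sum(scores.values())
    return {val: p / z for val, p in scores.items()}

# Same model, two different queries, no retraining:
print(query("object", {"vertical_edge": True}))    # infer a hidden cause from an observed feature
print(query("vertical_edge", {"object": "nail"}))  # clamp a "top-down constraint", read off a feature
```

The second call is the “send a top-down constraint in, then read off some other variable” pattern described above.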
So in everyday terms we say “When someone asks me a question that’s most easily answered by visualizing something or thinking visually or attending to what I’m looking at, that’s what I do!” Whereas in fancypants terms we would describe the same thing as “In certain situations, the global workspace circuits have learned (probably via RL) that it is advantageous to query the visual system in certain ways.”
So over the course of our lives, we (=global workspace circuits) learn the operation “figure out whether one thing is darker than another thing” as a specific way to query the visual system. And the checker shadow illusion has the fun property that when we query the visual system this way, it gives the wrong answer. We can still “know” the right answer by inferring it through a path that does not involve querying the visual system. Maybe it goes through abstract knowledge instead. And I guess your #3 (“I can occasionally and briefly get my brain to recognize A and B as the same shade”) probably looks like a really convoluted visual system query that involves forcing a bunch of PGM variables in unusual coordinated ways that prevent the shade-corrector from activating, or something like that.
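To illustrate the “right query, wrong answer” point with made-up numbers (a cartoon, not a model of real visual circuitry): if the shade query returns inferred surface reflectance rather than raw luminance, two squares with identical luminance come back as different shades whenever the context says one of them is in shadow.

```python
# Cartoon of the checker shadow situation, with invented numbers.
def luminance(reflectance, in_shadow):
    # Reflectance in [0, 1]; being in shadow halves the luminance reaching the eye.
    return reflectance * (0.5 if in_shadow else 1.0)

A = {"reflectance": 0.4, "in_shadow": False}  # dark paint, direct light
B = {"reflectance": 0.8, "in_shadow": True}   # light paint, in shadow

lum_A = luminance(**A)  # 0.4
lum_B = luminance(**B)  # 0.4 -> identical raw luminance at the eye

def inferred_reflectance(lum, believed_in_shadow):
    """The "shade-corrector": divide out the illumination the context says is there."""
    return lum / (0.5 if believed_in_shadow else 1.0)

print(lum_A == lum_B)                      # True: the raw shades really are the same
print(inferred_reflectance(lum_A, False))  # 0.4 -> "A is dark"
print(inferred_reflectance(lum_B, True))   # 0.8 -> "B is light"
# The learned "is one darker than the other?" query reports the reflectance
# estimates (different), not the raw luminances (identical) -- hence the illusion.
```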
What about fixing the mistake? Well, I think the global workspace has basically no control over how the visual system PGM is wired up internally, partly because the visual system has its own learning algorithm that minimizes prediction error rather than maximizing reward, and partly for the simpler reason that (I think) it gets locked down and stops learning at a pretty young age. The global workspace can learn new queries, but it might be that there just isn’t any way to query the wired-up adult visual system to return the information you want (a raw shade comparison). Or maybe with more practice you can get better at your convoluted #3 query...
Not sure about all this...
Anyway, I would say that the word “I” is generally referring to the goings-on in the global workspace circuits in the brain, which we can think of as hierarchically above the visual system. The workspace can query the visual system, basically by sending a suite of top-down constraints into the visual system PGM (“there’s definitely a vertical line here!” or whatever), allowing the visual system to do its probabilistic inference, and then branching based on the status of some other visual system variable(s).
Why is a query represented as an overconfident false belief?
How would you query low-level details from a high-level node? Don’t the hierarchically high-up nodes represent things which range over longer distances in space/time, eliding low-level details like lines?
How would you query low-level details from a high-level node? Don’t the hierarchically high-up nodes represent things which range over longer distances in space/time, eliding low-level details like lines?
My explanation would be: it’s not a strict hierarchy; there are plenty of connections from the top to the bottom (or at least near-bottom): “Feedforward and feedback projections between regions typically connect to multiple levels of the hierarchy” and “It has been estimated that 40% of all possible region-to-region connections actually exist which is much larger than a pure hierarchy would suggest.” (ref) (I’ve heard it elsewhere too.) Also, we need to do compression (throw out information) to get from raw input to top-level, but I think a lot of that compression is accomplished by only attending to one “object” at a time, rapidly flitting from one to another. I’m not sure how far that gets you, but I think it’s at least part of the story, in that it reduces the need to throw out low-level details. Another thing is saccades: maybe you can’t make high-level predictions about literally every cortical column in V1, but if you can access a subset of columns, then saccades can fill in the gaps.
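Just to put a rough number on “much larger than a pure hierarchy would suggest”: a strict tree over N regions uses only N − 1 of the N(N − 1)/2 possible region-to-region connections. With N = 30 (a number I’m picking purely for illustration, not a claim about the actual count of cortical regions), that’s under 7%, versus the quoted ~40%.

```python
# Back-of-the-envelope comparison; N = 30 is an arbitrary illustrative choice.
N = 30
tree_edges = N - 1                  # a strict hierarchy (tree) has N-1 region-to-region links
possible_pairs = N * (N - 1) // 2   # all possible region-to-region connections
print(tree_edges / possible_pairs)  # ~0.067, i.e. ~7%, versus the quoted ~40%
```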
Why is a query represented as an overconfident false belief?
I have pretty high confidence that “visual imagination” is accessing the same world-model database and machinery as “parsing a visual scene” (and likewise “imagining a sound” vs “parsing a sound”, etc.) I find it hard to imagine any alternative to that. Like it doesn’t seem plausible that we have two copies of this giant data structure and machinery and somehow keep them synchronized. And introspectively, it does seem to be true that there’s some competition where it’s hard to simultaneously imagine a sound while processing incoming sounds etc.—I mean, it’s always hard to do two things at once, but this seems especially hard.
So then the question is: how can you imagine seeing something that isn’t there, without the imagination being overruled by bottom-up sensory input? I guess there has to be some kind of mechanism that allows this, like a mechanism by which top-level processing can choose to prevent (a subset of) sensory input from having its usual strong influence on (a subset of) the network. I don’t know what that mechanism is.
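Purely as an illustration of what such a mechanism could look like (I’m not claiming this is how the brain actually does it): if bottom-up evidence enters the posterior with an adjustable gain, then turning the gain down for a subset of the input lets the top-down “imagined” prior win.

```python
import math

# Illustrative sketch only -- as I said, the real mechanism is unknown to me.
# Treat "imagining a feature" as a strong top-down prior, and let top-level
# processing scale down the gain on the bottom-up evidence against it.
def posterior(prior, likelihood_ratio, sensory_gain):
    """P(feature | input), with the evidence's log-weight scaled by sensory_gain in [0, 1]."""
    log_odds = math.log(prior / (1 - prior)) + sensory_gain * math.log(likelihood_ratio)
    return 1 / (1 + math.exp(-log_odds))

prior_imagined = 0.9      # top-down: "imagine that the feature is present"
evidence_against = 0.05   # bottom-up likelihood ratio: the feature isn't actually there

print(posterior(prior_imagined, evidence_against, sensory_gain=1.0))  # ~0.31: sensory input overrules the imagination
print(posterior(prior_imagined, evidence_against, sensory_gain=0.0))  # 0.90: the imagined content survives
```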
I have pretty high confidence that “visual imagination” is accessing the same world-model database and machinery as “parsing a visual scene” (and likewise “imagining a sound” vs “parsing a sound”, etc.)
Update: Oops! I just learned that what I said there is kinda wrong.
What I should have said was: the machinery / database used for “visual imagination” is a subset of the machinery / database used for “parsing a visual scene”.
…But it’s a strict subset. Low-level visual processing is all about taking the massive flood of incoming retinal data and distilling it into a more manageable subspace of patterns, and that low-level machinery is not useful for visual imagination. See: visual mental imagery engages the left fusiform gyrus, but not the occipital lobe.
(To be clear, the occipital lobe is not involved in imagery at inference time; it’s obviously involved when the left fusiform gyrus is first learning its vocabulary of visual patterns.)
I don’t think that affects anything else in the conversation, just wanted to set the record straight. :)
I don’t significantly disagree, but I feel uneasy about a few points.
Theories of the sort I take you to be gesturing at often emphasize this nice aspect of their theory, that bottom-up attention (ie attention due to interesting stimulus) can be more or less captured by surprise, IE, local facts about the shifts in probabilities.
I agree that this seems to be a very good correlate of attention. However, the surprise itself wouldn’t seem to be the attention.
Surprise points merit extra computation. In terms of belief prop, it’s useful to prioritize the messages which are creating the biggest belief shifts. The brain is parallel, so you might think all messages get propagated regardless, but of course, the brain also likes to conserve resources. So, it makes sense that there’d be a mechanism for prioritizing messages.
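For concreteness, “prioritize the messages which are creating the biggest belief shifts” looks roughly like residual-style belief propagation, sketched below on a toy loopy graph with made-up potentials (an analogy, not a claim about neural implementation).

```python
import heapq

# Residual-style scheduling for belief propagation on a tiny loopy graph:
# the message whose update would shift beliefs the most gets sent first.
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2), (0, 2)]                        # a small loop
unary = {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.2, 0.8]}   # made-up local evidence
def pairwise(x, y):                                     # neighbors prefer to agree
    return 2.0 if x == y else 1.0

directed = edges + [(j, i) for i, j in edges]
messages = {e: [0.5, 0.5] for e in directed}            # start uniform

def neighbors(i):
    return [j for (a, j) in directed if a == i]

def new_message(i, j):
    """Recompute the message i -> j from the current messages arriving at i."""
    out = []
    for xj in (0, 1):
        total = 0.0
        for xi in (0, 1):
            incoming = 1.0
            for k in neighbors(i):
                if k != j:
                    incoming *= messages[(k, i)][xi]
            total += unary[i][xi] * pairwise(xi, xj) * incoming
        out.append(total)
    z = sum(out)
    return [v / z for v in out]

def residual(i, j):
    """How much would beliefs shift if we sent this message right now?"""
    return max(abs(a - b) for a, b in zip(new_message(i, j), messages[(i, j)]))

heap = [(-residual(i, j), (i, j)) for (i, j) in directed]   # max-heap by residual
heapq.heapify(heap)

for _ in range(30):                       # a budget of prioritized updates
    _, (i, j) = heapq.heappop(heap)       # biggest pending belief shift wins
    messages[(i, j)] = new_message(i, j)
    for k in neighbors(j):                # messages out of j are now stale
        if k != i:
            heapq.heappush(heap, (-residual(j, k), (j, k)))

belief_1 = [unary[1][x] * messages[(0, 1)][x] * messages[(2, 1)][x] for x in (0, 1)]
z = sum(belief_1)
print([b / z for b in belief_1])          # approximate marginal at node 1
```

(Stale heap entries are tolerated in this sketch; popping one just re-sends a message that barely changes anything.)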
Yet, message prioritization (I believe) does not account adequately for our experience.
There seems to be an additional mechanism which places surprising content into the global workspace (at least, if we want to phrase this in global workspace theory).
What if we don’t like global workspace theory?
Another idea that I think about here is: the brain’s “natural grammar” might be a head grammar. This is the fancy linguistics thing which sort of corresponds to the intuitive concept of “the key word in that sentence”. Parsing consists not only of grouping words together hierarchically into trees, but furthermore, whenever words are grouped, promoting one of them to be the “head” of that phrase.
In terms of a visual hierarchy, this would mean “some low level details float to the top”.
This would potentially explain why we can “see low-level detail” even if we think the rest of the brain primarily consumes the upper layers of the visual hierarchy. We can focus on individual leaves, even while seeing the whole tree as a tree, because we re-parse the tree to make that leaf the “head”. We see a leaf with a tree attached.
Maybe.
Without a mechanism like this, we could end up somewhat trapped in the high-level descriptions of what we see, leaving artists unable to invent perspective drawings, and so on.
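Here's the head-promotion idea as a toy data structure (illustration only, nothing to do with actual cortical wiring): each grouping designates one child as its head, the head's label percolates upward, and re-heading lets a low-level leaf surface at the top of the parse.

```python
# Toy head-grammar-style parse tree. Each internal node picks one child as its
# "head"; the head's label percolates upward, so re-heading makes a low-level
# detail visible at the top of the hierarchy.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    head_index: int = 0                  # which child is the head (ignored for leaves)

    def head_label(self) -> str:
        """The label that "floats to the top" of this subtree."""
        if not self.children:
            return self.label
        return self.children[self.head_index].head_label()

# "the old dog barked", with the verb phrase as head of the sentence:
np = Node("NP", [Node("the"), Node("old"), Node("dog")], head_index=2)   # head = "dog"
vp = Node("VP", [Node("barked")], head_index=0)
sentence = Node("S", [np, vp], head_index=1)
print(sentence.head_label())   # "barked"

# Re-head toward the noun phrase: now a low-level leaf tops the tree --
# like focusing on one leaf while still seeing the whole tree.
sentence.head_index = 0
print(sentence.head_label())   # "dog"
```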
bottom-up attention (ie attention due to interesting stimulus) can be more or less captured by surprise
Hmm. That’s not something I would have said.
I guess I think of two ways that sensory inputs can impact top-level processing.
First, I think sensory inputs impact top-level processing when top-level processing tries to make a prediction that is (directly or indirectly) falsified by the sensory input, and that prediction gets rejected, and top-level processing is forced to think a different thought instead.
If top-level processing is “paying close attention to some aspect X of sensory input”, then that involves “making very specific predictions about aspect X of sensory input”, and therefore the predictions are going to keep getting falsified unless they’re almost exactly tracking the moment-to-moment status of X.
On the opposite extreme, if top-level processing is “totally zoning out”, then that involves “not making any predictions whatsoever about sensory input”, and therefore no matter what the sensory input is, top-level processing can carry on doing what it’s doing.
In between those two extremes, we get the situation where top-level processing is making a pretty generic high-level prediction about sensory input, like “there’s confetti on the stage”. If the confetti suddenly disappeared altogether, it would falsify the top-level hypothesis, triggering a search for a new model, and being “noticed”. But if the detailed configuration of the confetti changes—and it certainly will—it’s still compatible with the top-level prediction “there’s confetti on the stage” being true, and so top-level processing can carry on doing what it’s doing without interruption.
So just to be explicit, I think you can have a lot of low-level surprise without it impacting top-level processing. In the confetti example, down in low-level V1, the cortical columns are constantly being surprised by the detailed way that each piece of confetti jiggles around as it falls, I think, but we don’t notice if we’re not paying top-down attention.
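Here are those three cases in toy form (my own illustration, nothing more): treat the top-level “prediction” as a predicate over the detailed sensory state, and count how often a stream of changing detail falsifies it.

```python
import random
random.seed(0)

# Toy illustration: a top-level "prediction" is a predicate over the detailed
# sensory state. The more specific the predicate, the more often the changing
# details falsify it and force top-level processing to update.
def sensory_stream(steps=1000):
    """Confetti on stage: present the whole time, but the detail keeps changing."""
    for _ in range(steps):
        yield {"confetti_present": True,
               "confetti_pattern": random.randrange(10_000)}   # detailed configuration

first_pattern = 1234   # the detail predicted by a very specific, non-updating prediction

predictions = {
    "close attention (predicts the exact pattern)": lambda s: s["confetti_pattern"] == first_pattern,
    "generic ('there is confetti on the stage')":   lambda s: s["confetti_present"],
    "zoning out (predicts nothing)":                lambda s: True,
}

for name, predicted in predictions.items():
    falsified = sum(1 for s in sensory_stream() if not predicted(s))
    print(f"{name}: falsified {falsified}/1000 times")
```

The specific predictor is falsified on essentially every step unless it tracks the detail moment to moment; the generic and zoned-out predictors are never falsified, so top-level processing is never interrupted.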
The second way that I think sensory inputs can impact top-level processing is by a very different route, something like sensory input → amygdala → hypothalamus → top-level processing. (I’m not sure of all the details and I’m leaving some things out; more HERE.) I think this route is kind of an autonomous subsystem, in the sense that top-down processing can’t just tell it what to do, it isn’t trained on the same reward signal as top-level processing, and its information can flow in a way that totally bypasses top-level processing. The amygdala is trained (by supervised learning) to activate upon detecting things that have immediately preceded feelings of excitement / fear / etc. earlier in life, and the hypothalamus is running some hardcoded innate algorithm, I think. (Again, more HERE.) When this route activates, a chain of events forces top-level processing to start paying attention to the corresponding sensory input (i.e. to start issuing very specific predictions about it).
I guess it’s possible that there are other mechanisms besides these two, but I can’t immediately think of anything that these two mechanisms (or something like them) can’t explain.
What if we don’t like global workspace theory?
I dunno, I for one like global workspace theory. I called it “top-level processing” in this comment to be inclusive to other possibilities :)