Hierarchical predictive coding is interesting, but I have some misgivings that it does a good job explaining what we see of brain function, because brains seem to have really dramatic attention mechanisms.
By “attention” I don’t mean to imply much similarity to attention mechanisms in current machine learning. I partly mean that not all our cortex is going at full blast all the time—instead, activity is modulated dynamically, and this interacts in a very finely tuned way with the short-term stored state of high-level representations. It seems like there are adaptations in the real-time dynamics of the brain that are finely selected to do interesting and complicated things that I don’t understand well, rather than those dynamics faithfully implementing an algorithm that we think of as happening in one step.

Not super sure about all this, though.
I’m not sure that this is an argument against predictive coding, because e.g. Surfing Uncertainty talks a lot about how attention fits together with predictive coding and how it involves dynamic modulation of activity.
In the book’s model, attention corresponds to “precision-weighting of prediction error”. An example might be navigating a cluttered room in dim versus bright lighting. If the room is dark and you can’t see much, you may be sensing your way around with your hands or trying to remember where everything is so you don’t run into things. Your attention is on your sense of touch, or your memory of the room. On the other hand, if you can see clearly, then you are probably mainly paying attention to your sense of sight, since that lets you see directly where everything is.
Another way of putting this is that if the room is very dark, the sensory data generated by your vision has low precision (low confidence): it is not very useful for generating predictions of where everything is. Your sense of touch, as well as your previous memories, have higher precision than your sense of vision does. As a result, signals coming from the more useful sources are weighted more highly when attempting to model your environment; this is subjectively experienced as your attention being on your hands and your previous recollections of the room. Conversely, when it gets brighter, information from your eyes now has higher precision, so it will be weighted more strongly.
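To make the precision-weighting idea concrete, here’s a minimal sketch of Bayesian cue combination (a toy example, not anything from the book): each sense gives a noisy estimate of where something is, and the combined estimate weights each cue by its precision (inverse variance). Dimming the lights just means raising the variance on the visual cue, which automatically shifts the weight toward touch and memory.

```python
import numpy as np

def combine_cues(estimates, variances):
    """Precision-weighted (inverse-variance) fusion of noisy estimates."""
    precisions = 1.0 / np.asarray(variances)
    weights = precisions / precisions.sum()
    return np.dot(weights, estimates), weights

# Hypothetical estimates of a chair's position (in meters) from three sources.
estimates = np.array([2.0, 2.4, 2.3])   # vision, touch, memory

# Bright room: vision is reliable (low variance), so it dominates.
_, w_bright = combine_cues(estimates, variances=[0.01, 0.25, 0.5])
# Dark room: vision is unreliable (high variance), so touch/memory dominate.
_, w_dark = combine_cues(estimates, variances=[4.0, 0.25, 0.5])

print("bright-room weights (vision, touch, memory):", np.round(w_bright, 2))
print("dark-room weights   (vision, touch, memory):", np.round(w_dark, 2))
```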
Some relevant excerpts from the book:
If we look for them, most of us can find shifting face-forms hidden among the clouds. We can see the forms of insects hidden in the patterned wallpaper or of snakes nestling among the colourful swirls of a carpet. Such effects need not imply the ingestion of mind-altering substances. Minds like ours are already experts at self-alteration. When we look for our car keys on the cluttered desk, we somehow alter our perceptual processing to help isolate the target item from the rest. Indeed, spotting the (actual) car keys and ‘spotting’ the (non-existent) faces, snakes, and insects are probably not all that different, at least as far as the form of the underlying processing is concerned. Such spottings reflect our abilities not just to alter our action routines (e.g., our visual scan paths) but also to modify the details of our own perceptual processing so as better to extract signal from noise. Such modifications look to play a truly major role in the tuning (both long- and short-term) of the on-board probabilistic prediction machine that underpins our contact with the world. The present chapter explores the space and nature of such online modifications, discusses their relations with familiar notions such as attention and expectation, and displays a possible mechanism (the ‘precision-weighting’ of prediction error) that may be implicated in a wide range of signal-enhancement effects. [...]
The perceptual problems that confront us in daily life vary greatly in the demands they make upon us. For many tasks, it is best to deploy large amounts of prior knowledge, using that knowledge to drive complex proactive patterns of gaze fixation, while for others it may be better to sit back and let the world do as much of the driving as possible. Which strategy (more heavily input-driven or more heavily expectation-driven) is best is also hostage to a multitude of contextual effects. Driving along a very familiar road in heavy fog, it can sometimes be wise to let detailed top-down knowledge play a substantial role. Driving fast along an unfamiliar winding mountain road, we need to let sensory input take the lead. How is a probabilistic prediction machine to cope?
It copes, PP suggests, by continuously estimating and re-estimating its own sensory uncertainty. Within the PP framework, these estimations of sensory uncertainty modify the impact of sensory prediction error. This, in essence, is the predictive processing model of attention. Attention, thus construed, is a means of variably balancing the potent interactions between top-down and bottom-up influences by factoring in their so-called ‘precision’, where this is a measure of their estimated certainty or reliability (inverse variance, for the statistically savvy). This is achieved by altering the weighting (the gain or ‘volume’, to use a common analogy) on the error units accordingly. The upshot of this is to ‘control the relative influence of prior expectations at different levels’ (Friston, 2009, p. 299). Greater precision means less uncertainty and is reflected in a higher gain on the relevant error units (see Friston, 2005, 2010; Friston et al., 2009). Attention, if this is correct, is simply a means by which certain error unit responses are given increased weight, hence becoming more apt to drive response, learning, and (as we shall later see) action. More generally, this means the precise mix of top-down and bottom-up influence is not static or fixed. Instead, the weight given to sensory prediction error is varied according to how reliable (how noisy, certain, or uncertain) the signal is taken to be.
We can illustrate this using our earlier example. Visual input, in the fog, will be estimated to offer a noisy and unreliable guide to the state of the distal realm. Other things being equal visual input should, on a bright day, offer a much better signal, such that any residual error should be taken very seriously indeed. But the strategy clearly needs to be much more finely tuned than that suggests. Thus suppose the fog (as so often happens) briefly clears from one small patch of the visual scene. Then we should be driven to sample preferentially from that smaller zone, as that is now a source of high-precision prediction errors. This is a complex business, since the evidence for the presence of that small zone (right there!) comes only from the (initially low-weighted) sensory input itself. There is no fatal problem here, but the case is worth describing carefully. First, there is now some low-weighted surprise emerging relative to my best current take on the visual situation (which was something like ‘in uniformly heavy fog’). Aspects of the input (in the clear zone) are not unfolding as that take (that model) predicted. However, my fog-model includes general expectations concerning occasional clear patches. Under such conditions, I can further reduce overall prediction error by swopping to the ‘fog plus clear patch’ model. This model incorporates a new set of precision predictions, allowing me to trust the fine-grained prediction errors computed for the clear zone (only). That small zone is now the estimated source of high-precision prediction errors of the kind the visual system can trust to recruit clear reliable percepts. High-precision prediction errors from the clear zone may then rapidly warrant the recruitment of a new model capable of describing some salient aspects of the local environment (watch out for that tractor!).
Such, in microcosm, is the role PP assigns to sensory attention: ‘Attention can be viewed as a selective sampling of sensory data that have high-precision (signal to noise) in relation to the model’s predictions’ (Feldman & Friston, 2010, p. 17). This means that we are constantly engaged in attempts to predict precision, that is, to predict the context-varying reliability of our own sensory prediction error, and that we probe the world accordingly. This kind of ‘predicted-precision based’ probing and sampling also underlies (as we will see in Part II) the PP account of gross motor activity. For the present, the point to notice is that in this noisy and ambiguous world, we need to know when and where to take sensory prediction error seriously, and (more generally) how best to balance top-down expectation and bottom-up sensory input. That means knowing when, where, and how far, to trust specific prediction error signals to select and nuance the model that is guiding our behaviour.
An important upshot is that the knowledge that makes human perception possible concerns not only the layered causal structure of the (action-salient—more on that later) distal world but the nature and context-varying reliability of our own sensory contact with that world. Such knowledge must form part and parcel of the overall generative model. For that model must come to predict both the shape and multiscale dynamics of the impinging sensory signal and the context-variable reliability of the signal itself (see Figure 2.2). The familiar idea of ‘attention’ now falls into place as naming the various ways in which predictions of precision tune and impact sensory sampling, allowing us (when things are working as they should) to be driven by the signal while ignoring the noise. By actively sampling where we expect (relative to some task) the best signal to noise ratio, we ensure that the information upon which we perceive and act is fit for purpose.
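The “gain on the error units” idea in these excerpts can be written as a very small update rule. Here is a toy, single-level sketch (my simplification, not the book’s actual equations): a belief about one hidden variable is nudged by the sensory prediction error and by the error relative to a prior, each multiplied by its estimated precision, so lowering sensory precision is the “foggy” regime and raising it is the “clear” regime.

```python
def settle_belief(observation, prior_mean, pi_sensory, pi_prior,
                  steps=200, lr=0.05):
    """Toy single-level predictive-coding update.

    The belief mu is adjusted by precision-weighted prediction errors:
    a sensory error (observation - mu) and a prior error (prior_mean - mu).
    Precision acts as a gain on each error term.
    """
    mu = prior_mean
    for _ in range(steps):
        sensory_error = observation - mu
        prior_error = prior_mean - mu
        mu += lr * (pi_sensory * sensory_error + pi_prior * prior_error)
    return mu

# Hypothetical 1-D example: the prior says the road curves at 100 m,
# while the (noisy) visual input suggests 80 m.
print(settle_belief(80.0, 100.0, pi_sensory=0.1, pi_prior=1.0))   # fog: stays near the prior
print(settle_belief(80.0, 100.0, pi_sensory=10.0, pi_prior=1.0))  # clear: tracks the input
```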
Yeah, I generally liked that discussion, with a few nitpicks. For example, I dislike the word “precision”, because I think what’s actually involved is confidence levels attached to predictions of boolean variables (presence or absence of a feature), rather than variances attached to predictions of real numbers. (I think this for various reasons, including trying to think through particular examples, and my vague understanding of the associated neural mechanisms.)
I would state the fog example kinda differently: There are lots of generative models trying to fit the incoming data, and the “I only see fog” model is currently active, but the “I see fog plus a patch of clear road” model is floating in the background ready to jump in and match to data as soon as there’s data that it’s good at explaining.
I mean, “I am looking at fog” is actually a very specific prediction about visual input—fog has a specific appearance—so the “I am looking at fog” model is falsified (prediction error) by a clear patch. A better example of “low confidence about visual inputs” would be whatever generative models are active when you’re very deep in thought or otherwise totally spaced out, ignoring your surroundings.
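One way to cash out “a background model jumps in once there’s data it explains well” is a toy model competition (the two candidate models and all the numbers below are made up for illustration): score each candidate generative model against the incoming observations and let whichever has the highest posterior be the active one. While everything looks uniformly grey, “fog only” wins; as soon as a high-contrast patch shows up, “fog plus clear patch” takes over.

```python
import numpy as np

def log_gaussian(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2 * np.pi))

# Two made-up generative models for a scalar "image contrast" signal:
#   "fog only"          -> expects uniformly low contrast
#   "fog + clear patch" -> expects occasional high-contrast regions
models = {
    "fog only":          dict(mean=0.1, std=0.05, log_prior=np.log(0.9)),
    "fog + clear patch": dict(mean=0.6, std=0.3,  log_prior=np.log(0.1)),
}

# Contrast readings over time: fog, fog, fog, then a clear patch appears.
observations = [0.10, 0.12, 0.09, 0.55, 0.60]

log_post = {name: m["log_prior"] for name, m in models.items()}
for t, obs in enumerate(observations):
    for name, m in models.items():
        log_post[name] += log_gaussian(obs, m["mean"], m["std"])
    active = max(log_post, key=log_post.get)
    print(f"t={t}: obs={obs:.2f} -> active model: {active}")
```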
The way I think of “activity is modulated dynamically” is:
We’re searching through a space of generative models for the model that best fits the data and leads to the highest reward. The naive strategy would be to execute all the models and see which one wins the competition. Unfortunately, the space of all possible models is too vast for that strategy to work. At any given time, only a subset of that vast space is accessible, and only the models in that subset are able to enter the competition. Which subset that is can be modulated by context, prior expectations (“you said this cloud is supposed to look like a dog, right?”), etc. I think (vaguely) that there are region-to-region connections within the brain that can be turned on and off, and that different models require different configurations of that plumbing in order to fully express themselves. If there’s a strong enough hint that some generative model is promising, that model will flex its muscles and fully actualize itself by creating the appropriate plumbing (region-to-region communication channels) to be properly active and able to flow down predictions.

Or something like that… :-)
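As a rough sketch of how I picture that gating (all the model names, channels, and numbers here are hypothetical): each candidate model declares which region-to-region “channels” it needs, the current context determines which channels are enabled, and only models whose plumbing is fully available can enter the competition at all.

```python
# Hypothetical sketch of context-gated competition between generative models.
# Each model needs certain region-to-region "channels"; only models whose
# channels are currently enabled can enter the competition.

candidate_models = {
    "cloud-as-blob": {"channels": {"V1->V2"},            "fit": 0.4},
    "cloud-as-dog":  {"channels": {"V1->V2", "V2->IT"},  "fit": 0.7},
    "cloud-as-face": {"channels": {"V1->V2", "V2->FFA"}, "fit": 0.6},
}

def compete(enabled_channels):
    """Return the best-fitting model among those whose plumbing is available."""
    eligible = {name: m for name, m in candidate_models.items()
                if m["channels"] <= enabled_channels}
    return max(eligible, key=lambda name: eligible[name]["fit"])

# Default context: only the low-level channel is enabled.
print(compete({"V1->V2"}))                        # -> cloud-as-blob

# Prior expectation ("this cloud is supposed to look like a dog") enables
# the extra channel that the dog model needs to express itself.
print(compete({"V1->V2", "V2->IT"}))              # -> cloud-as-dog
```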
It’s connecting this sort of “good models get themselves expressed” layer of abstraction to neurons that’s the hard part :) I think future breakthroughs in training RNNs will be a big aid to imagination.
Right now when I pattern-match what you say onto ANN architectures, I can imagine something like making an RNN from a scale-free network and trying to tune less-connected nodes around different weightings of more-connected nodes. But I expect that in the future, I’ll have much better building blocks for imagining.
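For what it’s worth, a crude way to mock that up (purely an illustrative sketch, with arbitrary sizes and thresholds): draw a scale-free graph, use it as the connectivity mask for a recurrent weight matrix, call the highest-degree nodes “hubs”, and mark only the weights touching low-degree nodes as trainable, so that learning tunes the periphery around a fixed hub core.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)

# Scale-free connectivity via preferential attachment (Barabási–Albert).
n_nodes = 50
graph = nx.barabasi_albert_graph(n_nodes, m=2, seed=0)
mask = nx.to_numpy_array(graph)           # 1 where a recurrent connection exists

# Recurrent weights restricted to the scale-free skeleton.
weights = mask * rng.normal(scale=0.1, size=(n_nodes, n_nodes))

# Call the top 10% highest-degree nodes "hubs"; everything else is periphery.
degrees = mask.sum(axis=1)
hub = degrees >= np.quantile(degrees, 0.9)

# Only weights with at least one peripheral endpoint are trainable,
# i.e. learning adjusts the less-connected nodes around the fixed hub core.
trainable = mask * ~(hub[:, None] & hub[None, :])

print("connections:", int(mask.sum()) // 2,
      "| hub nodes:", int(hub.sum()),
      "| trainable weights:", int(trainable.sum()))
```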
In case it helps, my main aids-to-imagination right now are the sequence memory / CHMM story (see my comment here) and Dileep George’s PGM-based vision model and his related follow-up papers like this, plus miscellaneous random other stuff.
What do you mean by “an algorithm that we think of as happening in one step”?
I think of it as analysis-by-synthesis, a.k.a. “search through a space of generative models for one that matches the data”. The search process doesn’t have to be fast, let alone one step—the splotchy pictures in this post are a good example: you stare at them for a while until they snap into place. Right? Or sorry if I’m misunderstanding your point.
I’m saying the abstraction of (e.g.) CNNs as doing their forward pass all in one timestep does not apply to the brain. So I think we agree and I just wasn’t too clear.
For CNNs we don’t worry about top-down control intervening in the middle of a forward pass, and to the extent that engineers might increase chip efficiency by running different operations simultaneously, we usually want to ensure that they can’t interfere with each other, maintaining the layer of abstraction. But the human visual cortex probably violates these assumptions not just out of necessity, but because doing so brings advantages.
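To make the “not one timestep” point concrete, here is a minimal toy of my own (a linear generative model, nothing brain-realistic): instead of computing the percept in a single feedforward pass, inference is an iterative settling process that keeps reducing prediction error against a generative model, so top-down information can intervene partway through in a way that has no analogue inside a single CNN forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear generative model: observation x is generated from latent z as x = W @ z.
n_obs, n_latent = 20, 5
W = rng.normal(size=(n_obs, n_latent))
z_true = rng.normal(size=n_latent)
x = W @ z_true + 0.05 * rng.normal(size=n_obs)

# Analysis-by-synthesis: iteratively adjust z to reduce prediction error,
# rather than computing the percept in one feedforward step.
z = np.zeros(n_latent)          # initial guess (could come from a prior)
lr = 0.01
for step in range(300):
    prediction_error = x - W @ z
    z += lr * W.T @ prediction_error      # gradient step on squared error
    # Top-down influence could modify z or the prior *here*, mid-settling,
    # which has no analogue inside a single CNN forward pass.

print("reconstruction error:", round(float(np.linalg.norm(x - W @ z)), 3))
print("latent recovery error:", round(float(np.linalg.norm(z - z_true)), 3))
```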