Off the top of my head, here are some new things it adds:
1. You have 3 ways of avoiding prediction error: updating your models, changing your perception, acting on the world. Those are always in play and you often do all three in some combination (see my model of confirmation bias in action).
2. Action is key, and it shapes and is shaped by perception. The map you build of any territory is prioritized and driven by the things you can act on most effectively. You don’t just learn “what is out there” but “what can I do with it”.
3. You care about prediction over the lifetime scale, so there’s an explore/exploit tradeoff between potentially acquiring better models and sticking with the old ones.
4. Prediction goes from the abstract to the detailed. You perceive specifics in a way that aligns with your general model, rarely in contradiction.
5. Updating always goes from the detailed to the abstract. It explains Kuhn’s paradigm shifts but for everything — you don’t change your general theory and then update the details, you accumulate error in the details and then the general theory switches all at once to slot them into place.
6. In general, your underlying models are a distribution but perception is always unified, whatever your leading model is. So when perception changes it does so abruptly.
7. Attention is driven in a Bayesian way, to the places that are most likely to confirm/disconfirm your leading hypothesis, balancing the accuracy of perceiving the attended detail correctly and the leverage of that detail to your overall picture.
8. Emotions through the lens of PP.
9. Identity through the lens of PP.
10. The above is fractal, applying at all levels from a small subconscious module to a community of people.
FYI Jacobian, very high in the review-request-thread is a post on neural annealing. I think many people would be interested in reading your review of that post.
(Thank you very much for this review as well :D )
PP is not one thing. This makes it very difficult for me to say what I don’t like about it, since no one element seems to be necessarily present in all the different versions. What follows are some remarks about specific ideas I’ve seen associated with PP, many of them contradictory. Do let me know which ideas you endorse / don’t endorse.
It is also possible that each of my points is based on a particular misconception about PP. While I’ve made some effort to be well-informed about PP, I have not spent so much time on it, so my understanding is definitely shallow.
The three main meanings of PP (each of which is a cluster, containing many many different sub-meanings, as you flesh out the details in different ways):
A theory of perception. If you look PP up on Wikipedia, the term primarily refers to a theory of perceptual processing in which prediction plays a central role, and observations interact with predictions to provide a feedback signal for learning. So, the theory is that perception is fundamentally about minimizing prediction error. I basically believe this theory. So let’s set it aside.
A theory of action. Some people took the idea “the brain minimizes prediction error” and tried to apply it to motor control, too—and to everything else in the brain. I think this kind of made sense as a thing to try (unifying these two things is a worthwhile goal!), but doesn’t go anywhere. I’ll have a lot to say about this. This theory is what I’ll mean when I say PP—it is, in my experience, what rationalists and rationalist-adjacent people primarily mean by “PP”.
A theory of everything. Friston’s free-energy principle. This is not only supposed to apply to the human brain, but also evolution, and essentially any physical system. I have it on good authority that the math in Friston’s papers is full of errors, and no one who has been excited about this (that I’ve seen) has also claimed to understand it.
1. You have 3 ways of avoiding prediction error: updating your models, changing your perception, acting on the world. Those are always in play and you often do all three in some combination (see my model of confirmation bias in action).
The PP theory of perception says that the brain “minimizes prediction error” in the sense that it is always engaged in the business of predicting, and compares the predictions to observations in order to generate feedback. This could be like gradient descent, or like Bayesian updates.
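To make that concrete, here is a minimal sketch (mine, not anything from the PP literature) of error-as-feedback learning: the error on a specific prediction directly adjusts the parameter that produced it. All numbers are illustrative.

```python
def delta_update(weight, x, target, lr=0.1):
    """One step of error-driven learning: the prediction error on this input
    immediately adjusts the weight that produced the prediction (a gradient
    step on squared error). No planning or action selection is involved."""
    prediction = weight * x
    error = target - prediction
    return weight + lr * error * x

w = 0.0
for _ in range(50):
    w = delta_update(w, x=2.0, target=6.0)
print(round(w, 3))  # ~3.0: the prediction error has been driven to ~0
```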
Actively planning to minimize prediction error, or learning policies which minimize prediction error, is a totally different thing which requires different mechanisms.
Consider that minimizing prediction error in the sense required for prediction means making each individual prediction as accurate as possible, which means being totally myopic. An error on a specific prediction means making an adjustment on that specific prediction. The credit assignment problem is easily solved; we know exactly what led to that specific prediction, so we can propagate all the relevant errors and make the necessary adjustments.
On the other hand, with planning and policy learning, there is a nontrivial (indeed, severe) credit assignment problem. We don’t know which outputs lead to which error signals later. Therefore, we need an entirely different learning mechanism. Indeed, as I argued in The Credit Assignment Problem, we basically need a world model in order to assign credit. This makes it very hard to unify the theory of perception with the theory of action, because one needs the other as input!
In any case, why do you want to suppose that humans take actions in a way which minimizes prediction error? I think this is a poor model. There’s the standard “dark room problem” objection: if humans wanted to minimize prediction error, they would like sensory deprivation chambers a whole lot more than they seem to. Instead, humans like to turn on the radio, watch TV, read a book, etc. when they don’t have anything else to do. Simply put, we are curious creatures, who do not like being bored. Yes, we also don’t like too much excitement of the wrong kind, but we are closer to infophilic than infophobic! And this makes sense from an evolutionary perspective. Machine learning has found that reinforcement learning agents do better when you have a basic mechanism to encourage exploration, because it’s easy to under-explore, but hard to properly explore. One way to do this is to actively reinforce prediction error; IE, the agents are actually maximizing prediction error! (as one component of more complicated values, perhaps.)
I’ve seen PP blog posts take this in stride, explaining that it’s important to explore in order to get better at doing things so that you can minimize prediction error later. I’ve seen technical derivations of a “curiosity drive” on this premise. And sure, that’s technically true. But that doesn’t change that you’re postulating a drive which discourages exploration, all things considered, when it’s more probable (based on parallels with RL) that evolution would add a drive to explicitly encourage exploration.
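For concreteness, here is a rough sketch of the kind of exploration bonus I mean, in the style of prediction-error-based intrinsic motivation from the RL literature. The function name and the scaling constant are made up for illustration.

```python
import numpy as np

def curiosity_bonus(predicted_next_obs, actual_next_obs, beta=0.1):
    """Intrinsic reward proportional to prediction error: the agent is
    *rewarded* for being surprised, the opposite sign from 'act so as to
    minimize prediction error'."""
    err = np.mean((np.asarray(predicted_next_obs) - np.asarray(actual_next_obs)) ** 2)
    return beta * float(err)

# The agent then maximizes task reward plus the bonus:
# r_total = r_task + curiosity_bonus(model(s, a), s_next)
print(curiosity_bonus([0.0, 0.0], [1.0, 3.0]))  # 0.5
```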
Perhaps this is part of why one of the most common PP formalisms doesn’t actually propose to minimize prediction error in either of the two above senses (IE, correcting predictions via feedback, or taking actions which reduce future prediction error).
The primary theoretical tool by which PP seeks to explain action is active inference. According to this method, we can select actions by first conditioning on our success, and then sampling actions from that distribution. I sometimes see this justified as a practical way to leverage inference machinery to make decisions. We can judge that on its pragmatic merits. (I think it’s not common to use it purely to get the job done—techniques such as reinforcement learning mostly work better.) Other times, I’ve heard it associated with the idea that people can’t conceive of failure (particularly true failure of core values), or with other forms of wishful thinking.
My first complaint is that this is usually not different enough from standard Bayesian decision theory to account for the biases it purports to predict. For example, to plan to avoid death, you have to start with a realistic world-model which includes all the ways you could die, and then condition on not dying, and then sample actions from that.
In what sense are you “incapable of conceiving of death” if your computations manage to successfully identify potential causes of death and create plans which avoid them?
In what sense are you engaging in wishful thinking, if your planning algorithms work?
One might say: “The psychological claim of wishful thinking isn’t that humans fail to take disaster into account when they plan; the claim is, rather, that humans plan while inhabiting a psychological perspective in which they can’t fail. This lines up with the idea of sampling from the probability distribution in which failure isn’t an option.”
But this is too extreme. It’s true, when I idly muse about the future, I have a tendency to exclude my own death from it. Yet, I have a visceral fear of heights. When I am near the edge of a cliff, I feel like I am going to fall off and die. This image loops repeatedly even though it has never happened to me and my probability of taking a few steps forward and falling is very low. (It’s a fascinating experience: I often stand near ledges on purpose to experience the strong, visceral, unshakable belief that I’m about to fall, which fails to update on all evidence to the contrary.) If I were simply cognizing in the probability distribution which excludes death, I would avoid ledges and cliffs without thinking explicitly about the negative consequences.
And humans are quite capable of explicitly discussing the possibility of death, too.
My second issue with planning by inference is that it also introduces new biases—strange, inhuman biases.
In particular, a planning-by-inference agent cannot conceive of novel, complicated plans which achieve its goals. This is because updating on success doesn’t shift you from your prior as much as it should.
Suppose there is a narrow walkway across an abyss. You are a video game character: you have four directions you can walk (N, S, E, W) at any time. To get across the walkway, you have to go N thirty times in a row.
There are two ways to achieve success: you can open the chest next to you, which achieves success 10% of the time, and otherwise, results in the walkway disappearing. Or, you can cross the walkway, and open the box on the other side. This results in success 100% of the time. You know all of this.
Bayesian decision theory would recommend crossing the walkway.
Planning by inference will almost always open the nearby chest instead.
To see why, remember that we update on our prior. Since we don’t already know the optimal plan, our prior on actions is a uniform distribution over N, S, E, and W at every time-step. This means crossing the walkway has a prior probability of roughly (1/4)^30, on the order of 10^-18. Updating this prior on success, we find that it’s far more probable that we’ll succeed by opening the nearby chest.
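Here is a back-of-the-envelope version of the example (a toy sketch: I treat “open” as a fifth available action, so the exact exponent differs a little from the prose above):

```python
# Compare standard Bayesian decision theory with planning-by-inference
# on the walkway example. Assumed: five actions per step (N, S, E, W, open),
# a uniform prior over action sequences, and a 30-step walkway.
N_ACTIONS = 5
HORIZON = 30

# Plan A: open the nearby chest right away.
prior_A = 1 / N_ACTIONS                       # prior emits "open" first
success_given_A = 0.10

# Plan B: go N thirty times, then open the box on the far side.
prior_B = (1 / N_ACTIONS) ** (HORIZON + 1)    # the exact 31-action sequence
success_given_B = 1.0

# Bayesian decision theory: just compare success probabilities.
bdt_choice = "cross walkway" if success_given_B > success_given_A else "open chest"
print("Bayesian decision theory picks:", bdt_choice)

# Planning by inference: condition the action prior on success.
post_A = prior_A * success_given_A
post_B = prior_B * success_given_B
total = post_A + post_B
print(f"P(open chest | success)    = {post_A / total:.3e}")   # ~1.0
print(f"P(cross walkway | success) = {post_B / total:.3e}")   # ~1e-20
```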
Technical aside—the sense in which planning by inference minimizes prediction error is: it minimizes KL divergence between its action distribution and the distribution conditioned on success. (This is just a fancy way of saying you’re doing your best to match those probabilities.) It’s important to keep in mind that this is vaaastly different from actively planning to avoid prediction error. There is no “dark room problem” here. Indeed, planning-by-inference encourages exploration, rather than suppressing it—perhaps to the point of over-exploring (because planning-by-inference agents continue to use sub-optimal plans with frequency proportional to their probability of success, long after they’ve fully explored the possibilities).
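In symbols, my rendering of that statement (the direction of the KL is my choice; its unconstrained optimum is exactly “sample actions from the success-conditioned distribution”):

$$q^*(a) \;=\; \arg\min_{q}\, D_{\mathrm{KL}}\!\big(q(a)\,\|\,p(a \mid \mathrm{success})\big), \qquad q^*(a) = p(a \mid \mathrm{success}) \text{ when unconstrained.}$$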
2. Action is key, and it shapes and is shaped by perception. The map you build of any territory is prioritized and driven by the things you can act on most effectively. You don’t just learn “what is out there” but “what can I do with it”.
How are you comparing standard bayesian thinking with PP, such that PP comes out ahead in this respect?
Standard bayesian learning theory does just fine learning about acting, along with learning about everything else.
Standard bayesian decision theory offers a theory of acting based on that information.
Granted, standard bayesian theory has the agent learning about everything, regardless of its usefulness, rather than learning specifically those things which help it act. This is because standard Bayesian theory assumes sufficient processing power to fully update beliefs. However, I am unaware of any PP theory which improves on this state of affairs. Free-energy-minimization models can help deal with limited processing power by variational bayesian inference, but this minimizes the error of all beliefs, rather than providing a tool to specifically focus on those beliefs which will be useful for action (again, to my knowledge). Practical bayesian inference has some tools for focusing inference on the most useful parts, but I have never seen those tools especially associated with PP theory.
3. You care about prediction over the lifetime scale, so there’s an explore/exploit tradeoff between potentially acquiring better models and sticking with the old ones.
I’ve already mentioned some ways in which I think the PP treatment of explore/exploit is not a particularly good one. I think machine learning research has generated much better tools.
4. Prediction goes from the abstract to the detailed. You perceive specifics in a way that aligns with your general model, rarely in contradiction.
5. Updating always goes from the detailed to the abstract. It explains Kuhn’s paradigm shifts but for everything — you don’t change your general theory and then update the details, you accumulate error in the details and then the general theory switches all at once to slot them into place.
6. In general, your underlying models are a distribution but perception is always unified, whatever your leading model is. So when perception changes it does so abruptly.
This is the perceptual part of PP theory, which I have few issues with.
7. Attention is driven in a Bayesian way, to the places that are most likely to confirm/disconfirm your leading hypothesis, balancing the accuracy of perceiving the attended detail correctly and the leverage of that detail to your overall picture.
This is one part of perceptual PP which I do have an issue with. I have often read PP accounts of attention with some puzzlement.
PP essentially models perception as one big bayesian network with observations at the bottom and very abstract ideas at the top—which, fair enough. Attention is then modeled as a process which focuses inference on those parts of the network experiencing the most discordance between the top-down predictions and the bottom-up observations. This algorithm makes a lot of sense: there are similar algorithms in machine learning, for focusing belief propagation on the points where it is currently most needed, in order to efficiently propagate large changes across the network before we do any fine-tuning by propagating smaller, less-likely-to-be-important changes. (Why would the brain, a big parallel machine, need such an optimization? Why not propagate all the messages at once, in parallel? Because, biologically, we want to conserve resources. Areas of the brain which are doing more thinking actively consume more oxygen from the blood. Thinking hard is exhausting because it literally takes more energy.)
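A toy sketch of the scheduling idea only (region names and “surprise” numbers are invented; a real residual-style scheduler would also recompute and re-queue messages after each update):

```python
import heapq

def attend_by_surprise(discordance, budget=3):
    """Spend a limited processing budget on the regions whose top-down vs.
    bottom-up discordance ('surprise') is currently largest, leaving the
    fine-tuning of low-surprise regions for later (or never)."""
    heap = [(-s, region) for region, s in discordance.items()]
    heapq.heapify(heap)
    attended = []
    for _ in range(min(budget, len(heap))):
        neg_s, region = heapq.heappop(heap)
        attended.append((region, -neg_s))
    return attended

print(attend_by_surprise({"V1": 0.9, "V4": 0.2, "IT": 0.05, "PFC": 0.4}))
# [('V1', 0.9), ('PFC', 0.4), ('V4', 0.2)] -- the least surprising region waits
```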
So far so good.
The problem is, this does not explain conscious experience of attention. I think people are conflating this kind of processing prioritization with conscious experience. They see this nice math of “surprise” in bayesian networks (IE, discordance between bottom-up and top-down messages), and without realizing it, they form a mental image of a homunculus sitting outside the bayesian network and looking at the more surprising regions. (Because this reflects their internal experience pretty well.)
So, how can we get a similar picture without the homunculus?
One theory is that conscious experience is a global workspace which many areas in the brain have fast access to, for the purpose of quickly propagating information that is important to a lot of processes in the brain. I think this theory is a pretty good one. But this is very different from the bayes-net-propagation-prioritization picture. This LW post discusses the discordance.
This isn’t so much a strike against the PP picture of attention (it seems quite possible something like the PP mechanism is present), as a statement that there’s also something else going on—another distinct attention mechanism, which isn’t best understood in PP terms. Maybe it isn’t best understood in terms of a big bayes net, either, since it doesn’t really make sense for a big bayes net to have a global workspace.
If we imagine that the neocortex is more or less a big bayes net (with cortical columns as nodes), and the rest of the brain is (among other things, perhaps) an RL agent which utilizes the neocortex as its model, then this secondary attention mechanism is like a filter which determines which information gets from the neocortex to the RL agent. It can, of course, use the PP notion of attention as a strong heuristic determining how to filter information. I don’t think this necessarily captures everything that’s going on, but it is, in my opinion, better than the pure PP model.
I don’t want to get mired in discussing the details of predictive processing (least of all, the details of Friston’s free energy). Feel welcome to express any specific points you have, by all means. (I’d love a point by point response!!) But what I would really like to know is why you are interested in predictive processing in the first place. All the potential reasons I see seem to be based on empty promises. Yet, PP fans seem to think the ideas will eventually bear fruit. What heuristic is behind this positive expectation? Why are the ideas so promising? What’s so exciting about what you’ve seen? What are the deep generators?
There’s a whole lot to respond to here, and it may take the length of Surfing Uncertainty to do so. I’ll point instead to one key dimension.
You’re discussing PP as a possible model for AI, whereas I posit PP as a model for animal brains. The main difference is that animal brains are evolved and occur inside bodies.
Evolution is the answer to the dark room problem. You come with prebuilt hardware that is adapted to a certain adaptive niche, which is equivalent to modeling it. Your legs are a model of the shape of the ground and the size of your evolutionary territory. Your color vision is a model of berries in a bush, and so are the fingers that pick them. Your evolved body is a hyperprior you can’t update away. In a sense, you’re predicting all the things that are adaptive: being full of good food, in the company of allies and mates, being vigorous and healthy, learning new things. Lying hungry in a dark room creates a persistent error in your highest-order predictive models (the evolved ones) that you can’t change.
Your evolved prior supposes that you have a body, and that the way you persist over time is by using that body. You are not a disembodied agent learning things for fun or getting scored on some limited test of prediction or matching. Everything your brain does is oriented towards acting on the world effectively.
You can see that perception and action rely on the same mechanism in many ways, starting with the simple fact that when you look at something you don’t receive a static picture, but rather constantly saccade and shift your eyes, contract and expand your pupil and lens, move your head around, and also automatically compensate for all of this motion. None of this is relevant to an AI that processes images fed to it “out of the void”, and whose main objective function is something other than maintaining homeostasis of a living, moving body.
Zooming out, Friston’s core idea is a direct consequence of thermodynamics: for any system (like an organism) to persist in a state of low entropy (e.g. 98°F) in an environment that is higher entropy but contains some exploitable order (e.g. calories aren’t uniformly spread in the universe but concentrated in bananas), it must exploit this order. Exploiting it is equivalent to minimizing surprise, since if you’re surprised there is some pattern of the world that you failed to make use of (free energy).
Now, apply this basic principle to your genes persisting over an evolutionary time scale and to your body persisting over the time scale of decades, and this sets the stage for PP applied to animals.
For more, here’s a conversation between Clark, Friston, and an information theorist about the Dark Room problem.
Zooming out, Friston’s core idea is a direct consequence of thermodynamics: for any system (like an organism) to persist in a state of low entropy (e.g. 98°F) in an environment that is higher entropy but contains some exploitable order (e.g. calories aren’t uniformly spread in the universe but concentrated in bananas), it must exploit this order. Exploiting it is equivalent to minimizing surprise, since if you’re surprised there is some pattern of the world that you failed to make use of (free energy).
I haven’t yet understood the mathematical details of Friston’s arguments. I’ve been told that some of them are flawed. But it’s plausible to me that the particular mathematical argument you’re pointing at here is OK. However, I doubt the conclusion of that argument would especially convince me that the brain is set up with the particular sort of architecture described by PP. This, it seems to me, gets into the domain of PP as a theoretical model of ideal agency as opposed to a specific neurological hypothesis.
Humans did not perfectly inherit the abstract goals which would have been most evolutionarily beneficial. We are not fitness-maximizers. Similarly, even if all intelligent beings need to avoid entropy in order to keep living, that does not establish that we are entropy-minimizers at the core of our motivation system. As per my sibling comment, that’s like looking at a market economy and concluding that everyone is a money-maximizer. It’s not a necessary supposition, because we can also explain everyone’s money-seeking behavior by pointing out that money is very useful.
You can see that perception and action rely on the same mechanism in many ways, starting with the simple fact that when you look at something you don’t receive a static picture, but rather constantly saccade and shift your eyes, contract and expand your pupil and lens, move your head around, and also automatically compensate for all of this motion.
How does this suggest that perception and action rely on the same mechanism, as opposed to merely being very intertwined? I would certainly agree that motor control in vision has tight feedback loops with vision itself. What I don’t believe is that we should model this as acting so as to minimize prediction loss. For one thing, I’ve read that a pretty good model of saccade movement patterns is that we look at the most surprising parts of the image, which would be better modeled by moving the eyes so as to maximize predictive loss.
Babies look longer at objects which they find surprising, as opposed to those which they recognize.
It’s true that PP can predict some behaviors like this, because you’d do this in order to learn, so that you minimize future prediction error. But that doesn’t mean PP is helping us predict those eye movements.
In a world dependent on money, a money-minimizing person might still have to obtain and use money in order to survive and get to a point where they can successfully do without money. That doesn’t mean we can look at money-seeking behavior and conclude that a person is a money-minimizer. More likely that they’re a money-maximizer. But they could be any number of things, because in this world, you have to deal with money in a broad variety of circumstances.
Let me briefly sketch an anti-PP theory. According to what you’ve said so far, I understand you as saying that we act in a way which minimizes prediction error, but according to a warped prior which doesn’t just try to model reality statistically accurately, but rather, increases the probability of things like food, sex, etc in accordance with their importance (to evolutionary fitness). This causes us to seek those things.
My anti-PP theory is this: we act in a way which maximizes prediction error, but according to a warped prior which doesn’t just model reality statistically accurately, but rather, decreases the probability of things like food, sex, etc in accordance with their importance. This causes us to seek those things.
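To make the symmetry explicit, here is a toy sketch (a made-up two-action world with made-up numbers): minimizing surprise under a prior inflated toward food and maximizing surprise under a prior deflated away from food select the same action.

```python
import math

outcomes = {"eat": "food", "sit_still": "no_food"}

def surprise(p):
    return -math.log(p)

# PP-style: prior warped *up* for food; act to minimize surprise.
pp_prior = {"food": 0.9, "no_food": 0.1}
pp_choice = min(outcomes, key=lambda a: surprise(pp_prior[outcomes[a]]))

# Anti-PP: prior warped *down* for food; act to maximize surprise.
anti_prior = {"food": 0.1, "no_food": 0.9}
anti_choice = max(outcomes, key=lambda a: surprise(anti_prior[outcomes[a]]))

print(pp_choice, anti_choice)  # eat eat
```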
I don’t particularly believe anti-PP, but I find it to be more plausible than PP. It fits human behavior better. It fits eye saccades better. (The eye hits surprising parts of the image, plus sexually significant parts of the image. It stands to reason that sexually significant images are artificially “surprising” to our visual system, making them more interesting.) It fits curiosity and play behavior better.
By the way, I’m actually much more amenable to the version of PP in Kaj Sotala’s post on craving, where warping epistemics by forcing belief in success is just one motivation among several in the brain. I do think something similar to that seems to happen, although my explanation for it is much different (see my earlier comment). I just don’t buy that this is the basic action mechanism of the brain, governing all our behavior, since it seems like a large swath of our behavior is basically the opposite of what you’d expect under this hypothesis. Yes, these predictions can always be fixed by sufficiently modifying the prior, forcing the “pursuing minimal prediction error” hypothesis to line up with the data we see. However, because humans are curious creatures who look at surprising things, engage in experimental play, and like to explore, you’re going to have to take a sensible probability distribution and just about reverse the probabilities to explain those observations. At that point, you might as well switch to anti-PP theory.
You’re discussing PP as a possible model for AI, whereas I posit PP as a model for animal brains. The main difference is that animal brains are evolved and occur inside bodies.
So, for your project of re-writing rationality in PP, would PP constitute a model of human irrationality, and how to rectify it, in contrast to ideal rationality (which would not be well-described by PP)?
Or would you employ PP both as a model which explains human irrationality and as an ideal rationality notion, so that we can use it both as the framework in which we describe irrationality and as the framework in which we can understand what better rationality would be?
Evolution is the answer to the dark room problem. You come with prebuilt hardware that is adapted to a certain adaptive niche, which is equivalent to modeling it. Your legs are a model of the shape of the ground and the size of your evolutionary territory. Your color vision is a model of berries in a bush, and so are the fingers that pick them. Your evolved body is a hyperprior you can’t update away. In a sense, you’re predicting all the things that are adaptive: being full of good food, in the company of allies and mates, being vigorous and healthy, learning new things. Lying hungry in a dark room creates a persistent error in your highest-order predictive models (the evolved ones) that you can’t change.
Am I right in inferring from this that your preferred version of PP is one where we explicitly plan to minimize prediction error, as opposed to the Active Inference model (which instead minimizes KL divergence)? Or do you endorse an Active Inference type model?
This explanation in terms of evolution makes the PP theory consistent with observations, but does not give me a reason to believe PP. The added complexity to the prior is similar to the added complexity of other kinds of machinery to implement drives, so as yet I see no reason to prefer this explanation to other possible explanations of what’s going on in the brain.
My remarks about problems with different versions of PP can each be patched in various ways; these are not supposed to be “gotcha” arguments in the sense of “PP can’t explain this! / PP can’t deal with this!”. Rather, I’m trying to boggle at why PP looks promising in the first place, as a hypothesis to raise to our attention.
Each of the arguments I mentioned are about one way I might see that someone might think PP is doing some work for us, and why I don’t see that as a promising avenue.
So I remain curious what the generators of your view are.
I suspect some of the things that you want to use PP for, I would rather use my machine-learning model of meditation. The basic idea is that we are something like a model-based RL agent, but (pathologically) have some control over our attention mechanism. We can learn what kind of attention patterns are more useful. But we can also get our attention patterns into self-reinforcing loops, where we attend to the things which reinforce those attention patterns, and not things which punish them.
For example, when drinking too much, we might resist thinking about how we’ll hate ourselves tomorrow. This attention pattern is self-reinforcing, because it lets us drink more (yay!), while refusing to spend the necessary attention to propagate the negative consequences which might stop that behavior (and which would also harm the attention pattern). All our hurting tomorrow won’t de-enforce the pattern very effectively, because that pattern isn’t very active to be de-enforced, tomorrow. (RL works by propagating expected pain/pleasure shortly after we do things—it can achieve things on long time horizons because the expected pain/pleasure includes expectations on long time horizons, but the actual learning which updates an action only happens soon after we take that action.)
Wishful thinking works by avoiding painful thoughts. This is a self-reinforcing attention pattern for the same reason: if we avoid painful thoughts, we in particular avoid propagating the negative consequences of avoiding painful thoughts. Avoiding painful thoughts feels useful in the moment, because pain is pain. But this causes us to leave that important paperwork in the desk drawer for months, building up the problem, making us avoid it all the more. The more successful we are at not noticing it, the less the negative consequences propagate to the attention pattern which is creating the whole problem.
I have a weaker story for confirmation bias. Naturally, confirming a theory feels good, and getting disconfirmation feels bad. (This is not because we experience the basic neural feedback of perceptual PP as pain/pleasure, which would make us seek predictability and avoid predictive error—I don’t think that’s true, as I’ve discussed at length. Rather, this is more of a social thing. It feels bad to be proven wrong, because that often has negative consequences, especially in the ancestral environment.)
So attention patterns (and behavior patterns) which lead to being proven right will be reinforced. This is effectively one of those pathological self-reinforcing attention patterns, since it avoids its own disconfirmation, and hence, avoids propagating the consequences which would de-enforce it.
I would predict confirmation bias is strongest when we have every social incentive to prove ourselves right.
However, I doubt my story is the full story of confirmation bias. It doesn’t really explain performance in the Wason selection task, where you have to flip over cards to check whether “every vowel has an even number on the other side” or such things.
In any case, my theory is very much a just-so story which I contrived. Take with heap of salt.
PP tells us there are three ways you make your predictions match sensory input:
1. Change your underlying models and their predictions based on what you see.
2. Change your perception to fit with what you predicted.
3. Act on the world to bring the two into alignment.
I would clarify that #1 and #2 happen together. Given a large difference between prediction and observation, a confident prediction somewhat overwrites the perception (which helps us deal with noisy data), but the prediction is weakened, too.
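A minimal sketch of how #1 and #2 can happen in one step, using the standard precision-weighted (Gaussian) combination; this is my illustration, not a claim about the actual neural implementation:

```python
def fuse(pred_mean, pred_precision, obs, obs_precision):
    """Precision-weighted combination: a confident prediction pulls the
    percept toward itself (#2), while the same mismatch also moves the
    model's estimate toward the observation (#1)."""
    post_mean = (pred_precision * pred_mean + obs_precision * obs) / (
        pred_precision + obs_precision)
    post_precision = pred_precision + obs_precision
    return post_mean, post_precision

# Confident prediction (precision 9) meets a noisy observation (precision 1):
print(fuse(pred_mean=0.0, pred_precision=9.0, obs=1.0, obs_precision=1.0))
# (0.1, 10.0) -- the percept barely moves, but the model does get nudged
```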
And #3 is, of course, something I argued against in my other reply.
You meet cyan skinned people. If they’re blunt, you perceive that as nastiness. If they’re tactful, you perceive that as dishonesty. You literally see facial twitches and hear notes that aren’t there, PP making confirmation bias propagate all the way down to your basic senses.
Right, this makes sense.
If they’re actually nice, your brain gets a prediction error signal and tries to correct it with action. You taunt to provoke nastiness, or become intimidating to provoke dishonesty. You grow ever more confident in your excellent intuition with regards to those cyan bastards.
Why do you believe this?
I can believe that, in social circumstances, people act so as to make their predictions get confirmed, because this is important to group status. For example, (subconsciously) socially engineering a situation where the cyan-skinned person is trapped in a catch 22, where no matter what they do, you’ll be able to fit it into your narrative.
What I don’t believe in is a general mechanism whereby you act so as to confirm your predictions.
I already stated several reasons in my other comment. First, this does not follow easily from the bayes-net-like mechanisms of perceptual PP theory. They minimize prediction error in a totally different sense, reactively weakening parts of models which resulted in poor predictions, and strengthening models which predicted well. This offers no mechanism by which actions would be optimized such that we proactively minimize prediction error through our actions.
Second, it doesn’t fit, by and large, with human behavior. Humans are curious infovores; a better model would be that we actively plan to maximize prediction error, seeking out novel stimuli by steering toward parts of the state-space where our current predictive ability is poor. (Both of these models are poor, but the information-loving model is better.) Give a human a random doodad and they’ll fiddle with it, doing things just to see what will happen. I think people make a sign error, thinking PP predicts info-loving behavior because this maximizes learning, which intuitively might sound like minimizing prediction error. But it’s quite the opposite: maximizing learning means planning to maximize prediction error.
Third, the activity of any highly competent agent will naturally be highly predictable to that agent, so it’s easy to think that it’s “minimizing prediction error” by following probable lines of action. This explains away a lot of examples of “minimizing prediction error”, in that we don’t need to posit any separate mechanism to explain what’s going on. A highly competent agent isn’t necessarily actively minimizing prediction error, just because it’s managed to steer things into a predictable state. It’s got other goals.
Furthermore, anything which attempts to maintain any kind of homeostasis will express behaviors which can naturally be described as “reducing errors”—we put on a sweater when it’s too cold, take it off when it’s too hot, etc. If we’re any good at maintaining our homeostasis, this broadly looks sorta like minimizing prediction error (because statistically, we’re typically closer to our homeostatic set point), but it’s not.
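To put the point starkly, a bare set-point controller “reduces error” without predicting anything at all; describing it as minimizing prediction error adds nothing. A toy sketch with made-up thresholds:

```python
def thermostat(felt_temp_c, set_point_c=37.0, tolerance_c=0.5):
    """Homeostasis without prediction: act whenever the state drifts from
    the set point. From the outside this looks like 'error minimization',
    but there is no predictive model anywhere in the loop."""
    if felt_temp_c < set_point_c - tolerance_c:
        return "put on a sweater"
    if felt_temp_c > set_point_c + tolerance_c:
        return "take off the sweater"
    return "do nothing"

print(thermostat(35.0))  # put on a sweater
```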
This is why confirmation bias is the mother of all bias. CB doesn’t just conveniently ignore conflicting data. It reinforces itself in your explicit beliefs, in unconscious intuition, in raw perception, AND in action. It can grow from nothing and become impossible to dislodge.
I consider this to be on shaky grounds. Perceptual PP theory is abstracted from the math of bayesian networks, which avoid self-reinforcing beliefs like this. As I mentioned earlier, #1 and #2 happen simultaneously. So the top-down theories should weaken, even as they impose themselves tyrannically on perception. A self-reinforcing feedback loop requires a more complicated explanation.
On the other hand, this can happen in loopy bayesian networks, when approximate inference is done via loopy belief propagation. For example, there’s a formal result that loopy belief propagation in Gaussian bayes nets converges to the correct means, but with overconfident (too-narrow) variances.
So, maybe.
But loopy belief prop is just one approximate inference method for bayes nets, and it makes sense that evolution would fine-tune the inference of the brain to perform quite well at perceptual tasks. This could include adjustments to account for the predictable biases of loopy belief propagation, EG artificially decreasing confidence to make it closer to what it should be.
My point isn’t that you’re outright wrong about this one, it just seems like it’s not a strong prediction of the model.
What I don’t believe in is a general mechanism whereby you act so as to confirm your predictions.
I had understood (via one-sentence summary, so lossy in the extreme) that this was approximately how motor control worked. Is this a wrong understanding? If not, what separates the motor control mechanism from the perception mechanism?
Off the top of my head, here are some new things it adds:
1. You have 3 ways of avoiding prediction error: updating your models, changing your perception, acting on the world. Those are always in play and you often do all three in some combination (see my model of confirmation bias in action).
2. Action is key, and it shapes and is shaped by perception. The map you build of any territory is prioritized and driven by the things you can act on most effectively. You don’t just learn “what is out there” but “what can I do with it”.
3. You care about prediction over the lifetime scale, so there’s an explore/exploit tradeoff between potentially acquiring better models and sticking with the old ones.
4. Prediction goes from the abstract to the detailed. You perceive specifics in a way that aligns with your general model, rarely in contradiction.
5. Updating always goes from the detailed to the abstract. It explains Kuhn’s paradigm shifts but for everything — you don’t change your general theory and then update the details, you accumulate error in the details and then the general theory switches all at once to slot them into place.
6. In general, your underlying models are a distribution but perception is always unified, whatever your leading model is. So when perception changes it does so abruptly.
7. Attention is driven in a Bayesian way, to the places that are most likely to confirm/disconfirm your leading hypothesis, balancing the accuracy of perceiving the attended detail correctly and the leverage of that detail to your overall picture.
8. Emotions through the lens of PP.
9. Identity through the lens of PP.
10. The above is fractal, applying at all levels from a small subconscious module to a community of people.
FYI Jacobian, very high in the review-request-thread is a post on neural annealing. I think many people would be interested in reading your review of that post.
(Thank you very much for this review as well :D )
PP is not one thing. This makes it very difficult for me to say what I don’t like about it, since no one element seems to be necessarily present in all the different versions. What follows are some remarks about specific ideas I’ve seen associated with PP, many of them contradictory. Do let me know which ideas you endorse / don’t endorse.
It is also possible that each of my points is based on a particular misconception about PP. While I’ve made some effort to be well-informed about PP, I have not spent so much time on it, so my understanding is definitely shallow.
The three main meanings of PP (each of which is a cluster, containing many many different sub-meanings, as you flesh out the details in different ways):
A theory of perception. If you look PP up on Wikipedia, the term primarily refers to a theory of perceptual processing in which prediction plays a central role, and observations interact with predictions to provide a feedback signal for learning. So, the theory is that perception is fundamentally about minimizing prediction error. I basically believe this theory. So let’s set it aside.
A theory of action. Some people took the idea “the brain minimizes prediction error” and tried to apply it to motor control, too—and to everything else in the brain. I think this kind of made sense as a thing to try (unifying these two things is a worthwhile goal!), but doesn’t go anywhere. I’ll have a lot to say about this. This theory is what I’ll mean when I say PP—it is, in my experience, what rationalists and rationalist-adjacent people primarily mean by “PP”.
A theory of everything. Friston’s free-energy principle. This is not only supposed to apply to the human brain, but also evolution, and essentially any physical system. I have it on good authority that the math in Friston’s papers is full of errors, and no one who has been excited about this (that I’ve seen) has also claimed to understand it.
The PP theory of perception says that the brain “minimizes prediction error” in the sense that it is always engaged in the business of predicting, and compares the predictions to observations in order to generate feedback. This could be like gradient descent, or like Bayesian updates.
Actively planning to minimize prediction error, or learning policies which minimize prediction error, is a totally different thing which requires different mechanisms.
Consider that minimizing prediction error in the sense required for prediction means making each individual prediction as accurate as possible, which means, being totally myopic. An error on a specific prediction means making an adjustment on that specific prediction. The credit assignment problem is easily solved; we know exactly what led to that specific prediction, so we can propagate all the relevant errors and make the necessary adjustments.
On the other hand, with planning and policy learning, there is a nontrivial (indeed, severe) credit assignment problem. We don’t know which outputs lead to which error signals later. Therefore, we need an entirely different learning mechanism. Indeed, as I argued in The Credit Assignment Problem, we basically need a world model in order to assign credit. This makes it very hard to unify the theory of perception with the theory of action, because one needs the other as input!
In any case, why do you want to suppose that humans take actions in a way which minimizes prediction error? I think this is a poor model. There’s the standard “dark room problem” objection: if humans wanted to minimize prediction error, they would like sensory deprivation chambers a whole lot more than they seem to. Instead, humans like to turn on the radio, watch TV, read a book, etc when they don’t have anything else to do. Simply put, we are curious creatures, who do not like being bored. Yes, we also don’t like too much excitement of the wrong kind, but we are closer to infophillic than infophobic! And this makes sense from an evolutionary perspective. Machine learning has found that reinforcement learning agents do better when you have a basic mechanism to encourage exploration, because it’s easy to under-explore, but hard to properly explore. One way to do this is to actively reinforce prediction error; IE, the agents are actually maximizing prediction error! (as one component of more complicated values, perhaps.)
I’ve seen PP blog posts take this in stride, explaining that it’s important to explore in order to get better at doing things so that you can minimize prediction error later. I’ve seen technical derivations of a “curiosity drive” on this premise. And sure, that’s technically true. But that doesn’t change that you’re postulating a drive which discourages exploration, all things considered, when it’s more probable (based on parallels with RL) that evolution would add a drive to explicitly encourage exploration.
Perhaps this is part of why one of the most common PP formalisms doesn’t actually propose to minimize prediction error in either of the two above senses (IE, correcting predictions via feedback, or taking actions which make future prediction error less).
The primary theoretical tool by which PP seeks to explain action is active inference. According to this method, we can select actions by first conditioning on our success, and then sampling actions from that distribution. I sometimes see this justified as a practical way to leverage inference machinery to make decisions. We can judge that on its pragmatic merits. (I think it’s not common to use it purely to get the job done—techniques such as reinforcement learning mostly work better.) Other times, I’ve heard it associated with the idea that people can’t conceive of failure (particularly true failure of core values), or with other forms of wishful thinking.
My first complaint is that this is usually not different enough from standard Bayesian decision theory to account for the biases it purports to predict. For example, to plan to avoid death, you have to start with a realistic world-model which includes all the ways you could die, and then condition on not dying, and then sample actions from that.
In what sense are you “incapable of conceiving of death” if your computations manage to successfully identify potential causes of death and create plans which avoid them?
In what sense are you engaging in wishful thinking, if your planning algorithms work?
One might say: “The psychological claim of wishful thinking isn’t that humans fail to take disaster into account when they plan; the claim is, rather, that humans plan while inhabiting a psychological perspective in which they can’t fail. This lines up with the idea of sampling from the probability distribution in which failure isn’t an option.”
But this is too extreme. It’s true, when I idly muse about the future, I have a tendency to exclude my own death from it. Yet, I have a visceral fear of heights. When I am near the edge of a cliff, I feel like I am going to fall off and die. This image loops repeatedly even though it has never happened to me and my probability of taking a few steps forward and falling is very low. (It’s a fascinating experience: I often stand near ledges on purpose to experience the strong, visceral, unshakable belief that I’m about to fall, which fails to update on all evidence to the contrary.) If I were simply cognizing in the probability distribution which excludes death, I would avoid ledges and cliffs without thinking explicitly about the negative consequences.
And humans are quite capable of explicitly discussing the possibility of death, too.
My second issue with planning by inference is that it also introduces new biases—strange, inhuman biases.
In particular, a planning-by-inference agent cannot conceive of novel, complicated plans which achieve its goals. This is because updating on success doesn’t shift you from your prior as much as it should.
Suppose there is a narrow walkway across an abyss. You are a video game character: you have four directions you can walk (N, S, E, W) at any time. To get across the walkway, you have to go N thirty times in a row.
There are two ways to achieve success: you can open the chest next to you, which achieves success 10% of the time, and otherwise, results in the walkway disappearing. Or, you can cross the walkway, and open the box on the other side. This results in success 100% of the time. You know all of this.
Bayesian decision theory would recommend crossing the walkway.
Planning by inference will almost always open the nearby chest instead.
To see why, remember than we update on our prior. Since we don’t already know the optimal plan, our prior on actions is an even distribution between N, S, E and W at all time-steps. This means crossing the walkway has a prior probability of approximately 10−20. Updating this prior on success, we find that it’s far more probable that we’ll succeed by opening the nearby chest.
Technical aside—the sense in which planning by inference minimizes prediction error is: it minimizes KL divergence between its action distribution and the distribution conditioned on success. (This is just a fancy way of saying you’re doing your best to match those probabilities.) It’s important to keep in mind that this is vaaastly different from actively planning to avoid prediction error. There is no “dark room problem” here. Indeed, planning-by-inference encourages exploration, rather than suppressing it—perhaps to the point of over-exploring (because planning-by-inference agents continue to use sub-optimal plans with frequency proportional to their probability of success, long after they’ve fully explored the possibilities).
How are you comparing standard bayesian thinking with PP, such that PP comes out ahead in this respect?
Standard bayesian learning theory does just fine learning about acting, along with learning about everything else.
Standard bayesian decision theory offers a theory of acting based on that information.
Granted, standard bayesian theory has the agent learning about everything, regardless of its usefulness, rather than learning specifically those things which help it act. This is because standard Bayesian theory assumes sufficient processing power to fully update beliefs. However, I am unaware of any PP theory which improves on this state of affairs. Free-energy-minimization models can help deal with limited processing power by variational bayesian inference, but this minimizes the error of all beliefs, rather than providing a tool to specifically focus on those beliefs which will be useful for action (again, to my knowledge). Practical bayesian inference has some tools for focusing inference on the most useful parts, but I have never seen those tools especially associated with PP theory.
I’ve already mentioned some ways in which I think the PP treatment of explore/exploit is not a particularly good one. I think machine learning research has generated much better tools.
This is the perceptual part of PP theory, which I have few issues with.
This is one part of perceptual PP which I do have an issue with. I have often read PP accounts of attention with some puzzlement.
PP essentially models perception as one big bayesian network with observations at the bottom and very abstract ideas at the top—which, fair enough. Attention is then modeled as a process which focuses inference on those parts of the network experiencing the most discordance between the top-down predictions and the bottom-up observations. This algorithm makes a lot of sense: there are similar algorithms in machine learning, for focusing belief propagation on the points where it is currently most needed, in order to efficiently propagate large changes across the network before we do any fine-tuning by propagating smaller, less-likely-to-be-important changes. (Why would the brain, a big parallel machine, need such an optimization? Why not propagate all the messages at once, in parallel? Because, biologically, we want to conserve resources. Areas of the brain which are doing more thinking actively consume more oxygen from the blood. Thinking hard is exhausting because it literally takes more energy.)
So far so good.
The problem is, this does not explain conscious experience of attention. I think people are conflating this kind of processing prioritization with conscious experience. They see this nice math of “surprise” in bayesian networks (IE, discordance between bottom-up and top-down messages), and without realizing it, they form a mental image of a humunculus sitting outside the bayesian network and looking at the more surprising regions. (Because this reflects their internal experience pretty well.)
So, how can we get a similar picture without the humunculus?
One theory is that conscious experience is a global workspace which many areas in the brain have fast access to, for the purpose of quickly propagating information that is important to a lot of processes in the brain. I think this theory is a pretty good one. But this is very different from the bayes-net-propagation-prioritization picture. This LW post discusses the discordance.
This isn’t so much a strike against the PP picture of attention (it seems quite possible something like the PP mechanism is present), as a statement that there’s also something else going on—another distinct attention mechanism, which isn’t best understood in PP terms. Maybe which isn’t best understood in terms of a big bayes net, either, since it doesn’t really make sense for a big bayes net to have a global workspace.
If we imagine that the neocortex is more or less a big bayes net (with cortical columns as nodes), and the rest of the brain is (among other things, perhaps) an RL agent which utilizes the neocortex as its model, then this secondary attention mechanism is like a filter which determines which information gets from the neocortex to the RL agent. It can, of course, use the PP notion of attention as a strong heuristic determining how to filter information. I don’t think this necessarily captures everything that’s going on, but it is, in my opinion, better than the pure PP model.
I don’t want to get mired down in discussing the details of predictive processing (least of all, the details of Friston’s free energy). Feel welcomed to express any specific points you have, by all means. (I’d love a point by point response!!) But what I would really like to know is why you are interested in predictive processing in the first place. All the potential reasons I see seem to be based on empty promises. Yet, PP fans seems to think the ideas will eventually bear fruit. What heuristic is behind this positive expectation? Why are the ideas so promising? What’s so exciting about what you’ve seen? What are the deep generators?
There’s a whole lot to respond to here, and it may take the length of Surfing Uncertainty to do so. I’ll point instead to one key dimension.
You’re discussing PP as a possible model for AI, whereas I posit PP as a model for animal brains. The main difference is that animal brains are evolved and occur inside bodies.
Evolution is the answer to the dark room problem. You come with prebuilt hardware that is adapted a certain adaptive niche, which is equivalent to modeling it. Your legs are a model of the shape of the ground and the size of your evolutionary territory. Your color vision is a model of berries in a bush, and your fingers that pick them. Your evolved body is a hyperprior you can’t update away. In a sense, you’re predicting all the things that are adaptive: being full of good food, in the company of allies and mates, being vigorous and healthy, learning new things. Lying hungry in a dark room creates a persistent error in your highest-order predictive models (the evolved ones) that you can’t change.
Your evolved prior supposes that you have a body, and that the way you persist over time is by using that body. You are not a disembodied agent learning things for fun or getting scored on some limited test of prediction or matching. Everything your brain does is oriented towards acting on the world effectively.
You can see that perception and action rely on the same mechanism in many ways, starting with the simple fact that when you look at something you don’t receive a static picture, but rather constantly saccade and shift your eyes, contract and expand your pupil and cornea, move your head around, and also automatically compensate for all of this motion. None of this is relevant to an AI who processes images fed to it “out of the void”, and whose main objective function is something other than maintaining homeostasis of a living, moving body.
Zooming out, Friston’s core idea is a direct consequence of thermodynamics: for any system (like an organism) to persist in a state of low entropy (e.g. 98°F) in an environment that is higher entropy but contains some exploitable order (e.g. calories aren’t uniformly spread in the universe but concentrated in bananas), it must exploit this order. Exploiting it is equivalent to minimizing surprise, since if you’re surprised there some pattern of the world that you failed to make use of (free energy).
Now if you just apply this basic principle to your genes persisting over an evolutionary time scale and your body persisting over the time scale of decades and this sets the stage for PP applied to animals.
For more, here’s a conversation between Clark, Friston, and an information theorist about the Dark Room problem.
I haven’t yet understood the mathematical details of Friston’s arguments. I’ve been told that some of them are flawed. But it’s plausible to me that the particular mathematical argument you’re pointing at here is OK. However, I doubt the conclusion of that argument would especially convince me that the brain is set up with the particular sort of architecture described by PP. This, it seems to me, gets into the domain of PP as a theoretical model of ideal agency as opposed to a specific neurological hypothesis.
Humans did not perfectly inherit the abstract goals which would have been most evolutionarily beneficial. We are not fitness-maximizers. Similarly, even if all intelligent beings need to avoid entropy in order to keep living, that does not establish that we are entropy-minimizers at the core of our motivation system. As per my sibling comment, that’s like looking at a market economy and concluding that everyone is a money-maximizer. It’s not a necessary supposition, because we can also explain everyone’s money-seeking behavior by pointing out that money is very useful.
How does this suggest that perception and action rely on the same mechanism, as opposed to being very intertwined? I would certainly agree that motor control in vision has tight feedback loops with vision itself. What I don’t believe is that we should model this as acting so as to minimize prediction error. For one thing, I’ve read that a pretty good model of saccade movement patterns is that we look at the most surprising parts of the image, which would be better modeled as moving the eyes so as to maximize prediction error.
Babies look longer at objects which they find surprising, as opposed to those which they recognize.
It’s true that PP can predict some behaviors like this, because you’d do this in order to learn, so that you minimize future prediction error. But that doesn’t mean PP is helping us predict those eye movements.
In a world dependent on money, a money-minimizing person might still have to obtain and use money in order to survive and get to a point where they can successfully do without money. That doesn’t mean we can look at money-seeking behavior and conclude that a person is a money-minimizer. More likely that they’re a money-maximizer. But they could be any number of things, because in this world, you have to deal with money in a broad variety of circumstances.
Let me briefly sketch an anti-PP theory. From what you’ve said so far, I understand you as saying that we act in a way which minimizes prediction error, but according to a warped prior which doesn’t just try to model reality statistically accurately, but rather, increases the probability of things like food, sex, etc in accordance with their importance (to evolutionary fitness). This causes us to seek those things.
My anti-PP theory is this: we act in a way which maximizes prediction error, but according to a warped prior which doesn’t just model reality statistically accurately, but rather, decreases the probability of things like food, sex, etc in accordance with their importance. This causes us to seek those things.
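To make the symmetry concrete, here is a toy sketch of my own (the outcome labels and probabilities are made up purely for illustration): warp the prior in opposite directions and the surprise-minimizing agent and the surprise-maximizing agent pick the exact same thing.

```python
# Toy illustration (my own sketch, not anyone's published model): two agents
# choose which outcome to pursue, "food" or "dark_room". The "PP" agent
# minimizes surprise under a prior warped *toward* food; the "anti-PP" agent
# maximizes surprise under a prior warped *away* from food. Same choice.
import math

outcomes = ["food", "dark_room"]

# Hypothetical warped priors; the numbers are made up for illustration.
pp_prior      = {"food": 0.9, "dark_room": 0.1}  # food made *likely*
anti_pp_prior = {"food": 0.1, "dark_room": 0.9}  # food made *unlikely*

def surprise(outcome, prior):
    """Surprisal -log p(outcome) under the given prior."""
    return -math.log(prior[outcome])

# PP agent: pursue the outcome with the *least* surprise.
pp_choice = min(outcomes, key=lambda o: surprise(o, pp_prior))

# Anti-PP agent: pursue the outcome with the *most* surprise.
anti_pp_choice = max(outcomes, key=lambda o: surprise(o, anti_pp_prior))

print(pp_choice, anti_pp_choice)  # both come out as "food"
```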
I don’t particularly believe anti-PP, but I find it to be more plausible than PP. It fits human behavior better. It fits eye saccades better. (The eye hits surprising parts of the image, plus sexually significant parts of the image. It stands to reason that sexually significant images are artificially “surprising” to our visual system, making them more interesting.) It fits curiosity and play behavior better.
By the way, I’m actually much more amenable to the version of PP in Kaj Sotala’s post on craving, where warping epistemics by forcing belief in success is just one motivation among several in the brain. I do think something like that happens, although my explanation for it is quite different (see my earlier comment). I just don’t buy that this is the basic action mechanism of the brain, governing all our behavior, since it seems like a large swath of our behavior is basically the opposite of what you’d expect under this hypothesis. Yes, these predictions can always be fixed by sufficiently modifying the prior, forcing the “pursuing minimal prediction error” hypothesis to line up with the data we see. However, because humans are curious creatures who look at surprising things, engage in experimental play, and like to explore, you’re going to have to take a sensible probability distribution and just about reverse the probabilities to explain those observations. At that point, you might as well switch to anti-PP theory.
So, for your project of re-writing rationality in PP, would PP constitute a model of human irrationality, and how to rectify it, in contrast to ideal rationality (which would not be well-described by PP)?
Or would you employ PP both as a model which explains human irrationality and as an ideal rationality notion, so that we can use it both as the framework in which we describe irrationality and as the framework in which we can understand what better rationality would be?
Am I right in inferring from this that your preferred version of PP is one where we explicitly plan to minimize prediction error, as opposed to the Active Inference model (which instead minimizes KL divergence)? Or do you endorse an Active Inference type model?
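(For reference, the Active-Inference-style objective I have in mind is roughly the following, stated loosely since the exact functional varies from paper to paper: a policy is scored by how close the outcomes it predicts are to a fixed “preferred” distribution over outcomes.)

```latex
% KL-control reading of Active Inference (loose statement): policy \pi predicts
% outcomes q(o \mid \pi); \tilde{p}(o) is a fixed "preferred" outcome prior.
\pi^{*} \;=\; \arg\min_{\pi}\; \mathrm{KL}\!\left[\, q(o \mid \pi) \,\big\|\, \tilde{p}(o) \,\right]
```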
This explanation in terms of evolution makes the PP theory consistent with observations, but it does not give me a reason to believe PP. The added complexity in the prior is similar to the added complexity of other kinds of machinery for implementing drives, so as yet I see no reason to prefer this explanation to other possible explanations of what’s going on in the brain.
My remarks about problems with different versions of PP can each be patched in various ways; these are not supposed to be “gotcha” arguments in the sense of “PP can’t explain this! / PP can’t deal with this!”. Rather, I’m trying to boggle at why PP looks promising in the first place, as a hypothesis to raise to our attention.
Each of the arguments I mentioned is about one way someone might think PP is doing some work for us, and why I don’t see that as a promising avenue.
So I remain curious what the generators of your view are.
I suspect some of the things that you want to use PP for, I would rather use my machine-learning model of meditation. The basic idea is that we are something like a model-based RL agent, but (pathologically) have some control over our attention mechanism. We can learn what kind of attention patterns are more useful. But we can also get our attention patterns into self-reinforcing loops, where we attend to the things which reinforce those attention patterns, and not things which punish them.
For example, when drinking too much, we might resist thinking about how we’ll hate ourselves tomorrow. This attention pattern is self-reinforcing, because it lets us drink more (yay!) while refusing to spend the attention needed to propagate the negative consequences that might stop the behavior (and that would also harm the attention pattern). All our hurting tomorrow won’t de-enforce the pattern very effectively, because the pattern isn’t very active tomorrow, when the punishment arrives. (RL works by propagating expected pain/pleasure shortly after we do things. It can achieve things on long time horizons because the expected pain/pleasure includes long-horizon expectations, but the actual learning that updates an action only happens soon after we take that action.)
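(For concreteness, here is the kind of credit assignment I mean, as a minimal TD(0)-style sketch; the state names, rewards, and constants are illustrative assumptions of mine, not claims about how the brain implements this.)

```python
# Minimal TD(0)-style sketch (toy illustration only): the update to a state's
# value happens right after visiting it, using the immediate reward plus a
# *bootstrapped* estimate of the next state's value. Long-horizon consequences
# only reach earlier states indirectly, through that bootstrapped estimate.
from collections import defaultdict

values = defaultdict(float)   # estimated long-run value of each state
alpha, gamma = 0.1, 0.95      # learning rate, discount factor

def td_update(state, reward, next_state):
    """Nudge value(state) toward reward + discounted estimate of next_state."""
    target = reward + gamma * values[next_state]
    values[state] += alpha * (target - values[state])

# Example: "drink more" feels good now; the hangover lands on a *later* state,
# so tonight's state only learns about it indirectly, via values["tomorrow"].
td_update("drinking", reward=+1.0, next_state="tomorrow")
td_update("tomorrow", reward=-3.0, next_state="day_after")
```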
Wishful thinking works by avoiding painful thoughts. This is a self-reinforcing attention pattern for the same reason: if we avoid painful thoughts, we in particular avoid propagating the negative consequences of avoiding painful thoughts. Avoiding painful thoughts feels useful in the moment, because pain is pain. But this causes us to leave that important paperwork in the desk drawer for months, building up the problem, making us avoid it all the more. The more successful we are at not noticing it, the less the negative consequences propagate to the attention pattern which is creating the whole problem.
I have a weaker story for confirmation bias. Naturally, confirming a theory feels good, and getting disconfirmation feels bad. (This is not because we experience the basic neural feedback of perceptual PP as pain/pleasure, which would make us seek predictability and avoid prediction error; I don’t think that’s true, as I’ve discussed at length. Rather, this is more of a social thing. It feels bad to be proven wrong, because that often has negative consequences, especially in the ancestral environment.)
So attention patterns (and behavior patterns) which lead to being proven right will be reinforced. This is effectively one of those pathological self-reinforcing attention patterns, since it avoids its own disconfirmation, and hence, avoids propagating the consequences which would de-enforce it.
I would predict confirmation bias is strongest when we have every social incentive to prove ourselves right.
However, I doubt my story is the full story of confirmation bias. It doesn’t really explain performance on the Wason selection task, where you have to flip over cards to check whether “every vowel has an even number on the other side”, and that sort of thing.
In any case, my theory is very much a just-so story which I contrived. Take it with a heap of salt.
Quoting from that, and responding:
I would clarify that #1 and #2 happen together. Given a large difference between prediction and observation, a confident prediction somewhat overwrites the perception (which helps us deal with noisy data), but the prediction is weakened, too.
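In the Gaussian toy case that predictive-coding write-ups tend to use, both effects fall out of a single precision-weighted combination (this is just the standard conjugate update, sketched for illustration): the percept is pulled toward the prediction in proportion to how much the prediction is trusted, and the model’s estimate is pulled toward the data by the very same formula.

```latex
% Prediction (prior) with mean \mu_p and precision \pi_p; observation x with
% precision \pi_x. The combined estimate sits between them:
\mu_{\text{post}} \;=\; \frac{\pi_p \mu_p + \pi_x x}{\pi_p + \pi_x},
\qquad
\pi_{\text{post}} \;=\; \pi_p + \pi_x
```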
And #3 is, of course, something I argued against in my other reply.
Right, this makes sense.
Why do you believe this?
I can believe that, in social circumstances, people act so as to make their predictions get confirmed, because this is important to group status. For example, (subconsciously) socially engineering a situation where the cyan-skinned person is trapped in a catch-22, where no matter what they do, you’ll be able to fit it into your narrative.
What I don’t believe in is a general mechanism whereby you act so as to confirm your predictions.
I already stated several reasons in my other comment. First, this does not follow easily from the bayes-net-like mechanisms of perceptual PP theory. They minimize prediction error in a totally different sense: reactively weakening the parts of models which resulted in poor predictions, and strengthening the models which made strong predictions. This offers no mechanism by which actions would be optimized such that we proactively minimize prediction error through our actions.
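Here is the “reactive” sense in toy form (a delta-rule sketch of my own, not a claim about the actual cortical algorithm): after each observation, the residual gets shrunk, and nothing in the loop evaluates or chooses actions.

```python
# Toy delta-rule sketch: a linear generative model is nudged after each
# observation so that its prediction error shrinks. No action is ever scored
# or selected anywhere in this loop.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=3)   # toy generative model: prediction = weights . features
learning_rate = 0.05

def perceive_and_update(features, observation):
    """Compute the residual for this input, then adjust the weights to reduce it."""
    global weights
    prediction = weights @ features
    error = observation - prediction            # prediction error, after the fact
    weights += learning_rate * error * features  # weaken/strengthen accordingly
    return error

features = np.array([1.0, 0.5, -0.3])
for _ in range(100):
    perceive_and_update(features, observation=2.0)
print(weights @ features)  # ≈ 2.0: the error has shrunk, but no action was chosen
```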
Second, it doesn’t fit, by and large, with human behavior. Humans are curious infovores; a better model would be that we actively plan to maximize prediction error, seeking out novel stimuli by steering toward parts of the state-space where our current predictive ability is poor. (Both of these models are poor, but the information-loving model is better.) Give a human a random doodad and they’ll fiddle with it, doing things just to see what happens. I think people make a sign error here: PP seems to predict info-loving behavior because info-loving maximizes learning, which intuitively sounds like minimizing prediction error. But it’s quite the opposite: maximizing learning means planning to maximize prediction error.
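(A small identity makes the sign point explicit. For a hidden quantity S you’re curious about and the observation O an experiment would yield, expected information gain decomposes as below; so for roughly fixed observation noise H(O|S), picking the most informative experiment means picking the one whose outcome you currently predict worst, i.e. the one with the highest expected surprisal.)

```latex
I(S; O) \;=\; H(O) - H(O \mid S),
\qquad
H(O) \;=\; \mathbb{E}\!\left[-\log p(O)\right]
```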
Third, the activity of any highly competent agent will naturally be highly predictable to that agent, so it’s easy to think that it’s “minimizing prediction error” by following probable lines of action. This explains away a lot of examples of “minimizing prediction error”, in that we don’t need to posit any separate mechanism to explain what’s going on. A highly competent agent isn’t necessarily actively minimizing prediction error, just because it’s managed to steer things into a predictable state. It’s got other goals.
Furthermore, anything which attempts to maintain any kind of homeostasis will exhibit behaviors which can naturally be described as “reducing errors”: we put on a sweater when it’s too cold, take it off when it’s too hot, etc. If we’re any good at maintaining our homeostasis, this broadly looks sort of like minimizing prediction error (because, statistically, we stay close to our homeostatic set points), but it isn’t.
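(The thermostat version of that point, as a toy sketch of my own: the loop below cancels an error relative to a fixed set point, and its state ends up hovering near that set point, yet nothing in it predicts anything.)

```python
# Toy thermostat-style controller (purely illustrative): each step it cancels
# part of the gap to a fixed set point while the environment perturbs the
# temperature. The result hovers near the set point, which can *look like*
# low prediction error, but there are no predictions anywhere in here.
import random

def homeostat_step(temperature, set_point=37.0, gain=0.5):
    """Close part of the gap to the set point (put on / take off a sweater)."""
    error = set_point - temperature
    return temperature + gain * error

random.seed(0)
t = 33.0
for _ in range(50):
    t = homeostat_step(t) + random.gauss(0.0, 0.2)  # environmental disturbance
print(round(t, 2))  # close to 37.0
```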
I consider this to be on shaky grounds. Perceptual PP theory is abstracted from the math of bayesian networks, which avoid self-reinforcing beliefs like this. As I mentioned earlier, #1 and #2 happen simultaneously. So the top-down theories should weaken, even as they impose themselves tyrannically on perception. A self-reinforcing feedback loop requires a more complicated explanation.
On the other hand, this can happen in loopy bayesian networks, when approximate inference is done via loopy belief propagation. For example, there’s a formal result that loopy belief propagation on Gaussian bayes nets ends up with the correct mean-value beliefs, but with too-high confidence (too-small variances).
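(If you want to see that result concretely, here is a small numerical sketch I put together, not taken from any particular paper: Gaussian loopy belief propagation on a three-node cycle with attractive couplings recovers the exact marginal means but reports variances smaller than the true ones.)

```python
# Toy numerical check (my own sketch): Gaussian loopy belief propagation on a
# three-node cycle with attractive couplings. At convergence the means match
# the exact marginals, but the reported variances are smaller than the true
# ones, i.e. the beliefs are overconfident.
import numpy as np

# Joint density p(x) proportional to exp(-0.5 x^T J x + h^T x) on a 3-cycle.
J = np.array([[ 1.0, -0.4, -0.4],
              [-0.4,  1.0, -0.4],
              [-0.4, -0.4,  1.0]])
h = np.array([1.0, 0.0, 0.0])

# Exact marginals, for comparison.
cov = np.linalg.inv(J)
true_mean, true_var = cov @ h, np.diag(cov)

n = len(h)
P = np.zeros((n, n))  # message precisions: P[i, j] is the message from i to j
H = np.zeros((n, n))  # message potentials

for _ in range(200):  # synchronous message updates, iterated to convergence
    P_new, H_new = np.zeros((n, n)), np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Combine node i's own potential with messages from everyone but j.
            others = [k for k in range(n) if k not in (i, j)]
            prec = J[i, i] + sum(P[k, i] for k in others)
            pot = h[i] + sum(H[k, i] for k in others)
            P_new[i, j] = -J[i, j] ** 2 / prec
            H_new[i, j] = -J[i, j] * pot / prec
    P, H = P_new, H_new

bp_prec = np.array([J[i, i] + P[:, i].sum() for i in range(n)])
bp_pot = np.array([h[i] + H[:, i].sum() for i in range(n)])

print("means,     BP vs exact:", bp_pot / bp_prec, true_mean)  # these agree
print("variances, BP vs exact:", 1 / bp_prec, true_var)        # BP is too small
```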
So, maybe.
But loopy belief propagation is just one approximate inference method for bayes nets, and it makes sense that evolution would fine-tune the brain’s inference to perform quite well at perceptual tasks. This could include adjustments to account for the predictable biases of loopy belief propagation, e.g. artificially decreasing confidence to bring it closer to what it should be.
My point isn’t that you’re outright wrong about this one; it just seems like it’s not a strong prediction of the model.
I had understood (via one-sentence summary, so lossy in the extreme) that this was approximately how motor control worked. Is this a wrong understanding? If not, what separates the motor control mechanism from the perception mechanism?