abramdemski comments on Review: LessWrong Best of 2018 – Epistemology

abramdemski 30 Dec 2020 21:46 UTC
3 points
PP is not one thing. This makes it very difficult for me to say what I don’t like about it, since no one element seems to be necessarily present in all the different versions. What follows are some remarks about specific ideas I’ve seen associated with PP, many of them contradictory. Do let me know which ideas you endorse / don’t endorse.
It is also possible that each of my points is based on a particular misconception about PP. While I’ve made some effort to be well-informed about PP, I have not spent so much time on it, so my understanding is definitely shallow.
The three main meanings of PP (each of which is a cluster, containing many many different sub-meanings, as you flesh out the details in different ways):
- A theory of perception. If you look PP up on Wikipedia, the term primarily refers to a theory of perceptual processing in which prediction plays a central role, and observations interact with predictions to provide a feedback signal for learning. So, the theory is that perception is fundamentally about minimizing prediction error. I basically believe this theory. So let’s set it aside.
- A theory of action. Some people took the idea “the brain minimizes prediction error” and tried to apply it to motor control, too—and to everything else in the brain. I think this kind of made sense as a thing to try (unifying these two things is a worthwhile goal!), but doesn’t go anywhere. I’ll have a lot to say about this. This theory is what I’ll mean when I say PP—it is, in my experience, what rationalists and rationalist-adjacent people primarily mean by “PP”.
- A theory of everything. Friston’s free-energy principle. This is not only supposed to apply to the human brain, but also evolution, and essentially any physical system. I have it on good authority that the math in Friston’s papers is full of errors, and no one who has been excited about this (that I’ve seen) has also claimed to understand it.
1. You have 3 ways of avoiding prediction error: updating your models, changing your perception, acting on the world. Those are always in play and you often do all three in some combination (see my model of confirmation bias in action).
The PP theory of perception says that the brain “minimizes prediction error” in the sense that it is always engaged in the business of predicting, and compares the predictions to observations in order to generate feedback. This could be like gradient descent, or like Bayesian updates.
Actively planning to minimize prediction error, or learning policies which minimize prediction error, is a totally different thing which requires different mechanisms.
Consider that minimizing prediction error in the sense required for prediction means making each individual prediction as accurate as possible, which means, being totally myopic. An error on a specific prediction means making an adjustment on that specific prediction. The credit assignment problem is easily solved; we know exactly what led to that specific prediction, so we can propagate all the relevant errors and make the necessary adjustments.
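A minimal sketch of the myopic case, assuming a single scalar prediction trained by gradient descent on squared error (the function name and learning rate are hypothetical, chosen only for illustration). Each error adjusts exactly the prediction that produced it, so there is nothing to untangle:

```python
def update_prediction(prediction, observation, learning_rate=0.1):
    """One gradient step on 0.5 * (observation - prediction) ** 2."""
    error = observation - prediction
    # The error belongs to this prediction alone, so credit assignment is trivial.
    return prediction + learning_rate * error

prediction = 0.0
for observation in [1.0] * 50:
    prediction = update_prediction(prediction, observation)

print(prediction)  # approaches the repeatedly observed value
```

Nothing in this loop looks ahead: it never asks which earlier output caused a later error, which is the question planning and policy learning have to answer.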
On the other hand, with planning and policy learning, there is a nontrivial (indeed, severe) credit assignment problem. We don’t know which outputs lead to which error signals later. Therefore, we need an entirely different learning mechanism. Indeed, as I argued in The Credit Assignment Problem, we basically need a world model in order to assign credit. This makes it very hard to unify the theory of perception with the theory of action, because one needs the other as input!
In any case, why do you want to suppose that humans take actions in a way which minimizes prediction error? I think this is a poor model. There’s the standard “dark room problem” objection: if humans wanted to minimize prediction error, they would like sensory deprivation chambers a whole lot more than they seem to. Instead, humans like to turn on the radio, watch TV, read a book, etc. when they don’t have anything else to do. Simply put, we are curious creatures, who do not like being bored. Yes, we also don’t like too much excitement of the wrong kind, but we are closer to infophilic than infophobic! And this makes sense from an evolutionary perspective. Machine learning has found that reinforcement learning agents do better when you have a basic mechanism to encourage exploration, because it’s easy to under-explore, but hard to properly explore. One way to do this is to actively reinforce prediction error; IE, the agents are actually maximizing prediction error! (as one component of more complicated values, perhaps.)
I’ve seen PP blog posts take this in stride, explaining that it’s important to explore in order to get better at doing things so that you can minimize prediction error later. I’ve seen technical derivations of a “curiosity drive” on this premise. And sure, that’s technically true. But that doesn’t change that you’re postulating a drive which discourages exploration, all things considered, when it’s more probable (based on parallels with RL) that evolution would add a drive to explicitly encourage exploration.
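The exploration mechanism mentioned above, reinforcing prediction error, can be sketched roughly as follows. This is an illustration under stated assumptions, not any particular published algorithm: the function names, the squared-error measure, and `bonus_weight` are all hypothetical.

```python
def prediction_error(predicted, observed):
    """Squared error between a predicted and an observed feature vector."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))

def shaped_reward(extrinsic, predicted, observed, bonus_weight=0.5):
    """Task reward plus a bonus that grows with prediction error."""
    return extrinsic + bonus_weight * prediction_error(predicted, observed)

# Same task reward in both cases; only the surprise differs.
familiar = shaped_reward(0.0, predicted=[1.0, 0.0], observed=[1.0, 0.0])
surprising = shaped_reward(0.0, predicted=[1.0, 0.0], observed=[0.0, 1.0])

print(familiar, surprising)
```

The sign of the bonus term is the whole point: an agent trained on this reward is paid for being surprised, the opposite of a drive to minimize prediction error.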
Perhaps this is part of why one of the most common PP formalisms doesn’t actually propose to minimize prediction error in either of the two above senses (IE, correcting predictions via feedback, or taking actions which make future prediction error less).
The primary theoretical tool by which PP seeks to explain action is active inference. According to this method, we can select actions by first conditioning on our success, and then sampling actions from that distribution. I sometimes see this justified as a practical way to leverage inference machinery to make decisions. We can judge that on its pragmatic merits. (I think it’s not common to use it purely to get the job done—techniques such as reinforcement learning mostly work better.) Other times, I’ve heard it associated with the idea that people can’t conceive of failure (particularly true failure of core values), or with other forms of wishful thinking.
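In symbols, the selection rule just described can be sketched as follows. The notation is an assumption made here for illustration, not taken from any particular PP paper: $P(a)$ is the prior over actions (or plans) and $S$ is the success event. Planning by inference samples

$$
a \sim P(a \mid S) = \frac{P(S \mid a)\,P(a)}{\sum_{a'} P(S \mid a')\,P(a')}
$$

whereas standard Bayesian decision theory, with utility 1 for success and 0 otherwise, picks

$$
a^{*} = \arg\max_{a} P(S \mid a).
$$

Both rules need the same realistic model $P(S \mid a)$, including every way of failing. The only structural difference is that the prior $P(a)$ appears in the first expression and not in the second.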
My first complaint is that this is usually not different enough from standard Bayesian decision theory to account for the biases it purports to predict. For example, to plan to avoid death, you have to start with a realistic world-model which includes all the ways you could die, and then condition on not dying, and then sample actions from that.
In what sense are you “incapable of conceiving of death” if your computations manage to successfully identify potential causes of death and create plans which avoid them?
In what sense are you engaging in wishful thinking, if your planning algorithms work?
One might say: “The psychological claim of wishful thinking isn’t that humans fail to take disaster into account when they plan; the claim is, rather, that humans plan while inhabiting a psychological perspective in which they can’t fail. This lines up with the idea of sampling from the probability distribution in which failure isn’t an option.”
But this is too extreme. It’s true, when I idly muse about the future, I have a tendency to exclude my own death from it. Yet, I have a visceral fear of heights. When I am near the edge of a cliff, I feel like I am going to fall off and die. This image loops repeatedly even though it has never happened to me and my probability of taking a few steps forward and falling is very low. (It’s a fascinating experience: I often stand near ledges on purpose to experience the strong, visceral, unshakable belief that I’m about to fall, which fails to update on all evidence to the contrary.) If I were simply cognizing in the probability distribution which excludes death, I would avoid ledges and cliffs without thinking explicitly about the negative consequences.
And humans are quite capable of explicitly discussing the possibility of death, too.
My second issue with planning by inference is that it also introduces new biases—strange, inhuman biases.
In particular, a planning-by-inference agent cannot conceive of novel, complicated plans which achieve its goals. This is because updating on success doesn’t shift you from your prior as much as it should.
Suppose there is a narrow walkway across an abyss. You are a video game character: you have four directions you can walk (N, S, E, W) at any time. To get across the walkway, you have to go N thirty times in a row.
There are two ways to achieve success: you can open the chest next to you, which achieves success 10% of the time, and otherwise, results in the walkway disappearing. Or, you can cross the walkway, and open the chest on the other side. This results in success 100% of the time. You know all of this.
Bayesian decision theory would recommend crossing the walkway.
Planning by inference will almost always open the nearby chest instead.
To see why, remember that we update on our prior. Since we don’t already know the optimal plan, our prior on actions is an even distribution between N, S, E and W at all time-steps. This means crossing the walkway has a prior probability of $4^{-30}$, approximately $10^{-18}$. Updating this prior on success, we find that it’s far more probable that we’ll succeed by opening the nearby chest.
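The arithmetic can be checked with a small sketch. Two simplifications are assumptions made here, not part of the original setup: opening the nearby chest is treated as a single action with prior 1/4, and only the two successful plans are compared (every other action sequence is taken to have zero chance of success, so it drops out after conditioning). The names `PLANS`, `posterior_given_success`, and `best_by_expected_utility` are hypothetical.

```python
from fractions import Fraction

# plan name -> (prior probability under a uniform action prior,
#               probability of success if the plan is carried out)
PLANS = {
    "open_nearby_chest": (Fraction(1, 4), Fraction(1, 10)),
    "cross_walkway": (Fraction(1, 4) ** 30, Fraction(1, 1)),  # N thirty times
}

def posterior_given_success(plans):
    """Planning by inference: P(plan | success) via Bayes' rule."""
    joint = {name: prior * p_success for name, (prior, p_success) in plans.items()}
    total = sum(joint.values())
    return {name: weight / total for name, weight in joint.items()}

def best_by_expected_utility(plans):
    """Bayesian decision theory: maximize P(success | plan); the prior is ignored."""
    return max(plans, key=lambda name: plans[name][1])

posterior = posterior_given_success(PLANS)
for name, probability in posterior.items():
    print(name, float(probability))
print("expected-utility choice:", best_by_expected_utility(PLANS))
```

Under these assumptions the posterior weight on crossing is smaller than $10^{-16}$, so an agent sampling from it essentially never crosses, while the expected-utility rule selects the crossing plan outright.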
Technical aside—the sense in which planning by inference minimizes prediction error is: it minimizes KL divergence between its action distribution and the distribution conditioned on success. (This is just a fancy way of saying you’re doing your best to match those probabilities.) It’s important to keep in mind that this is vaaastly different from actively planning to avoid prediction error. There is no “dark room problem” here. Indeed, planning-by-inference encourages exploration, rather than suppressing it—perhaps to the point of over-exploring (because planning-by-inference agents continue to use sub-optimal plans with frequency proportional to their probability of success, long after they’ve fully explored the possibilities).
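A small numerical sketch of both halves of this aside, under assumptions chosen here for illustration: two plans with known success rates (so exploration is already complete), a uniform prior, and KL divergence taken in the direction KL(q || p) in nats. The plan names and numbers are hypothetical.

```python
import math

def condition_on_success(prior, p_success):
    """P(plan | success) from a prior over plans and per-plan success rates."""
    joint = {a: prior[a] * p_success[a] for a in prior}
    total = sum(joint.values())
    return {a: w / total for a, w in joint.items()}

def kl_divergence(q, p):
    """KL(q || p) in nats; terms with q(a) == 0 contribute nothing."""
    return sum(q[a] * math.log(q[a] / p[a]) for a in q if q[a] > 0)

prior = {"good_plan": 0.5, "mediocre_plan": 0.5}
p_success = {"good_plan": 0.9, "mediocre_plan": 0.6}

target = condition_on_success(prior, p_success)     # what planning by inference matches
greedy = {"good_plan": 1.0, "mediocre_plan": 0.0}   # what a maximizer would do

inference_success_rate = sum(target[a] * p_success[a] for a in target)
greedy_success_rate = sum(greedy[a] * p_success[a] for a in greedy)

print(target, kl_divergence(target, target), kl_divergence(greedy, target))
print(inference_success_rate, greedy_success_rate)
```

Matching the conditioned distribution drives the divergence to zero, but it keeps the mediocre plan in use about 40% of the time even though nothing remains to be learned, which lowers the overall success rate relative to always choosing the better plan.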
2. Action is key, and it shapes and is shaped by perception. The map you build of any territory is prioritized and driven by the things you can act on most effectively. You don’t just learn “what is out there” but “what can I do with it”.
How are you comparing standard bayesian thinking with PP, such that PP comes out ahead in this respect?
- Standard bayesian learning theory does just fine learning about acting, along with learning about everything else.
- Standard bayesian decision theory offers a theory of acting based on that information.
- Granted, standard bayesian theory has the agent learning about everything, regardless of its usefulness, rather than learning specifically those things which help it act. This is because standard Bayesian theory assumes sufficient processing power to fully update beliefs. However, I am unaware of any PP theory which improves on this state of affairs. Free-energy-minimization models can help deal with limited processing power by variational bayesian inference, but this minimizes the error of all beliefs, rather than providing a tool to specifically focus on those beliefs which will be useful for action (again, to my knowledge). Practical bayesian inference has some tools for focusing inference on the most useful parts, but I have never seen those tools especially associated with PP theory.
3. You care about prediction over the lifetime scale, so there’s an explore/exploit tradeoff between potentially acquiring better models and sticking with the old ones.
I’ve already mentioned some ways in which I think the PP treatment of explore/exploit is not a particularly good one. I think machine learning research has generated much better tools.
4. Prediction goes from the abstract to the detailed. You perceive specifics in a way that aligns with your general model, rarely in contradiction.
5. Updating always goes from the detailed to the abstract. It explains Kuhn’s paradigm shifts but for everything — you don’t change your general theory and then update the details, you accumulate error in the details and then the general theory switches all at once to slot them into place.
6. In general, your underlying models are a distribution but perception is always unified, whatever your leading model is. So when perception changes it does so abruptly.
This is the perceptual part of PP theory, which I have few issues with.
7. Attention is driven in a Bayesian way, to the places that are most likely to confirm/disconfirm your leading hypothesis, balancing the accuracy of perceiving the attended detail correctly and the leverage of that detail to your overall picture.
This is one part of perceptual PP which I do have an issue with. I have often read PP accounts of attention with some puzzlement.
PP essentially models perception as one big bayesian network with observations at the bottom and very abstract ideas at the top—which, fair enough. Attention is then modeled as a process which focuses inference on those parts of the network experiencing the most discordance between the top-down predictions and the bottom-up observations. This algorithm makes a lot of sense: there are similar algorithms in machine learning, for focusing belief propagation on the points where it is currently most needed, in order to efficiently propagate large changes across the network before we do any fine-tuning by propagating smaller, less-likely-to-be-important changes. (Why would the brain, a big parallel machine, need such an optimization? Why not propagate all the messages at once, in parallel? Because, biologically, we want to conserve resources. Areas of the brain which are doing more thinking actively consume more oxygen from the blood. Thinking hard is exhausting because it literally takes more energy.)
So far so good.
The problem is, this does not explain conscious experience of attention. I think people are conflating this kind of processing prioritization with conscious experience. They see this nice math of “surprise” in bayesian networks (IE, discordance between bottom-up and top-down messages), and without realizing it, they form a mental image of a homunculus sitting outside the bayesian network and looking at the more surprising regions. (Because this reflects their internal experience pretty well.)
So, how can we get a similar picture without the homunculus?
One theory is that conscious experience is a global workspace which many areas in the brain have fast access to, for the purpose of quickly propagating information that is important to a lot of processes in the brain. I think this theory is a pretty good one. But this is very different from the bayes-net-propagation-prioritization picture. This LW post discusses the discordance.
This isn’t so much a strike against the PP picture of attention (it seems quite possible something like the PP mechanism is present), as a statement that there’s also something else going on—another distinct attention mechanism, which isn’t best understood in PP terms. Maybe which isn’t best understood in terms of a big bayes net, either, since it doesn’t really make sense for a big bayes net to have a global workspace.
If we imagine that the neocortex is more or less a big bayes net (with cortical columns as nodes), and the rest of the brain is (among other things, perhaps) an RL agent which utilizes the neocortex as its model, then this secondary attention mechanism is like a filter which determines which information gets from the neocortex to the RL agent. It can, of course, use the PP notion of attention as a strong heuristic determining how to filter information. I don’t think this necessarily captures everything that’s going on, but it is, in my opinion, better than the pure PP model.
I don’t want to get mired down in discussing the details of predictive processing (least of all, the details of Friston’s free energy). Feel welcome to express any specific points you have, by all means. (I’d love a point by point response!!) But what I would really like to know is why you are interested in predictive processing in the first place. All the potential reasons I see seem to be based on empty promises. Yet, PP fans seem to think the ideas will eventually bear fruit. What heuristic is behind this positive expectation? Why are the ideas so promising? What’s so exciting about what you’ve seen? What are the deep generators?
- Jacob Falkovich 1 Jan 2021 3:10 UTC
  2 points
  Parent
  There’s a whole lot to respond to here, and it may take the length of Surfing Uncertainty to do so. I’ll point instead to one key dimension.
  
  You’re discussing PP as a possible model for AI, whereas I posit PP as a model for animal brains. The main difference is that animal brains are evolved and occur inside bodies.
  Evolution is the answer to the dark room problem. You come with prebuilt hardware that is adapted to a certain adaptive niche, which is equivalent to modeling it. Your legs are a model of the shape of the ground and the size of your evolutionary territory. Your color vision is a model of berries in a bush, and your fingers that pick them. Your evolved body is a hyperprior you can’t update away. In a sense, you’re predicting all the things that are adaptive: being full of good food, in the company of allies and mates, being vigorous and healthy, learning new things. Lying hungry in a dark room creates a persistent error in your highest-order predictive models (the evolved ones) that you can’t change.
  Your evolved prior supposes that you have a body, and that the way you persist over time is by using that body. You are not a disembodied agent learning things for fun or getting scored on some limited test of prediction or matching. Everything your brain does is oriented towards acting on the world effectively.
  You can see that perception and action rely on the same mechanism in many ways, starting with the simple fact that when you look at something you don’t receive a static picture, but rather constantly saccade and shift your eyes, contract and expand your pupil and cornea, move your head around, and also automatically compensate for all of this motion. None of this is relevant to an AI who processes images fed to it “out of the void”, and whose main objective function is something other than maintaining homeostasis of a living, moving body.
  
  Zooming out, Friston’s core idea is a direct consequence of thermodynamics: for any system (like an organism) to persist in a state of low entropy (e.g. 98°F) in an environment that is higher entropy but contains some exploitable order (e.g. calories aren’t uniformly spread in the universe but concentrated in bananas), it must exploit this order. Exploiting it is equivalent to minimizing surprise, since if you’re surprised there is some pattern of the world that you failed to make use of (free energy).
  Now just apply this basic principle to your genes persisting over an evolutionary time scale and your body persisting over the time scale of decades, and this sets the stage for PP applied to animals.
  For more, here’s a conversation between Clark, Friston, and an information theorist about the Dark Room problem.
  - abramdemski 2 Jan 2021 3:09 UTC
    3 points
    Parent
    Zooming out, Friston’s core idea is a direct consequence of thermodynamics: for any system (like an organism) to persist in a state of low entropy (e.g. 98°F) in an environment that is higher entropy but contains some exploitable order (e.g. calories aren’t uniformly spread in the universe but concentrated in bananas), it must exploit this order. Exploiting it is equivalent to minimizing surprise, since if you’re surprised there is some pattern of the world that you failed to make use of (free energy).
    I haven’t yet understood the mathematical details of Friston’s arguments. I’ve been told that some of them are flawed. But it’s plausible to me that the particular mathematical argument you’re pointing at here is OK. However, I doubt the conclusion of that argument would especially convince me that the brain is set up with the particular sort of architecture described by PP. This, it seems to me, gets into the domain of PP as a theoretical model of ideal agency as opposed to a specific neurological hypothesis.
    Humans did not perfectly inherit the abstract goals which would have been most evolutionarily beneficial. We are not fitness-maximizers. Similarly, even if all intelligent beings need to avoid entropy in order to keep living, that does not establish that we are entropy-minimizers at the core of our motivation system. As per my sibling comment, that’s like looking at a market economy and concluding that everyone is a money-maximizer. It’s not a necessary supposition, because we can also explain everyone’s money-seeking behavior by pointing out that money is very useful.
  - abramdemski 2 Jan 2021 2:53 UTC
    3 points
    Parent
    You can see that perception and action rely on the same mechanism in many ways, starting with the simple fact that when you look at something you don’t receive a static picture, but rather constantly saccade and shift your eyes, contract and expand your pupil and cornea, move your head around, and also automatically compensate for all of this motion.
    How does this suggest that perception and action rely on the same mechanism, as opposed to merely being very intertwined? I would certainly agree that motor control in vision has tight feedback loops with vision itself. What I don’t believe is that we should model this as acting so as to minimize prediction loss. For one thing, I’ve read that a pretty good model of saccade movement patterns is that we look at the most surprising parts of the image, which would be better-modeled by moving eyes so as to maximize predictive loss.
    Babies look longer at objects which they find surprising, as opposed to those which they recognize.
    It’s true that PP can predict some behaviors like this, because you’d do this in order to learn, so that you minimize future prediction error. But that doesn’t mean PP is helping us predict those eye movements.
    In a world dependent on money, a money-minimizing person might still have to obtain and use money in order to survive and get to a point where they can successfully do without money. That doesn’t mean we can look at money-seeking behavior and conclude that a person is a money-minimizer. More likely that they’re a money-maximizer. But they could be any number of things, because in this world, you have to deal with money in a broad variety of circumstances.
    Let me briefly sketch an anti-PP theory. According to what you’ve said so far, I understand you as saying that we act in a way which minimizes prediction error, but according to a warped prior which doesn’t just try to model reality statistically accurately, but rather, increases the probability of things like food, sex, etc in accordance with their importance (to evolutionary fitness). This causes us to seek those things.
    My anti-PP theory is this: we act in a way which maximizes prediction error, but according to a warped prior which doesn’t just model reality statistically accurately, but rather, decreases the probability of things like food, sex, etc in accordance with their importance. This causes us to seek those things.
    I don’t particularly believe anti-PP, but I find it to be more plausible than PP. It fits human behavior better. It fits eye saccades better. (The eye hits surprising parts of the image, plus sexually significant parts of the image. It stands to reason that sexually significant images are artificially “surprising” to our visual system, making them more interesting.) It fits curiosity and play behavior better.
    By the way, I’m actually much more amenable to the version of PP in Kaj Sotala’s post on craving, where warping epistemics by forcing belief in success is just one motivation among several in the brain. I do think something similar to that seems to happen, although my explanation for it is much different (see my earlier comment). I just don’t buy that this is the basic action mechanism of the brain, governing all our behavior, since it seems like a large swath of our behavior is basically the opposite of what you’d expect under this hypothesis. Yes, these predictions can always be fixed by sufficiently modifying the prior, forcing the “pursuing minimal prediction error” hypothesis to line up with the data we see. However, because humans are curious creatures who look at surprising things, engage in experimental play, and like to explore, you’re going to have to take a sensible probability distribution and just about reverse the probabilities to explain those observations. At that point, you might as well switch to anti-PP theory.
  - abramdemski 2 Jan 2021 0:46 UTC
    3 points
    Parent
    You’re discussing PP as a possible model for AI, whereas I posit PP as a model for animal brains. The main difference is that animal brains are evolved and occur inside bodies.
    So, for your project of re-writing rationality in PP, would PP constitute a model of human irrationality, and how to rectify it, in contrast to ideal rationality (which would not be well-described by PP)?
    Or would you employ PP both as a model which explains human irrationality and as an ideal rationality notion, so that we can use it both as the framework in which we describe irrationality and as the framework in which we can understand what better rationality would be?
    Evolution is the answer to the dark room problem. You come with prebuilt hardware that is adapted to a certain adaptive niche, which is equivalent to modeling it. Your legs are a model of the shape of the ground and the size of your evolutionary territory. Your color vision is a model of berries in a bush, and your fingers that pick them. Your evolved body is a hyperprior you can’t update away. In a sense, you’re predicting all the things that are adaptive: being full of good food, in the company of allies and mates, being vigorous and healthy, learning new things. Lying hungry in a dark room creates a persistent error in your highest-order predictive models (the evolved ones) that you can’t change.
    Am I right in inferring from this that your preferred version of PP is one where we explicitly plan to minimize prediction error, as opposed to the Active Inference model (which instead minimizes KL divergence)? Or do you endorse an Active Inference type model?
    This explanation in terms of evolution makes the PP theory consistent with observations, but does not give me a reason to believe PP. The added complexity to the prior is similar to the added complexity of other kinds of machinery to implement drives, so as yet I see no reason to prefer this explanation to other possible explanations of what’s going on in the brain.
    My remarks about problems with different versions of PP can each be patched in various ways; these are not supposed to be “gotcha” arguments in the sense of “PP can’t explain this! / PP can’t deal with this!”. Rather, I’m trying to boggle at why PP looks promising in the first place, as a hypothesis to raise to our attention.
    Each of the arguments I mentioned is about one way someone might think PP is doing some work for us, and why I don’t see that as a promising avenue.
    So I remain curious what the generators of your view are.