I now work in a lab allied to both the Friston branch of neuroscience and the probabilistic-modeling branch of computational cognitive science, so I now feel arrogant enough to comment fluently.
I’m gonna leave a bunch of comments over the day as I get the spare time to actually respond coherently to stuff.
The first thing is that we have to situate Friston’s work in its proper context, namely Marr’s Three Levels of cognitive analysis: computational (what’s the target?), algorithmic (how do we want to hit it?), and implementational (how do we make neural hardware do it?).
Friston’s work largely takes place at the algorithmic and implementational levels. He’s answering How questions, and then claiming that they answer the What questions. This is rather like unto, as often mentioned, formulating Hamiltonian Mechanics and saying, “I’ve solved physics by pointing out that you can write any physical system in terms of differential equations for its conserved quantities.” Well, now you have to actually write out a real physical system in those terms, don’t you? What you’ve invented is a rigorous language for talking about the things you aim to explain.
The free-energy principle should be thought of like the “supervised loss principle”: it just specifies what computational proxy you’re using for your real goal. It’s as rigorous as using probabilistic programming to model the mind (caveat: one of my advisers is a probabilistic programming expert).
Now, my seminar is about to start, so I’ll try to type up a really short step-by-step of how we get to active inference. Let’s take the example where I want to eat my nice slice of pizza, and I’ll try to type something up about goals/motivations later on. Suffice it to say, since “free-energy minimization” is like “supervised loss minimization” or “reward maximization”, it’s meaningless to say that motivation is specified in free-energy terms. Of course it can be: that’s a mathematical tautology. Any bounded utility/reward/cost function can be expressed as a probability, and therefore as a free-energy — this is the Complete Class Theorem Friston always cites, and you can make it constructive using the Boltzmann Distribution (the simplest exponential family) for energy functions.
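To spell that tautology out with a minimal sketch (my notation; β is just an assumed inverse-temperature constant): any bounded reward r(x) induces a “goal prior” through the Boltzmann distribution, and the reward-maximizing state is exactly the mode of that prior:

$$ P_{\text{goal}}(x) \;=\; \frac{\exp\!\big(\beta\, r(x)\big)}{\sum_{x'} \exp\!\big(\beta\, r(x')\big)}, \qquad \arg\max_x r(x) \;=\; \arg\max_x P_{\text{goal}}(x). $$

So “I can write your reward as a prior” is true, but it tells you nothing about which reward (or prior) the brain actually uses.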
1) Firstly, free-energy is just the negative of the Evidence Lower Bound (ELBO) usually maximized in variational inference. You take a P (a model of the world whose posterior you want to approximate), and a Q (a model that approximates it), and you optimize the variational parameters (the parameters with no priors or conditional densities) of Q by maximizing the ELBO, to get a good approximation to P(H|D) (probability of hypotheses, given data). This is normal and understandable and those of us who aren’t Friston do it all the time.
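For concreteness, here is that identity written out (textbook variational inference, nothing Friston-specific):

$$ F[Q] \;=\; \mathbb{E}_{Q(H)}\!\big[\log Q(H) - \log P(H, D)\big] \;=\; \mathrm{KL}\!\big(Q(H)\,\|\,P(H \mid D)\big) \;-\; \log P(D) \;=\; -\,\mathrm{ELBO}[Q]. $$

Since the KL term is non-negative and log P(D) doesn’t depend on Q, minimizing free-energy (maximizing the ELBO) pushes Q(H) toward the true posterior P(H|D).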
2) Now you add some variables to P: the body’s proprioceptive states, its sense of where your bones are and what your muscles are doing. You add a P(D′ = bones and muscles), with some conditional P(D|D′) to show how other senses depend on body position. This is already really helpful for pure prediction, because it helps you factor out random noise or physical forces acting on your body from your sensory predictions to arrive at a coherent picture of the world outside your body. You now have P(D|D′)P(D′|H).
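Written out as a joint (with some prior P(H) over hidden causes), the model at this step is:

$$ P(D, D', H) \;=\; P(D \mid D')\, P(D' \mid H)\, P(H), $$

i.e. as written, the other senses D depend on the hidden world-state H only through the body state D′; that conditional independence is the simplifying assumption baked into the factorization.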
3) Since we have new variables in the posterior, P(H|D′,D), we now need some new variables in Q. Here’s where we get the interesting insight of active inference: if the old P(H | sensory D) was approximated as Q(stuff H; sensory D), we can now expand to Q(stuff H; sensory D, motor M). Instead of inferring a parameter that approximates the proprioceptive state, we infer a parameter that can “compromise” with it: the actual body moves to accommodate M as much as possible, while M also adjusts itself to kinda suit what the body actually did.
Here’s the part where I’m really simplifying what stuff does, to give more of a planning-as-inference explanation than “pure” active inference. I could talk about “pure” active inference, but it’s too fucking complicated and badly written to yield a useful intuition. Friston’s “pure” active inference papers often give models that would have very different empirical content from each other, but which all get optimized using variational inference, so he kinda pretends they’re all the same. Unfortunately, this is something most people in neuroscience or cognitive science do to simplify models enough to fit one experiment well, instead of having to invent a cognitive architecture that might fit all experiments badly.
4) So now, if I set a goal by clamping some variables in P (goal stuff H = pizza), or by imposing “goal” priors on them (clamping them to within some range of values, with noise), I can’t really just optimize Q(stuff H) to fit the new clamped model. Q(stuff H) is really Q(stuff H; sensory D, motor M), and Q(sensory D) has to approximate P(sensory D). Instead, I can only optimize Q(motor M | goal stuff H = pizza) to fit P(body-position D′ | goal stuff H = pizza). Actually doing so reaches a “Bayes-optimal” compromise between my current bodily state and really moving. Once Q already carries a good dynamical model of how my body and senses move through time (trajectories), changing M as a function of time lets me move as I please, even though my actual movements may be noisy with respect to my motor commands.
That’s really all “active inference” is: variational inference with body position as a generative parameter, and motor commands as the variational parameter approximating it. You set motor commands to get the body position you want, then body position changes noisily based on motor commands. This keeps getting done until the ELBO is maximized/free-energy minimized, and now I’m eating the pizza (as a process over time).
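Here’s a toy sketch of that loop, in the planning-as-inference spirit above. Everything in it is mine and purely illustrative (the 1-D Gaussian setup, the noise scales, the step size, and collapsing the hidden causes H into a clamped goal prior directly on body position D′); it’s not Friston’s implementation, just the bare structure: the world nudges the body toward the motor command, and the motor command does gradient descent on the free-energy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy generative model P (1-D Gaussians; my construction, not Friston's).
# To keep it one-dimensional, the hidden causes H are collapsed into a
# clamped "goal" prior directly on body position D':
#   goal prior:      D' ~ Normal(goal, sigma_goal)        ("hand at the pizza")
#   proprioception:  body ~ Normal(D', sigma_body)        (observed)
#   other senses:    sense ~ Normal(body, sigma_sense)    (observed)
goal, sigma_goal, sigma_body, sigma_sense = 1.0, 0.1, 0.2, 0.1

def free_energy(M, body, sense):
    """Point-estimate stand-in for the negative ELBO: the negative log joint,
    with the motor command M playing the role of the inferred body position."""
    return (0.5 * (M - goal) ** 2 / sigma_goal ** 2          # goal-prior term
            + 0.5 * (body - M) ** 2 / sigma_body ** 2        # proprioception term
            + 0.5 * (sense - body) ** 2 / sigma_sense ** 2)  # other-senses term

M = 0.0      # motor command: the variational parameter I actually get to set
body = 0.0   # actual body position D', which I do NOT get to set directly
step = 0.01  # gradient step size (assumed, chosen for stability)

for t in range(50):
    # The "P side" / the world: the body drifts toward the motor command, noisily.
    body += 0.5 * (M - body) + rng.normal(0.0, 0.02)
    sense = body + rng.normal(0.0, 0.05)

    # The "Q side": nudge M down the free-energy gradient, i.e. compromise
    # between the clamped goal and where the body actually is right now.
    grad = (M - goal) / sigma_goal ** 2 + (M - body) / sigma_body ** 2
    M -= step * grad

print(f"motor={M:.3f}  body={body:.3f}  F={free_energy(M, body, sense):.3f}")
```

The fixed point is the “compromise”: M settles where the pull of the goal prior and the pull of the actual proprioceptive state balance, and since the body keeps drifting toward M, both end up at the goal.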
(point 2) Why P(D|D′)P(D′|H) and not P(D|D′,H)P(D′|H)?