(Posting here rather than SSC because I wrote the whole comment in markdown before remembering that SSC doesn’t support it).
We had a guest lecture from Friston last year and I cornered him afterwards to try to get some enlightenment (notes here). I also spent the next few days working through the literature, using a multi-armed bandit as a concrete problem (notes here).
Very few of the papers have concrete examples. Those that do often skip important parts of the math and use inconsistent/ambiguous notation. He doesn’t seem to have released any of the code for his game-playing examples.
The various papers don’t all even implement the same model—the free energy principle seems to be more a design principle than a specific model.
The Wikipedia page doesn’t explain much but at least uses consistent and reasonable notation.
“Reinforcement learning or active inference” has most of a worked model, and is the closest I’ve found to explaining how utility functions get encoded into meta-priors. It also contains:
When friends and colleagues first come across this conclusion, they invariably respond with; “but that means I should just close my eyes or head for a dark room and stay there”. In one sense this is absolutely right; and is a nice description of going to bed. However, this can only be sustained for a limited amount of time, because the world does not support, in the language of dynamical systems, stable fixed-point attractors. At some point you will experience surprising states (e.g., dehydration or hypoglycaemia). More formally, itinerant dynamics in the environment preclude simple solutions to avoiding surprise; the best one can do is to minimise surprise in the face of stochastic and chaotic sensory perturbations. In short, a necessary condition for an agent to exist is that it adopts a policy that minimizes surprise.
I am leaning towards ‘the emperor has no clothes’. In support of this:
Friston doesn’t explain things well, but nobody else seems to have produced an accessible worked example either, even though many people claim to understand the theory and think that is important.
Nobody seems to have used this to solve any novel problems, or even to solve well-understood trivial problems.
I can’t find any good mappings/comparisons to existing models. Are there priors that cannot be represented as utility functions, or vice versa? What explore/exploit tradeoffs do free-energy models lead to, or can they encode any given tradeoff?
At this point I’m unwilling to invest any further effort into the area, but I could be re-interested if someone were to produce a Python notebook or similar with a working solution for some standard problem (e.g. the multi-armed bandit).
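To be concrete about the kind of thing I’d want to see, here is a rough sketch of a Bernoulli bandit agent that scores arms with an “expected free energy”-flavoured quantity (a pragmatic term under a preference prior, minus a crude epistemic bonus). This is my own toy construction, not code from any of the papers, and the preference values, bonus term, and arm probabilities are arbitrary choices:

```python
import numpy as np

# A toy Bernoulli bandit, scored with an "expected free energy"-flavoured quantity.
# All specifics (preference values, the epistemic bonus, arm probabilities) are
# arbitrary illustration choices, not taken from any of Friston's papers.

rng = np.random.default_rng(0)
true_probs = np.array([0.3, 0.5, 0.7])   # hidden reward probabilities of the arms

alpha = np.ones(len(true_probs))         # Beta posterior over each arm's reward rate
beta = np.ones(len(true_probs))

log_pref = {1: np.log(0.99), 0: np.log(0.01)}   # "goal prior": I prefer/expect to observe reward

def expected_free_energy(a):
    p = alpha[a] / (alpha[a] + beta[a])           # posterior mean reward probability
    # Pragmatic term: expected surprise of the outcome under the preference prior.
    pragmatic = -(p * log_pref[1] + (1 - p) * log_pref[0])
    # Epistemic term: a crude information-gain proxy (posterior variance of the arm).
    epistemic = p * (1 - p) / (alpha[a] + beta[a] + 1)
    return pragmatic - epistemic

for t in range(2000):
    a = int(np.argmin([expected_free_energy(i) for i in range(len(true_probs))]))
    reward = rng.random() < true_probs[a]
    alpha[a] += reward                            # conjugate Beta-Bernoulli update
    beta[a] += 1 - reward

print("posterior mean per arm:", alpha / (alpha + beta))
```

Even something this small, with the two terms actually derived from a stated generative model rather than hand-picked, would go a long way.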
>The various papers don’t all even implement the same model—the free energy principle seems to be more a design principle than a specific model.
Bingo. Friston trained as a physicist, and he wants the free-energy principle to be more like a physical law than a computer program. You can write basically any computer program that implements or supports variational inference, throw in some action states as variational parameters, and you’ve “implemented” the free-energy principle _in some way_.
Overall, the Principle is more of a domain-specific language than a single unified model, more like “supervised learning” than like “this 6-layer convnet I trained for neural style transfer.”
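To illustrate how low that bar is, here is a minimal toy (my own construction, with made-up variances and learning rate, loosely in the style of the standard tutorial examples): a one-dimensional agent doing gradient descent on a Laplace-style free energy, over both its belief and an action that shifts its own sensory input. By the loose criterion above, even this “implements” the principle:

```python
import numpy as np

# 1-D agent doing gradient descent on a Laplace-style free energy over its belief mu
# (perception) and over an action a that directly shifts its sensory input.
# F(s, mu) = (s - mu)^2 / (2*sigma_s) + (mu - v_prior)^2 / (2*sigma_v), up to constants.

v_prior, sigma_v = 2.0, 1.0      # prior belief about the hidden cause, and its variance
sigma_s = 0.5                    # assumed sensory noise variance
world_state = 0.0                # the actual hidden cause; the sensed value is world_state + a

mu, a, lr = 0.0, 0.0, 0.05

for t in range(500):
    s = world_state + a                              # sensation the agent actually receives
    dF_dmu = -(s - mu) / sigma_s + (mu - v_prior) / sigma_v
    mu -= lr * dF_dmu                                # perception: belief follows the gradient
    dF_da = (s - mu) / sigma_s                       # dF/ds * ds/da, with ds/da = 1
    a -= lr * dF_da                                  # action: change the world to match the belief

print(mu, a)   # mu ends up at v_prior, and a has shifted the sensation to match it
```

Note that the action ends up fulfilling the agent’s prior rather than tracking the world, which is why the choice of priors does all the interesting work.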
>Are there priors that cannot be represented as utility functions, or vice versa?
No. They’re isomorphic, via the Complete Class Theorem. Any utility/cost function that grows sub-super-exponentially (ie: for which Pascal’s Mugging doesn’t happen) can be expressed as a distribution, and used in the free-energy principle. You can get the intuition by thinking, “This goal specifies how often I want to see outcome X (P), versus its disjoint cousins Y and Z that I want to see such-or-so often (1-P).”
>What explore/exploit tradeoffs do free-energy models lead to, or can they encode any given tradeoff?
This is actually one of the Very Good things about free-energy models: since free energy is “Energy − Entropy”, or “Exploit + Explore”, cast in the same units (bits/nats from info theory), it gives a principled, prescriptive way to make the tradeoff, once you’ve specified how concentrated the probability mass is under the goals in the support set (and thus the multiplicative inverse of the exploit term’s global optimum).
We ought to be able to use this to test the Principle empirically, I think.
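For anyone who wants to see the “Energy − Entropy” bookkeeping concretely, here is a tiny numerical check with made-up numbers (one observation, three hidden states); it just verifies the standard identity that variational free energy can be written either as expected energy minus entropy or as divergence-from-the-posterior plus surprise:

```python
import numpy as np

# Toy check of the identity:
#   F = E_q[-log p(o, s)] - H[q]            ("Energy - Entropy")
#   F = KL(q(s) || p(s|o)) - log p(o)       ("divergence plus surprise")

p_joint = np.array([0.20, 0.10, 0.05])   # p(o, s) for the observed o and each hidden state
q = np.array([0.6, 0.3, 0.1])            # some approximate posterior over the hidden states

energy = -(q * np.log(p_joint)).sum()    # E_q[-log p(o, s)]
entropy = -(q * np.log(q)).sum()         # H[q]
F_energy_entropy = energy - entropy

p_o = p_joint.sum()                      # marginal likelihood p(o)
posterior = p_joint / p_o                # exact posterior p(s|o)
F_divergence_surprise = (q * np.log(q / posterior)).sum() - np.log(p_o)

print(F_energy_entropy, F_divergence_surprise)   # equal up to floating-point error
```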
>No. They’re isomorphic, via the Complete Class Theorem. Any utility/cost function that grows sub-super-exponentially (ie: for which Pascal’s Mugging doesn’t happen) can be expressed as a distribution, and used in the free-energy principle. You can get the intuition by thinking, “This goal specifies how often I want to see outcome X (P), versus its disjoint cousins Y and Z that I want to see such-or-so often (1-P).”
Can you please link me to more on this? I was under the impression that pascal’s mugging happens for any utility function that grows at least as fast as the probabilities shrink, and the probabilities shrink exponentially for normal probability functions. (For example: In the toy model of the St. Petersburg problem, the utility function grows exactly as fast as the probability function shrinks, resulting in infinite expected utility for playing the game.)
Also: As I understand them, utility functions aren’t of the form “I want to see X P often and Y 1-P often.” They are more like “X has utility 200, Y has utility 150, Z has utility 24...” Maybe the form you are talking about is a special case of the form I am talking about, but I don’t yet see how it could be the other way around. As I’m thinking of them, utility functions aren’t about what you see at all. They are just about the world. The point is, I’m confused by your explanation & would love to read more about this.
>Can you please link me to more on this? I was under the impression that pascal’s mugging happens for any utility function that grows at least as fast as the probabilities shrink, and the probabilities shrink exponentially for normal probability functions. (For example: In the toy model of the St. Petersburg problem, the utility function grows exactly as fast as the probability function shrinks, resulting in infinite expected utility for playing the game.)
The Complete Class Theorem says that bounded cost/utility functions are isomorphic to posterior probabilities optimizing their expected values. In that sense, it’s almost a trivial result.
In practice, this just means that we can exchange the two whenever we please: we can take a probability and get an entropy to minimize, or we can take a bounded utility/cost function and bung it through a Boltzmann Distribution.
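Concretely, the “bung it through a Boltzmann Distribution” step is just a softmax over utilities, and you can walk back from the goal distribution to the utilities up to an additive constant. A minimal sketch, using the 200/150/24 utilities mentioned above and an arbitrary temperature:

```python
import numpy as np

# Utilities -> goal distribution via a Boltzmann/softmax, and back again (up to a constant).
# The temperature is an extra arbitrary choice the conversion forces you to make.

utilities = np.array([200.0, 150.0, 24.0])
temperature = 50.0

goal_prior = np.exp(utilities / temperature)
goal_prior /= goal_prior.sum()                    # p(x) proportional to exp(U(x)/T)

recovered = temperature * np.log(goal_prior)      # equals the utilities minus a constant
print(goal_prior)
print(recovered - recovered.max() + utilities.max())   # [200. 150.  24.] again
```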
>Also: As I understand them, utility functions aren’t of the form “I want to see X P often and Y 1-P often.” They are more like “X has utility 200, Y has utility 150, Z has utility 24...” Maybe the form you are talking about is a special case of the form I am talking about, but I don’t yet see how it could be the other way around. As I’m thinking of them, utility functions aren’t about what you see at all. They are just about the world. The point is, I’m confused by your explanation & would love to read more about this.
I was speaking loosely, so “I want to see X” can be taken as, “I want X to happen”. The details remain an open research problem of how the brain (or probabilistic AI) can or should cash out, “X happens” into “here are all the things I expect to observe when X happens, and I use them to gather evidence for whether X has happened, and to control whether X happens and how often”.
For a metaphor of why you’d have “probabilistic” utility functions, consider it as Bayesian uncertainty: “I have degree of belief P that X should happen, and degree of belief 1-P that something else should happen.”
One of the deep philosophical differences is that both Fristonian neurosci and Tenenbaumian cocosci assume that stochasticity is “real enough for government work”, and so there’s no point in specifying “utility functions” over “states” of the world in which all variables are clamped to fully determined values. After all, you yourself as a physically implemented agent have to generate waste heat, so there’s inevitably going to be some stochasticity (call it uncertainty that you’re mathematically required to have) about whatever physical heat bath you dumped your own waste heat into.
(That was supposed to be a reference to Eliezer’s writing on minds doing thermodynamic work (which free-energy minds absolutely do!), not a poop joke.)
Actually, here’s a much simpler, more intuitive way to think about probabilistically specified goals.
Visualize a probability distribution as a heat map of the possibility space. Specifying a probabilistic goal then just says, “Here’s where I want the heat to concentrate”, and submitting it to active inference just uses the available inferential machinery to actually squeeze the heat into that exact concentration as best you can.
When our heat-map takes the form of “heat” over dynamical trajectories, possible “timelines” of something that can move, “squeezing the heat into your desired concentration” means exactly “squeezing the future towards desired regions”. All you’re changing is how you specify desired regions: from giving them an “absolute” value (which can actually undergo any positive affine transformation and remain equivalent) to giving them a purely “relative” value (relative to disjoint events in your sample space).
This is fine, because after all, it’s not like you could really have an “infinite” desire for something finite-sized in the first place. If you choose to think of utilities in terms of money, the “goal probabilities” are just the relative prices you’re willing to pay for a certain outcome: you start with odds, the number of apples you’ll trade for an orange, and convert from odds to probabilities to get your numbers. It’s just using “barter” among disjoint random events instead of “currency”.
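In code, that last conversion is just normalization; for example, with the apples-and-oranges numbers that come up below, 3:1 relative prices become goal probabilities of 0.75 and 0.25:

```python
# "Barter odds -> goal probabilities" is just normalization of relative prices.
values = {"apple": 3.0, "orange": 1.0}      # relative prices you'd pay
total = sum(values.values())
goal_probs = {k: v / total for k, v in values.items()}
print(goal_probs)   # {'apple': 0.75, 'orange': 0.25}
```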
I’m confused so I’ll comment a dumb question hoping my cognitive algorithms are sufficiently similar to other LW:ers, such that they’ll be thinking but not writing this question.
“If I value apples at 3 units and oranges at 1 unit, I don’t want a 75%/25% split. I only want apples, because they’re better! (I have no diminishing returns.)”
Where does this reasoning go wrong?
>”If I value apples at 3 units and oranges at 1 unit, I don’t want a 75%/25% split. I only want apples, because they’re better! (I have no diminishing returns.)”
I think what I’d have to ask here is: if you only want apples, why are you spending your money on oranges? If you will not actually pay me 1 unit for an orange, why do you claim you value oranges at 1 unit?
Another construal: you value oranges at 1 orange per 1 unit because if I offer you a lottery over those and let you set the odds yourself, you will choose to set them to 50⁄50. You’re indifferent to which one you receive, so you value them equally. We do the same trick with apples and find you value them at 3 units per 1 apple.
I now offer you a lottery between receiving 3 apples and 1 orange, and I’ll let you pay 3 units to tilt the odds by one expected apple. Since the starting point was 1.5 expected apples and 0.5 expected oranges, and you insist you want only 3 expected apples and 0 expected oranges, I believe I can make you end up paying more than 3 units per apple now, despite our having established that as your “price”.
The lesson is, I think, don’t offer to pay finite amounts of money for outcomes you want literally zero of, as someone may in fact try to take you up on it.
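To spell out the arithmetic in that lottery (this is my reading of the setup above, not necessarily the exact book the author had in mind): tilting from the 50/50 start to “all apples” costs both the stated price per expected apple and the expected oranges you claimed to value, so the effective price per apple comes out above 3 units:

```python
# One way to run the numbers in the lottery above (an interpretation, not the author's own working).
start_apples, start_oranges = 0.5 * 3, 0.5 * 1    # the 50/50 lottery over 3 apples vs 1 orange
target_apples, target_oranges = 3.0, 0.0          # "I only want apples"

payment = (target_apples - start_apples) * 3.0            # 1.5 expected apples at 3 units each = 4.5
oranges_forgone = (start_oranges - target_oranges) * 1.0  # 0.5 units of value at your stated price

cost_per_extra_apple = (payment + oranges_forgone) / (target_apples - start_apples)
print(cost_per_extra_apple)   # ~3.33 units per apple, above the stated price of 3
```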
>That was much more informative than most of the papers. Did you learn this by parsing the papers or from another better source?
Honestly, I’ve just had to go back and forth banging my head on Friston’s free-energy papers, non-Friston free-energy papers, and the ordinary variational inference literature—for the past two years, prior to which I spent three years banging my head on the Josh Tenenbaum-y computational cog-sci literature and got used to seeing probabilistic models of cognition.
I’m now really fucking glad to be in a PhD program where I can actually use that knowledge.
Oh, and btw, everyone at MIRI was exactly as confused as Scott is when I presented a bunch of free-energy stuff to them last March.
Sorry for the bold, sometimes our editor does weird things with copy-paste and bolds everything you pasted. Working on a fix for that, but it’s an external library and that’s always a bit harder than fixing our code.
Re: the “when friends and colleagues first come across this conclusion...” quote:
A world where everybody’s true desire is to rest in bed as much as possible, but where they grudgingly take the actions needed to stay alive and maintain homeostasis, seems both very imaginable, and also very different from what we observe.
Agreed. ‘Rest in bed as much as possible but grudgingly take the actions needed to stay alive’ sounds a lot like depression, but there exist non-depressed people who need explaining.
I wonder if the conversion from mathematics to language is causing problems somewhere. The prose description you are working with is ‘take actions that minimize prediction error’ but the actual model is ‘take actions that minimize a complicated construct called free energy’. Sitting in a dark room certainly works for the former but I don’t know how to calculate it for the latter.
In the paper I linked, the free energy minimizing trolleycar does not sit in the valley and do nothing to minimize prediction error. It moves to keep itself on the dynamic escape trajectory that it was trained with and so predicts itself achieving. So if we understood why that happens we might unravel the confusion.
>I wonder if the conversion from mathematics to language is causing problems somewhere. The prose description you are working with is ‘take actions that minimize prediction error’ but the actual model is ‘take actions that minimize a complicated construct called free energy’. Sitting in a dark room certainly works for the former but I don’t know how to calculate it for the latter.
There’s absolutely trouble here. “Minimizing surprise” always means, to Friston, minimizing sensory surprise under a generative model: −log p(s|m). The problem is that, of course, in the course of constructing this, you had to marginalize out all the interesting variables that make up your generative model, so you’re really looking at −log ∫∫ p(s, r, Ψ|m) dr dΨ or something similar.
Mistaking “surprise” in this context for the actual self-information of the empirical distribution of sense-data, −log p(s), makes the whole thing fall apart.
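Here is a toy numerical version of that distinction, with made-up numbers: the model’s marginal surprise −log p(s|m) comes from summing out the hidden states, and it need not match the self-information of whatever empirical distribution the world actually produces:

```python
import numpy as np

# Two hidden states, two possible sensations; all numbers invented for illustration.
p_psi = np.array([0.7, 0.3])            # prior over hidden states under the model m
p_s_given_psi = np.array([[0.9, 0.1],   # likelihood p(s | psi); rows are psi, columns are s
                          [0.2, 0.8]])

p_s_model = p_psi @ p_s_given_psi       # p(s|m): the hidden states marginalized out
model_surprise = -np.log(p_s_model)     # the "surprise" Friston means

p_s_empirical = np.array([0.5, 0.5])    # what the environment actually serves up
empirical_surprise = -np.log(p_s_empirical)

print(model_surprise)      # roughly [0.37, 1.17]
print(empirical_surprise)  # [0.69, 0.69]: a different quantity entirely
```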
>In the paper I linked, the free energy minimizing trolleycar does not sit in the valley and do nothing to minimize prediction error. It moves to keep itself on the dynamic escape trajectory that it was trained with and so predicts itself achieving. So if we understood why that happens we might unravel the confusion.
If you look closely, Friston’s downright cheating in that paper. First he “immerses” his car in its “statistical bath” that teaches it where to go, with only perceptual inference allowed. Then he turns off perceptual updating, leaving only action as a means of resolving free-energy, and points out that thusly, the car tries to climb the mountain as active inference proceeds.
It would be interesting if anyone knows of historical examples where someone had a key insight, but nonetheless fulfilled your “emperor has no clothes” criteria.