The various papers don’t all even implement the same model—the free energy principle seems to be more a design principle than a specific model.
Bingo. Friston trained as a physicist, and he wants the free-energy principle to be more like a physical law than a computer program. You can write basically any computer program that implements or supports variational inference, throw in some action states as variational parameters, and you’ve “implemented” the free-energy principle _in some way_.
Overall, the Principle is more of a domain-specific language than a single unified model, more like “supervised learning” than like “this 6-layer convnet I trained for neural style transfer.”
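To make that concrete, here is a minimal sketch (my own toy numbers and a made-up two-state world, not anything from a Friston paper) of what “variational inference plus action states as variational parameters” can look like: the same free-energy quantity is minimized over both the belief q(s) and the choice of action.

```python
import numpy as np

# Toy sketch of "variational inference with action as a variational parameter".
# Everything here (states, numbers, the two actions) is invented for illustration.

p_prior = np.array([0.1, 0.9])           # p(s): the agent's prior/goal over hidden state s

# p(o | s, a): rows index s, columns index o.
# Action 0 decouples observations from the state; action 1 couples them tightly.
p_obs = {
    0: np.array([[0.5, 0.5],
                 [0.5, 0.5]]),
    1: np.array([[0.9, 0.1],
                 [0.1, 0.9]]),
}

def free_energy(q, o, a):
    """Variational free energy F = E_q[log q(s) - log p(o, s | a)]."""
    joint = p_obs[a][:, o] * p_prior      # p(o, s | a) as a vector over s
    return float(np.sum(q * (np.log(q) - np.log(joint))))

o_seen = 1                                # the observation actually received
candidates = [np.array([1 - x, x]) for x in np.linspace(0.01, 0.99, 99)]

for a in (0, 1):
    F_best = min(free_energy(q, o_seen, a) for q in candidates)
    print(f"action {a}: min free energy = {F_best:.3f}")
# Action 1 comes out lower: the same machinery that fits q(s) also scores actions.
```

The action that comes out ahead is the one under which the observation is least surprising given the agent’s prior, which is the sense in which the same machinery does both perception and control.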
Are there priors that cannot be represented as utility functions, or vice versa?

No. They’re isomorphic, via the Complete Class Theorem. Any utility/cost function that grows sub-super-exponentially (i.e., one for which Pascal’s Mugging doesn’t happen) can be expressed as a distribution and used in the free-energy principle. You can get the intuition by thinking, “This goal specifies how often I want to see outcome X (P), versus its disjoint cousins Y and Z that I want to see such-or-so often (1-P).”
What explore/exploit tradeoffs do free-energy models lead to, or can they encode any given tradeoff?

This is actually one of the Very Good things about free-energy models: since free energy is “Energy − Entropy”, or “Exploit + Explore”, cast in the same units (bits/nats from info theory), it gives a principled, prescriptive way to make the tradeoff, once you’ve specified how concentrated the probability mass is under the goals in the support set (and thus the multiplicative inverse of the exploit term’s global optimum).
We ought to be able to use this to test the Principle empirically, I think.
(EDIT: Dear God, why was everything bold!?)
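One way to cash out the “Energy − Entropy” reading above, with a goal distribution playing the role of the generative model over outcomes (the numbers and candidate policies below are invented for illustration):

```python
import numpy as np

# Free energy of a candidate policy as "Energy - Entropy" (exploit + explore).
# The goal distribution and the candidate policies are invented for illustration.

p_goal = np.array([0.7, 0.2, 0.1])    # where we want the probability mass to end up

policies = {
    "greedy":   np.array([0.98, 0.01, 0.01]),     # all-in on the single best outcome
    "hedged":   np.array([0.70, 0.20, 0.10]),     # matches the goal's concentration
    "ignorant": np.array([1 / 3, 1 / 3, 1 / 3]),  # maximum spread
}

for name, q in policies.items():
    energy = -np.sum(q * np.log(p_goal))   # exploit term: expected goal log-probability
    entropy = -np.sum(q * np.log(q))       # explore term: spread of the prediction
    F = energy - entropy                   # equals KL(q || p_goal)
    print(f"{name:8s} energy={energy:.3f} entropy={entropy:.3f} F={F:.3f}")
```

The winner is the policy whose predicted outcomes match the goal’s concentration, not the one that piles all its mass on the single most-preferred outcome; that distinction is exactly what the apples-and-oranges exchange further down turns on.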
Can you please link me to more on this? I was under the impression that Pascal’s Mugging happens for any utility function that grows at least as fast as the probabilities shrink, and the probabilities shrink exponentially for normal probability functions. (For example: In the toy model of the St. Petersburg problem, the utility function grows exactly as fast as the probability function shrinks, resulting in infinite expected utility for playing the game.)
Also: As I understand them, utility functions aren’t of the form “I want to see X P often and Y 1-P often.” They are more like “X has utility 200, Y has utility 150, Z has utility 24...” Maybe the form you are talking about is a special case of the form I am talking about, but I don’t yet see how it could be the other way around. As I’m thinking of them, utility functions aren’t about what you see at all. They are just about the world. The point is, I’m confused by your explanation & would love to read more about this.
The Complete Class Theorem says that bounded cost/utility functions are isomorphic to posterior probabilities optimizing their expected values. In that sense, it’s almost a trivial result.
In practice, this just means that we can exchange the two whenever we please: we can take a probability and get an entropy to minimize, or we can take a bounded utility/cost function and bung it through a Boltzmann Distribution.
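As a minimal sketch of that second direction (the utilities are the 200/150/24 from the comment above; the temperature is an arbitrary choice of mine):

```python
import numpy as np

# Minimal sketch of "take a bounded utility function and bung it through a
# Boltzmann distribution". Utilities are the ones from the comment above;
# the temperature T is an arbitrary choice.

utilities = np.array([200.0, 150.0, 24.0])   # U(X), U(Y), U(Z)
T = 25.0                                     # temperature: how strictly utility is enforced

p = np.exp(utilities / T)
p /= p.sum()                                 # goal distribution: p(x) proportional to exp(U(x)/T)

# ...and back again: surprisal recovers the costs up to an additive constant.
costs = -T * np.log(p)                       # equals -U(x) + T * log(normalizer)
```

Lower temperatures concentrate the goal distribution more sharply on the highest-utility outcome, and the additive constant in the recovered costs is the usual translation slack in utility functions (the temperature absorbs the scale).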
I was speaking loosely, so “I want to see X” can be taken as “I want X to happen”. The details remain an open research problem: how the brain (or a probabilistic AI) can or should cash out “X happens” into “here are all the things I expect to observe when X happens, and I use them to gather evidence for whether X has happened, and to control whether X happens and how often”.
For a metaphor of why you’d have “probabilistic” utility functions, consider it as Bayesian uncertainty: “I have degree of belief P that X should happen, and degree of belief 1-P that something else should happen.”
One of the deep philosophical differences is that both Fristonian neurosci and Tenenbaumian cocosci assume that stochasticity is “real enough for government work”, and so there’s no point in specifying “utility functions” over “states” of the world in which all variables are clamped to fully determined values. After all, you yourself as a physically implemented agent have to generate waste heat, so there’s inevitably going to be some stochasticity (call it uncertainty that you’re mathematically required to have) about whatever physical heat bath you dumped your own waste heat into.
(That was supposed to be a reference to Eliezer’s writing on minds doing thermodynamic work (which free-energy minds absolutely do!), not a poop joke.)
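If it helps, that “degree of belief P that X should happen” reading can be cashed out as nothing more exotic than a mixture of goal distributions (a sketch with invented numbers):

```python
import numpy as np

# "Degree of belief P that X should happen" as a mixture of goal distributions.
# Outcomes and numbers are invented for illustration.

P = 0.8                                       # belief that X is the thing that should happen
goal_if_X     = np.array([0.9, 0.05, 0.05])   # mass concentrated on X-ish outcomes
goal_if_other = np.array([0.1, 0.45, 0.45])   # mass spread over the alternatives

p_goal = P * goal_if_X + (1 - P) * goal_if_other
# This mixture is the target distribution that gets handed to active inference.
```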
Actually, here’s a much simpler, more intuitive way to think about probabilistically specified goals.
Visualize a probability distribution as a heat map of the possibility space. Specifying a probabilistic goal then just says, “Here’s where I want the heat to concentrate”, and submitting it to active inference just uses the available inferential machinery to actually squeeze the heat into that exact concentration as best you can.
When our heat map takes the form of “heat” over dynamical trajectories, possible “timelines” of something that can move, “squeezing the heat into your desired concentration” means exactly “squeezing the future towards desired regions”. All you’re changing is how you specify desired regions: from giving them an “absolute” value (which can actually undergo any positive affine transformation and remain equivalent) to giving them a purely “relative” value (relative to disjoint events in your sample space).
This is fine, because after all, it’s not like you could really have an “infinite” desire for something finite-sized in the first place. If you choose to think of utilities in terms of money, the “goal probabilities” are just the relative prices you’re willing to pay for a certain outcome: you start with odds, the number of apples you’ll trade for an orange, and convert from odds to probabilities to get your numbers. It’s just using “barter” among disjoint random events instead of “currency”.
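A sketch of that last conversion, from barter ratios to goal probabilities (the 3:1 valuation is just an example, chosen to line up with the question below):

```python
# Converting relative prices ("barter" ratios) into goal probabilities,
# as described above. The 3:1 valuation is just an example.

values = {"apple": 3.0, "orange": 1.0}   # you'd trade 3 oranges for an apple

total = sum(values.values())
goal = {k: v / total for k, v in values.items()}
print(goal)   # {'apple': 0.75, 'orange': 0.25}
```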
I’m confused, so I’ll comment a dumb question, hoping my cognitive algorithms are sufficiently similar to other LWers’ that they’ll be thinking, but not writing, this question.
“If I value apples at 3 units and oranges at 1 unit, I don’t want a 75%/25% split. I only want apples, because they’re better! (I have no diminishing returns.)”
Where does this reasoning go wrong?
>”If I value apples at 3 units and oranges at 1 unit, I don’t want a 75%/25% split. I only want apples, because they’re better! (I have no diminishing returns.)”
I think what I’d have to ask here is: if you only want apples, why are you spending your money on oranges? If you will not actually pay me 1 unit for an orange, why do you claim you value oranges at 1 unit?
Another construal: you value oranges at 1 unit per orange because, if I offer you a lottery between receiving an orange and receiving 1 unit and let you set the odds yourself, you will choose to set them to 50/50. You’re indifferent to which one you receive, so you value them equally. We do the same trick with apples and find you value them at 3 units per 1 apple.
I now offer you a lottery between receiving 3 apples and 1 orange, and I’ll let you pay 3 units to tilt the odds by one expected apple. Since the starting point was 1.5 expected apples and 0.5 expected oranges, and you insist you want only 3 expected apples and 0 expected oranges, I believe I can make you end up paying more than 3 units per apple now, despite our having established that as your “price”.
The lesson is, I think, don’t offer to pay finite amounts of money for outcomes you want literally zero of, as someone may in fact try to take you up on it.
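One way to run the numbers in that lottery, reading “pay 3 units to tilt the odds by one expected apple” as a flat 3 units per unit increase in expected apples (the exact mechanics are my reading, since they are not spelled out above):

```python
# Running the numbers in the lottery above. I read "pay 3 units to tilt the
# odds by one expected apple" as a flat 3 units per unit of expected apples;
# that reading is my assumption.

apple_value, orange_value = 3.0, 1.0        # your stated prices
start  = {"apples": 1.5, "oranges": 0.5}    # 50/50 over (3 apples) vs (1 orange)
target = {"apples": 3.0, "oranges": 0.0}    # "only apples, please"

extra_apples = target["apples"] - start["apples"]    # 1.5
lost_oranges = start["oranges"] - target["oranges"]  # 0.5
cash_paid    = apple_value * extra_apples            # 4.5 units

# Price per extra expected apple, once the forgone oranges are charged
# at your own stated value:
effective_price = (cash_paid + orange_value * lost_oranges) / extra_apples
print(effective_price)   # 3.33... units per apple, above the claimed 3.0
```

Insisting on the all-apples corner works out to about 3.33 units per extra expected apple once the forgone oranges are charged at your own stated price, which is the “more than 3 units per apple” above.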
The problem with the typeface on LW comments is that I, l and 1 look really damn similar.
That was much more informative than most of the papers. Did you learn this by parsing the papers or from another better source?
Honestly, I’ve just had to go back and forth banging my head on Friston’s free-energy papers, non-Friston free-energy papers, and the ordinary variational inference literature—for the past two years, prior to which I spent three years banging my head on the Josh Tenenbaum-y computational cog-sci literature and got used to seeing probabilistic models of cognition.
I’m now really fucking glad to be in a PhD program where I can actually use that knowledge.
Oh, and btw, everyone at MIRI was exactly as confused as Scott is when I presented a bunch of free-energy stuff to them last March.
Sorry for the bold, sometimes our editor does weird things with copy-paste and bolds everything you pasted. Working on a fix for that, but it’s an external library and that’s always a bit harder than fixing our code.