Scott writes on tumblr:
I don’t think I even understand the most basic point about how a probability distribution equals a utility function. What’s the probability distribution equal to “maximize paperclips”? Is it “state of the world with lots of paperclips: 100%; state of the world with no paperclips: 0%”? How do you assign probability to states of the world with 5, 10, or 200 paperclips?
I know nothing about this discussion, but this one is easy:
The utility function U(w) corresponds to the distribution P(w)∝exp(U(w)).
(i.e. P(w)=exp(U(w))/Z, where Z is a meaningless number we choose to make the total probability add up to 1.)
Without math: every time you add one paperclip to a possible world, you make it 10% more likely. On this perspective, there is a difference between kind of wanting paperclips and really wanting paperclips—if you really want paperclips, adding one paperclip to the world makes it twice as likely. This determines how you trade off paperclips vs. other kinds of surprise.
Maximizing expected log probability under this distribution is exactly the same as maximizing the expectation of U.
You can combine the exp(U) term with other facts you know about the world w, by multiplying them (and then adjusting the normalization constant Z appropriately).
A very similar formulation is often used in inverse reinforcement learning (MaxEnt IRL).
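To make the correspondence concrete, here is a minimal numerical sketch (the utilities and the outcome distribution Q below are made up purely for illustration, not taken from anywhere in this discussion):

```python
import numpy as np

# Toy world: the state w is just the number of paperclips (0..4).
U = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # U(w) = number of paperclips

# Turn the utility function into a distribution: P(w) ∝ exp(U(w)).
Z = np.exp(U).sum()                        # normalization constant
P = np.exp(U) / Z

# Each extra paperclip multiplies the probability by exp(1) ≈ 2.7; using
# U(w) = 0.1 * w instead would give roughly the "10% more likely" version.
print(P[1] / P[0])                         # ≈ 2.718

# Expected log P under any outcome distribution Q differs from expected
# utility only by the constant log Z, so both rank actions identically.
Q = np.array([0.1, 0.1, 0.2, 0.3, 0.3])    # hypothetical outcome distribution
print(Q @ np.log(P), Q @ U - np.log(Z))    # same number (up to float error)

# Combining exp(U) with other facts about w: multiply, then renormalize.
evidence = np.array([1.0, 1.0, 1.0, 0.0, 0.0])   # e.g. "we know w < 3"
posterior = np.exp(U) * evidence
posterior /= posterior.sum()
print(posterior)
```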
Another part of the picture that isn’t complicated is that the exact same algorithms can be used for probabilistic inference (finding good explanations for the data) and planning (finding a plan that achieves some goal). In fact this connection is useful and people in AI sometimes exploit it. It’s a bit deeper than it sounds but not that deep. See planning as inference, which Eli mentions above. It seems worth understanding this simple idea before trying to understand some extremely confusing pile of ideas.
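The simplest instance of that connection is just Bayes’ rule with the goal treated as data; a toy sketch (the numbers are invented):

```python
import numpy as np

# Planning as inference, in its simplest form: condition on the goal being
# achieved and infer which action was taken.
actions = ["press_button", "do_nothing"]
prior = np.array([0.5, 0.5])                 # prior over actions
p_goal_given_action = np.array([0.9, 0.1])   # P(goal | action)

# Ordinary Bayesian inference, except the "observation" is the desired outcome:
posterior = prior * p_goal_given_action
posterior /= posterior.sum()
print(dict(zip(actions, posterior)))         # mass concentrates on the useful action

# The same code would be used to explain an *observed* goal state, which is
# the sense in which inference and planning can share algorithms.
```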
Another important distinction: there are two different algorithms one might describe as “minimizing prediction error”:
I think the more natural one is algorithm A: you adjust your beliefs to minimize prediction error (after translating your preferences into “optimistic beliefs”). Then you act according to your beliefs about how you will act. This is equivalent to independently forming beliefs and then acting to get what you want, it’s just an implementation detail.
There is a much more complicated family of algorithms, call them algorithm B, where you actually plan in order to change the observations you’ll make in the future, with the goal of minimizing prediction error. This is the version that would cause you to e.g. go read a textbook, or lock yourself in a dark room. This version is algorithmically way more complicated to implement, even though it maybe sounds simpler. It also has all kinds of weird implications and it’s not easy to see how to turn it into something that isn’t obviously wrong.
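To make the contrast concrete, here is a deliberately crude toy sketch (the two actions, the numbers, and the exact objectives are all invented; in particular, “minimize the entropy of your predicted observations” is only one simple way to cash out B’s objective, not Friston’s actual formalism):

```python
import numpy as np

obs_utility = np.array([1.0, 0.0])         # the agent likes observation 0
Q = {
    "explore":   np.array([0.5, 0.5]),     # rewarding on average, unpredictable
    "dark_room": np.array([0.01, 0.99]),   # unrewarding, almost perfectly predictable
}

def algorithm_A(Q):
    # Preferences get folded into "optimistic beliefs", but the resulting
    # ranking of actions reduces to ordinary expected utility.
    return max(Q, key=lambda a: Q[a] @ obs_utility)

def algorithm_B(Q):
    # Act so that future observations are predictable: minimize the expected
    # surprise -log Q(o), i.e. the entropy of the observation distribution.
    return min(Q, key=lambda a: -(Q[a] * np.log(Q[a])).sum())

print(algorithm_A(Q))   # explore
print(algorithm_B(Q))   # dark_room  (the classic "dark room" behavior)
```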
Regardless of which view you prefer, it seems important to recognize the difference between the two. In particular, evidence that we use algorithm A shouldn’t be interpreted as evidence that we use algorithm B.
It sounds like Friston intends algorithm B. This version is pretty different from anything that researchers in AI use, and I’m pretty skeptical (based on observations of humans and the surface implausibility of the story rather than any knowledge about the area).
Paul, this is very helpful! Finally I understand what this “active inference” stuff is about. I wonder whether there have been any significant theoretical results about these methods since Rawlik et al. 2012?
Oh hey, so that’s the original KL control paper. Saved!
The utility function U(w) corresponds to the distribution P(w)∝exp(U(w)).
Not so fast.
Keep in mind that a utility function is defined only up to an arbitrary positive affine transformation, while the softmax distribution is invariant only under shifts: P(w) ∝ exp(βU(w)) is a different distribution for each value of the inverse temperature β (the higher β, the more peaked the distribution around its mode), whereas in the von Neumann–Morgenstern theory of utility, U(w) and Û ≡ βU(w) represent the same preferences for any positive β.
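A quick numerical illustration of the β point (the utilities are arbitrary):

```python
import numpy as np

U = np.array([0.0, 1.0, 2.0])              # arbitrary utilities over three states

def softmax_of_utility(U, beta):
    p = np.exp(beta * U)
    return p / p.sum()

print(softmax_of_utility(U, 0.1))    # close to uniform
print(softmax_of_utility(U, 10.0))   # nearly all mass on the best state
# Both values of β encode the same vNM preferences (the ranking of states is
# unchanged), yet P(w) ∝ exp(βU(w)) is clearly a different distribution.
```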
Maximizing expected log probability under this distribution is exactly the same as maximizing the expectation of U.
It’s not exactly the same.
Let’s assume that there are two possible world states, 0 and 1, and two available actions: action A puts the world in state 0 with 99% probability (Q_A(0) = 0.99), while action B puts the world in state 0 with 50% probability (Q_B(0) = 0.5).
Let U(0) = 10⁻³ and U(1) = 0.
Under expected utility maximization, action A is clearly optimal.
Now define P(w) ∝ exp(U(w)).
The expected log-probability (the negative cross-entropy) −H(P, Q_A) is ≈ −2.31 nats, while −H(P, Q_B) is ≈ −0.69 nats, hence action B is optimal.
You do get action A as optimal if you reverse the distributions in the negative cross-entropies (−H(Q_A, P) and −H(Q_B, P)), but this does not correspond to how inference is normally done.
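A few lines to check the numbers (this just reproduces the arithmetic above):

```python
import numpy as np

U = np.array([1e-3, 0.0])                  # U(0), U(1)
P = np.exp(U) / np.exp(U).sum()            # P(w) ∝ exp(U(w))
Q_A = np.array([0.99, 0.01])
Q_B = np.array([0.5, 0.5])

def neg_cross_entropy(p, q):               # -H(p, q) = Σ_w p(w) log q(w)
    return (p * np.log(q)).sum()

print(neg_cross_entropy(P, Q_A))   # ≈ -2.31 nats  -> B looks better this way
print(neg_cross_entropy(P, Q_B))   # ≈ -0.69 nats
print(neg_cross_entropy(Q_A, P))   # ≈ -0.6927     -> with the arguments reversed,
print(neg_cross_entropy(Q_B, P))   # ≈ -0.6931        A is (very slightly) better
```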
To get behavior you need preferences + temperature; that’s what I meant by saying there was a difference between wanting X a little and wanting X a lot.
I agree that the formulation I gave benefits actions that generate a lot of entropy. Really you want to consider the causal entropy of your actions. I think that means P(τ) ∝ exp(E[U(τ)]) for each sequence of actions τ. I agree that’s less elegant.
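For what it’s worth, under that P(τ) ∝ exp(E[U(τ)]) formulation the earlier two-action example does come out the expected-utility way (a sketch of how I read the proposal, with the temperature implicitly set to 1):

```python
import numpy as np

U = np.array([1e-3, 0.0])
Q = {"A": np.array([0.99, 0.01]), "B": np.array([0.5, 0.5])}

expected_U = {a: Q[a] @ U for a in Q}             # E[U(τ)] for each action
weights = {a: np.exp(eu) for a, eu in expected_U.items()}
Z = sum(weights.values())
print({a: w / Z for a, w in weights.items()})     # A gets (slightly) more mass,
                                                  # matching expected utility maximization
```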