The utility function U(w) corresponds to the distribution P(w)∝exp(U(w)).
Not so fast.
Keep in mind that the utility function is defined only up to an arbitrary positive affine transformation, while the softmax distribution is invariant only under shifts: P(w)∝exp(βU(w)) is a different distribution for each inverse temperature β (the higher β, the more sharply the distribution peaks on its mode), whereas in the von Neumann–Morgenstern theory of utility, U(w) and Û(w)≡βU(w) represent the same preferences for any positive β.
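To see the asymmetry concretely, here is a quick numerical sketch (mine, not from the original exchange; the utilities are arbitrary): shifting U leaves P(w)∝exp(U(w)) unchanged, while rescaling by β does not, even though both transformations preserve vNM preferences.

```python
# My own sketch, not code from the thread; the utilities below are arbitrary.
# P(w) ∝ exp(U(w)) is unchanged when U is shifted by a constant, but changes
# when U is rescaled by an inverse temperature β.
import numpy as np

def softmax(u):
    z = np.exp(u - np.max(u))   # subtract the max for numerical stability
    return z / z.sum()

U = np.array([1.0, 2.0, 4.0])   # utilities of three hypothetical world states

print(softmax(U))               # baseline distribution
print(softmax(U + 10.0))        # shift: exactly the same distribution
print(softmax(5.0 * U))         # β = 5: far more peaked on the mode
```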
Maximizing expected log probability under this distribution is exactly the same as maximizing the expectation of U.
It’s not exactly the same.
Let’s assume that there are two possible world states: 0 and 1, and two available actions: action A puts the world in state 0 with 99% probability (QA(0)=0.99) while action B puts the world in state 0 with 50% probability (QB(0)=0.5).
Let U(0)=10⁻³ and U(1)=0.
Under expected utility maximization, action A is clearly optimal.
Now define P(w)∝exp(U(w)).
The expected log-probability (the negative cross-entropy) −H(P,QA) is ≈ −2.31 nats, while −H(P,QB) is ≈ −0.69 nats, hence action B is optimal.
You do get action A as optimal if you reverse the distributions in the negative cross-entropies (−H(QA,P) and −H(QB,P)), but this does not correspond to how inference is normally done.
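For concreteness, here is a minimal sketch (my own, not code from the thread; variable names are mine) that reproduces these numbers with numpy:

```python
# Reproduce the two-state example: U(0) = 10^-3, U(1) = 0,
# QA = (0.99, 0.01), QB = (0.5, 0.5).
import numpy as np

U  = np.array([1e-3, 0.0])
QA = np.array([0.99, 0.01])
QB = np.array([0.50, 0.50])

P = np.exp(U) / np.exp(U).sum()            # P(w) ∝ exp(U(w))

def neg_cross_entropy(p, q):
    """Expected log-probability E_p[log q], in nats."""
    return float(np.sum(p * np.log(q)))

print("E[U] under A, B:   ", QA @ U, QB @ U)               # 0.00099 vs 0.0005 -> A wins
print("-H(P,QA), -H(P,QB):", neg_cross_entropy(P, QA),     # ≈ -2.31 nats
                             neg_cross_entropy(P, QB))     # ≈ -0.69 nats -> B wins
print("-H(QA,P), -H(QB,P):", neg_cross_entropy(QA, P),     # reversed direction:
                             neg_cross_entropy(QB, P))     # A wins again (barely)
```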
To get behavior you need preferences + temperature; that's what I meant by saying there was a difference between wanting X a little and wanting X a lot.
I agree that the formulation I gave benefits actions that generate a lot of entropy. Really you want to consider the causal entropy of your actions. I think that means P(τ)∝exp(E[U(τ)]) for each sequence of actions τ. I agree that's less elegant.
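Read literally on the same toy example (my interpretation of the proposal, with single-action "sequences"), P(τ)∝exp(E[U(τ)]) does put slightly more mass on action A, though only barely, since no inverse temperature is applied:

```python
# Sketch of my reading of the proposal, not established code: a softmax over
# action sequences weighted by *expected* utility E[U(τ)], rather than over
# world states weighted by U(w).
import numpy as np

U  = np.array([1e-3, 0.0])        # same toy utilities as above
QA = np.array([0.99, 0.01])       # outcome distribution of action A
QB = np.array([0.50, 0.50])       # outcome distribution of action B

expected_U = np.array([QA @ U, QB @ U])          # E[U(τ)] for τ in {A, B}
P_tau = np.exp(expected_U) / np.exp(expected_U).sum()

print(P_tau)   # ≈ [0.5001, 0.4999]: action A is (slightly) preferred,
               # matching the expected-utility ranking, though only weakly
               # because U is tiny and no β is applied.
```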