I don’t understand this part:
"any value function can be maximized by some utility function over short-term outcomes."
What is the difference between far in the future and near in the future?
Here’s what it would typically look like in a control theory problem.
There’s a long-term utility u_L, which is a function of the final state x(T), and a short-term utility u_S, which is a function of the time t, the state x(t) at time t, and the action a(t) at time t. (Often the problem is formulated with a discount rate γ, but since we’re allowing time-dependent short-term utility here, we can just absorb the discount rate into u_S.) The objective is then to maximize
u_L(x(T)) + \sum_t u_S(t, x(t), a(t))
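To make that parenthetical about the discount rate concrete (the per-step utility u here is my own notation, not from the post): if the usual discounted objective is

\sum_t \gamma^t u(x(t), a(t)),

then defining u_S(t, x, a) := \gamma^t u(x, a) recovers it as a special case of the time-dependent short-term utility above.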
In that case, the value function V(x,t) is a max over trajectories starting at (x,t):
V(x, t) = \max_{\text{trajectory}} \left[ u_L(x(T)) + \sum_{\tau \ge t} u_S(\tau, x(\tau), a(\tau)) \right]
The key thing to notice is that we can solve that equation for u_S(t, x(t), a(t)):
u_S(t, x(t), a(t)) = V(x, t) - \max_{\text{trajectory}} \left[ u_L(x(T)) + \sum_{\tau > t} u_S(\tau, x(\tau), a(\tau)) \right]
So given an arbitrary value function V, we can find a short-term utility function u_S which produces that value function, by using that equation to compute u_S starting from the last timestep and working backwards: at each step, the max on the right-hand side only involves u_S at later times, which has already been computed.
Thus the claim from the post: for any value function, there exists a short-term utility function which induces that value function.
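To see the construction actually go through, here’s a minimal sketch in Python (my own toy example, not from the post): a small deterministic control problem with made-up dynamics f, an arbitrary (random) target value function, the backward construction of u_S above, and a check that the value function induced by (u_L, u_S) reproduces the target.

```python
import random

# A minimal sketch of the backward construction (toy setup of my own, not from
# the post): a finite deterministic control problem with 3 states, 2 actions,
# and horizon T = 3. The dynamics f and all names here are made up.
STATES = [0, 1, 2]
ACTIONS = [0, 1]
T = 3  # actions are taken at t = 0, 1, 2; the final state is x(T)

def f(x, a):
    """Hypothetical deterministic dynamics x(t+1) = f(x(t), a(t))."""
    return (x + a + 1) % 3

# Long-term utility of the final state, plus an *arbitrary* target value
# function V_target[t][x] for t < T (random numbers, to stress the claim that
# any value function can be induced).
u_L = {x: float(x) for x in STATES}
random.seed(0)
V_target = {t: {x: random.uniform(-5.0, 5.0) for x in STATES} for t in range(T)}

# Value-to-go from (x, t+1) onward: u_L at the horizon, V_target earlier
# (valid by induction once u_S has been constructed for times > t).
def value_next(t, x):
    return u_L[x] if t + 1 == T else V_target[t + 1][x]

# Backward construction: u_S(t, x, a) = V(x, t) - max_a' [value-to-go from f(x, a')].
u_S = {}
for t in reversed(range(T)):
    for x in STATES:
        best_next = max(value_next(t, f(x, a)) for a in ACTIONS)
        for a in ACTIONS:
            u_S[(t, x, a)] = V_target[t][x] - best_next

# Check: the value function induced by (u_L, u_S) via dynamic programming
# reproduces the arbitrary target exactly.
V_induced = {T: dict(u_L)}
for t in reversed(range(T)):
    V_induced[t] = {
        x: max(u_S[(t, x, a)] + V_induced[t + 1][f(x, a)] for a in ACTIONS)
        for x in STATES
    }

for t in range(T):
    for x in STATES:
        assert abs(V_induced[t][x] - V_target[t][x]) < 1e-9
print("induced value function matches the arbitrary target V")
```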
What if we restrict to only long-term utility, i.e. set u_S = 0? Well, then the value function is no longer so arbitrary. That’s the case considered in the post, where we have constraints which the value function must satisfy regardless of u_L.
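For instance, with deterministic dynamics x(t+1) = f(x(t), a(t), t) (notation mine, not from the post), setting u_S = 0 forces the value function to satisfy

V(x, t) = \max_a V(f(x, a, t), t + 1),

a consistency condition on V which holds no matter what u_L is.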
Did that clarify?
What’s wrong with calling the “short-term utility function” a “reward function”?
“Reward function” is a much more general term, which IMO has been overused to the point where it arguably doesn’t even have a clear meaning. “Utility function” is less general: it always connotes an optimization objective, something which is being optimized for directly. And that basically matches the usage here.
I had to mull over it for five days, hunt down some background materials to fill in context, write follow-up questions to a few friends (reviewing their responses on my phone while commuting), and then slowly chew through the math with pencil and paper when I could get spare time… but yes, I understand now!