Here’s what it would typically look like in a control theory problem.
There’s a long-term utility uL, which is a function of the final state x(T), and a short-term utility uS, which is a function of the time t, the state x(t) at time t, and the action a(t) at time t. (Often the problem is formulated with a discount rate γ, but since we’re allowing time-dependent short-term utility here, we can just absorb the discount rate into uS.) The objective is then to maximize
$$u_L(x(T)) + \sum_t u_S(t, x(t), a(t))$$
In that case, the value function V(x,t) is a max over trajectories starting at (x,t):
$$V(x,t) = \max_{\text{trajectory}} \left[ u_L(x(T)) + \sum_{\tau \ge t} u_S(\tau, x(\tau), a(\tau)) \right]$$
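To make this concrete, here’s a minimal backward-induction sketch computing that value function on a toy deterministic problem. The dynamics f, the horizon T, and the particular uL and uS below are illustrative assumptions, not anything from the discussion:

```python
# Toy finite-horizon control problem (all specifics assumed for illustration):
# states 0..2, actions shift the state, horizon T = 3.
T = 3
states = [0, 1, 2]
actions = [-1, 0, 1]

def f(x, a):                 # deterministic dynamics, clipped to the state space
    return min(max(x + a, 0), 2)

def u_L(x):                  # long-term utility of the final state
    return float(x)

def u_S(t, x, a):            # short-term utility: a small cost for moving
    return -0.1 * abs(a)

# Backward induction: V(x, T) = u_L(x), then
# V(x, t) = max over a of [ u_S(t, x, a) + V(f(x, a), t+1) ].
V = {(x, T): u_L(x) for x in states}
for t in reversed(range(T)):
    for x in states:
        V[(x, t)] = max(u_S(t, x, a) + V[(f(x, a), t + 1)] for a in actions)
```

Starting from x = 0 at t = 0, the best plan walks up to state 2, paying two movement costs along the way, so V(0, 0) works out to 2 − 0.2 = 1.8.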
The key thing to notice is that we can solve that equation for uS(t,x(t),a(t)):

$$u_S(t, x(t), a(t)) = V(x(t), t) - \max_{\text{trajectory}} \left[ u_L(x(T)) + \sum_{\tau > t} u_S(\tau, x(\tau), a(\tau)) \right]$$

where the max is now over trajectories starting from the state reached by taking action a(t) at x(t).
So given an arbitrary value function V, we can find a short-term utility function uS which produces that value function by using that equation to compute uS starting from the last timestep and working backwards.
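A sketch of that backward construction on a toy problem (the dynamics f, horizon T, and the random target V are illustrative assumptions): pick an arbitrary table of values, define uS(t,x,a) = V(x,t) − V(f(x,a), t+1), and check that backward induction recovers the target:

```python
# Given an arbitrary target value function V_target, define
# u_S(t, x, a) = V_target(x, t) - V_target(f(x, a), t+1); under deterministic
# dynamics this matches the solved equation, since the continuation max from
# (x, a) equals V at the next state. Toy setup below is assumed.
import random

T = 3
states = [0, 1, 2]
actions = [-1, 0, 1]

def f(x, a):                 # deterministic toy dynamics
    return min(max(x + a, 0), 2)

random.seed(0)
V_target = {(x, t): random.uniform(-1, 1) for x in states for t in range(T + 1)}

def u_L(x):                  # pin the final-state utility to V_target at time T
    return V_target[(x, T)]

def u_S(t, x, a):            # short-term utility recovered from V_target
    return V_target[(x, t)] - V_target[(f(x, a), t + 1)]

# Recompute the value function by backward induction and compare: it
# reproduces V_target (up to floating-point error).
V = {(x, T): u_L(x) for x in states}
for t in reversed(range(T)):
    for x in states:
        V[(x, t)] = max(u_S(t, x, a) + V[(f(x, a), t + 1)] for a in actions)

assert all(abs(V[k] - V_target[k]) < 1e-9 for k in V)
```

Note that with this uS, every action looks equally good at every state, which is exactly why the target value function can be completely arbitrary.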
Thus the claim from the post: for any value function, there exists a short-term utility function which induces that value function.
What if we restrict to long-term utility only, i.e. set uS=0? Then the value function is no longer so arbitrary. That’s the case considered in the post, where the value function must satisfy certain constraints regardless of uL.
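For intuition on what such a constraint looks like: with uS = 0, backward induction forces V(x, t) = max over a of V(f(x, a), t+1) for every t < T, no matter which uL we pick, so a generic table of numbers is not a valid value function. A sketch under assumed toy dynamics (mine, not the post’s):

```python
# With u_S = 0, any induced value function satisfies the Bellman-style
# constraint V(x, t) = max over a of V(f(x, a), t+1), whatever u_L is.
# The dynamics and horizon below are illustrative assumptions.
import random

T = 3
states = [0, 1, 2]
actions = [-1, 0, 1]

def f(x, a):                 # deterministic toy dynamics
    return min(max(x + a, 0), 2)

def satisfies_constraint(V):
    return all(
        V[(x, t)] == max(V[(f(x, a), t + 1)] for a in actions)
        for t in range(T) for x in states
    )

def value_function(u_L):     # backward induction with u_S = 0
    V = {(x, T): u_L(x) for x in states}
    for t in reversed(range(T)):
        for x in states:
            V[(x, t)] = max(V[(f(x, a), t + 1)] for a in actions)
    return V

# Every choice of u_L passes the check; a random table almost surely fails.
assert all(satisfies_constraint(value_function(u_L))
           for u_L in (lambda x: x, lambda x: -x, lambda x: (x - 1) ** 2))

random.seed(0)
V_random = {(x, t): random.random() for x in states for t in range(T + 1)}
assert not satisfies_constraint(V_random)
```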
Did that clarify?
I had to mull over it for five days, hunt down some background materials to fill in context, write follow up questions to a few friends (reviewing responses over phone while commuting), and then slowly chew through the math on pencil and paper when I could get spare time… but yes I understand now!