Here’s what it would typically look like in a control theory problem.
There’s a long-term utility uL, which is a function of the final state x(T), and a short-term utility uS, which is a function of the time t, the state x(t) at time t, and the action a(t) at time t. (Often the problem is formulated with a discount rate γ, but since we’re allowing time-dependent short-term utility here, we can just absorb the discount rate into uS.) The objective is then to maximize
$$u_L(x(T)) + \sum_t u_S(t, x(t), a(t))$$
In that case, the value function V(x,t) is a max over trajectories starting at (x,t):
$$V(x,t) = \max_{\text{trajectory}} \left[ u_L(x(T)) + \sum_{\tau \ge t} u_S(\tau, x(\tau), a(\tau)) \right]$$
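To make this concrete, here’s a minimal backward-induction sketch computing that value function on a toy deterministic problem. The dynamics f, the horizon T, and the particular uL and uS below are illustrative assumptions, not anything from the discussion:

```python
# Toy finite-horizon control problem (all specifics assumed for illustration):
# states 0..2, actions shift the state, horizon T = 3.
T = 3
states = [0, 1, 2]
actions = [-1, 0, 1]

def f(x, a):                 # deterministic dynamics, clipped to the state space
    return min(max(x + a, 0), 2)

def u_L(x):                  # long-term utility of the final state
    return float(x)

def u_S(t, x, a):            # short-term utility: a small cost for moving
    return -0.1 * abs(a)

# Backward induction: V(x, T) = u_L(x), then
# V(x, t) = max over a of [ u_S(t, x, a) + V(f(x, a), t+1) ].
V = {(x, T): u_L(x) for x in states}
for t in reversed(range(T)):
    for x in states:
        V[(x, t)] = max(u_S(t, x, a) + V[(f(x, a), t + 1)] for a in actions)
```

Starting from x = 0 at t = 0, the best plan walks up to state 2, paying two movement costs along the way, so V(0, 0) works out to 2 − 0.2 = 1.8.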
The key thing to notice is that we can solve that equation for uS(t,x(t),a(t)):

$$u_S(t, x(t), a(t)) = V(x(t), t) - \max_{\text{trajectory}} \left[ u_L(x(T)) + \sum_{\tau > t} u_S(\tau, x(\tau), a(\tau)) \right]$$

where the max is now over trajectories starting from the state reached by taking action a(t) at x(t).
So given an arbitrary value function V, we can find a short-term utility function uS which produces that value function by using that equation to compute uS starting from the last timestep and working backwards.
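A sketch of that backward construction on a toy problem (the dynamics f, horizon T, and the random target V are illustrative assumptions): pick an arbitrary table of values, define uS(t,x,a) = V(x,t) − V(f(x,a), t+1), and check that backward induction recovers the target:

```python
# Given an arbitrary target value function V_target, define
# u_S(t, x, a) = V_target(x, t) - V_target(f(x, a), t+1); under deterministic
# dynamics this matches the solved equation, since the continuation max from
# (x, a) equals V at the next state. Toy setup below is assumed.
import random

T = 3
states = [0, 1, 2]
actions = [-1, 0, 1]

def f(x, a):                 # deterministic toy dynamics
    return min(max(x + a, 0), 2)

random.seed(0)
V_target = {(x, t): random.uniform(-1, 1) for x in states for t in range(T + 1)}

def u_L(x):                  # pin the final-state utility to V_target at time T
    return V_target[(x, T)]

def u_S(t, x, a):            # short-term utility recovered from V_target
    return V_target[(x, t)] - V_target[(f(x, a), t + 1)]

# Recompute the value function by backward induction and compare: it
# reproduces V_target (up to floating-point error).
V = {(x, T): u_L(x) for x in states}
for t in reversed(range(T)):
    for x in states:
        V[(x, t)] = max(u_S(t, x, a) + V[(f(x, a), t + 1)] for a in actions)

assert all(abs(V[k] - V_target[k]) < 1e-9 for k in V)
```

Note that with this uS, every action looks equally good at every state, which is exactly why the target value function can be completely arbitrary.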
Thus the claim from the post: for any value function, there exists a short-term utility function which induces that value function.
What if we restrict to long-term utility only, i.e. set uS=0? Then the value function is no longer so arbitrary. That’s the case considered in the post, where the value function must satisfy certain constraints regardless of uL.
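For intuition on what such a constraint looks like: with uS = 0, backward induction forces V(x, t) = max over a of V(f(x, a), t+1) for every t < T, no matter which uL we pick, so a generic table of numbers is not a valid value function. A sketch under assumed toy dynamics (mine, not the post’s):

```python
# With u_S = 0, any induced value function satisfies the Bellman-style
# constraint V(x, t) = max over a of V(f(x, a), t+1), whatever u_L is.
# The dynamics and horizon below are illustrative assumptions.
import random

T = 3
states = [0, 1, 2]
actions = [-1, 0, 1]

def f(x, a):                 # deterministic toy dynamics
    return min(max(x + a, 0), 2)

def satisfies_constraint(V):
    return all(
        V[(x, t)] == max(V[(f(x, a), t + 1)] for a in actions)
        for t in range(T) for x in states
    )

def value_function(u_L):     # backward induction with u_S = 0
    V = {(x, T): u_L(x) for x in states}
    for t in reversed(range(T)):
        for x in states:
            V[(x, t)] = max(V[(f(x, a), t + 1)] for a in actions)
    return V

# Every choice of u_L passes the check; a random table almost surely fails.
assert all(satisfies_constraint(value_function(u_L))
           for u_L in (lambda x: x, lambda x: -x, lambda x: (x - 1) ** 2))

random.seed(0)
V_random = {(x, t): random.random() for x in states for t in range(T + 1)}
assert not satisfies_constraint(V_random)
```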
Did that clarify?
I had to mull over it for five days, hunt down some background materials to fill in context, write follow up questions to a few friends (reviewing responses over phone while commuting), and then slowly chew through the math on pencil and paper when I could get spare time… but yes I understand now!