I don’t understand this part:
"any value function can be maximized by some utility function over short-term outcomes."
What is the difference between far in the future and near in the future?
Here’s what it would typically look like in a control theory problem.
There’s a long-term utility u_L, which is a function of the final state x(T), and a short-term utility u_S, which is a function of the time t, the state x(t) at time t, and the action a(t) at time t. (Often the problem is formulated with a discount rate γ, but since we’re allowing time-dependent short-term utility here, we can just absorb the discount rate into u_S.) The objective is then to maximize
u_L(x(T)) + \sum_t u_S(t, x(t), a(t))
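To make that parenthetical about the discount rate concrete (the per-step utility u here is my own notation, not from the post): if the usual discounted objective is

\sum_t \gamma^t u(x(t), a(t)),

then defining u_S(t, x, a) := \gamma^t u(x, a) recovers it as a special case of the time-dependent short-term utility above.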
In that case, the value function V(x,t) is a max over trajectories starting at (x,t):
V(x, t) = \max_{\text{trajectory}} \left[ u_L(x(T)) + \sum_{\tau \ge t} u_S(\tau, x(\tau), a(\tau)) \right]
The key thing to notice is that we can solve that equation for u_S(t, x(t), a(t)):
u_S(t, x(t), a(t)) = V(x, t) - \max_{\text{trajectory}} \left[ u_L(x(T)) + \sum_{\tau > t} u_S(\tau, x(\tau), a(\tau)) \right]
So given an arbitrary value function V, we can find a short-term utility function u_S which produces that value function, by using that equation to compute u_S starting from the last timestep and working backwards: at each step, the max on the right-hand side only involves u_S at later times, which has already been computed.
Thus the claim from the post: for any value function, there exists a short-term utility function which induces that value function.
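To see the construction actually go through, here’s a minimal sketch in Python (my own toy example, not from the post): a small deterministic control problem with made-up dynamics f, an arbitrary (random) target value function, the backward construction of u_S above, and a check that the value function induced by (u_L, u_S) reproduces the target.

```python
import random

# A minimal sketch of the backward construction (toy setup of my own, not from
# the post): a finite deterministic control problem with 3 states, 2 actions,
# and horizon T = 3. The dynamics f and all names here are made up.
STATES = [0, 1, 2]
ACTIONS = [0, 1]
T = 3  # actions are taken at t = 0, 1, 2; the final state is x(T)

def f(x, a):
    """Hypothetical deterministic dynamics x(t+1) = f(x(t), a(t))."""
    return (x + a + 1) % 3

# Long-term utility of the final state, plus an *arbitrary* target value
# function V_target[t][x] for t < T (random numbers, to stress the claim that
# any value function can be induced).
u_L = {x: float(x) for x in STATES}
random.seed(0)
V_target = {t: {x: random.uniform(-5.0, 5.0) for x in STATES} for t in range(T)}

# Value-to-go from (x, t+1) onward: u_L at the horizon, V_target earlier
# (valid by induction once u_S has been constructed for times > t).
def value_next(t, x):
    return u_L[x] if t + 1 == T else V_target[t + 1][x]

# Backward construction: u_S(t, x, a) = V(x, t) - max_a' [value-to-go from f(x, a')].
u_S = {}
for t in reversed(range(T)):
    for x in STATES:
        best_next = max(value_next(t, f(x, a)) for a in ACTIONS)
        for a in ACTIONS:
            u_S[(t, x, a)] = V_target[t][x] - best_next

# Check: the value function induced by (u_L, u_S) via dynamic programming
# reproduces the arbitrary target exactly.
V_induced = {T: dict(u_L)}
for t in reversed(range(T)):
    V_induced[t] = {
        x: max(u_S[(t, x, a)] + V_induced[t + 1][f(x, a)] for a in ACTIONS)
        for x in STATES
    }

for t in range(T):
    for x in STATES:
        assert abs(V_induced[t][x] - V_target[t][x]) < 1e-9
print("induced value function matches the arbitrary target V")
```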
What if we restrict to only long-term utility, i.e. set u_S = 0? Well, then the value function is no longer so arbitrary. That’s the case considered in the post, where we have constraints which the value function must satisfy regardless of u_L.
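For instance, with deterministic dynamics x(t+1) = f(x(t), a(t), t) (notation mine, not from the post), setting u_S = 0 forces the value function to satisfy

V(x, t) = \max_a V(f(x, a, t), t + 1),

a consistency condition on V which holds no matter what u_L is.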
Did that clarify?
What’s wrong with calling the “short-term utility function” a “reward function”?
“Reward function” is a much more general term, which IMO has been overused to the point where it arguably doesn’t even have a clear meaning. “Utility function” is less general: it always connotes an optimization objective, something which is being optimized for directly. And that basically matches the usage here.
I had to mull over it for five days, hunt down some background materials to fill in context, write follow-up questions to a few friends (reviewing their responses on my phone while commuting), and then slowly chew through the math with pencil and paper when I could get spare time… but yes, I understand now!