A couple of minor corrections: in the definition of $Q_u(h_{<t+n} a_{t+n})$, there shouldn’t be a max over $a_{t+n}$; that’s an input to the function. Another one, and this isn’t quite as clear cut, is that I think $u(h_{t+n:t+n+m})$ should be $u(h_{1:t+n+m})$ in the definition of the Q-value. It seems that you intend $u(h_{t:k})$ to mean all the utility accrued from time $t$ to time $k$, but the utility should be allowed to depend on the entire history of observations. The theoretical reason for this is that “really,” the utility is a function of the state of the universe, and all observations inform the agent’s probability distribution over what universe state it is in, not just the observations that come from the interval of time whose utility it is evaluating. A concrete example is as follows: if an action appeared somewhere in the history that indicated that all observations thereafter were faked, the utility of that segment should reflect that; it should be allowed to depend on the previous observations that contextualize the observations of the interval in question. In other words, a utility function needs to be typed to allow all actions and observations from the whole history to be input to the function.
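To make the typing point concrete, here is a minimal Python sketch of the two signatures. Everything in it is an assumption made for illustration and not taken from the post: the `Step` and `History` types, the `reveal_fake` action, the choice to sum observations as utility, and 0-based indexing (the post's histories are 1-based).

```python
from typing import List, Tuple

# One interaction step: (action, observation). Illustrative types only.
Step = Tuple[str, float]
History = List[Step]

# Hypothetical action signalling that observations from here on are faked.
FAKE = "reveal_fake"


def segment_utility(segment: History) -> float:
    """Utility that sees only the steps inside the evaluated interval."""
    return sum(obs for _, obs in segment)


def full_history_utility(history: History, start: int) -> float:
    """Utility of the steps history[start:], allowed to read the whole history.

    Observations at or after a FAKE action contribute nothing, even when the
    FAKE action happened before the evaluated interval began.
    """
    total = 0.0
    faked = False
    for i, (action, obs) in enumerate(history):
        if action == FAKE:
            faked = True
        if i >= start and not faked:
            total += obs
    return total


history: History = [
    ("noop", 1.0),
    (FAKE, 0.0),   # context that precedes the interval of interest
    ("noop", 5.0),
    ("noop", 5.0),
]

print(segment_utility(history[2:]))      # interval only, blind to the FAKE action
print(full_history_utility(history, 2))  # same interval, with earlier context
```

The segment-typed function cannot distinguish this history from one without the `reveal_fake` step, because that step never reaches its input; the history-typed one can.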
The action one is indeed a typo, thanks!
The second is deliberate: we want this to be about just building favorable strings of observations. It’s fine if this is shallow. That said, we do catch the “fake” case (if you think about it for a while) for utilities that “care”.