The equivalence doesn’t just hold in a few cases: in fact, every function M which myopically assigns a value to all state-action pairs is the optimal Q-function for some reward function. So for any myopic training setup there’s an equivalent nonmyopic training setup, namely the one with reward function R(s,a,s′) = M(s,a) − λ·max_{a′} M(s′,a′), where λ is the discount factor. (Substituting this R into the Bellman optimality equation shows that M is a fixed point of the Bellman optimality operator, and hence equals the unique optimal Q-function.)
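To make this concrete, here's a minimal numerical sketch (my own illustration, not part of the original argument): for a small random MDP with arbitrarily chosen sizes and discount λ = 0.9, we build R from an arbitrary "myopic" table M as above and check that Q-value iteration under R converges back to M.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, lam = 5, 3, 0.9  # hypothetical sizes; lam is the discount factor

# Arbitrary myopic state-action values M and arbitrary transition dynamics P.
M = rng.normal(size=(n_states, n_actions))
P = rng.random(size=(n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)  # P[s, a, s'] = probability of reaching s'

# Reward constructed so that M becomes the optimal Q-function:
# R[s, a, s'] = M[s, a] - lam * max_a' M[s', a']
R = M[:, :, None] - lam * M.max(axis=1)[None, None, :]

# Standard Q-value iteration: Q <- E_{s'}[R + lam * max_a' Q(s', a')].
Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    Q = (P * (R + lam * Q.max(axis=1)[None, None, :])).sum(axis=-1)

print(np.allclose(Q, M, atol=1e-6))  # True: the optimal Q-function equals M
```

The check succeeds regardless of how M or the dynamics are chosen, which is exactly the point: the constructed reward rationalizes any myopic value assignment.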
As an aside, you can’t rationalize all M like this if you restrict yourself to state-/outcome-based reward functions, which is relevant to the main point of the section.