A myopic objective requires an extra distinction to say “don’t continue past the end of the episode”
Something about online learning
The online learning argument is actually super complicated and dependent on a bunch of factors, so I’m not going to summarize that one here. So I’ve just added the other one:
Even if training on a myopic base objective, we might expect the mesa objective to be non-myopic, as the non-myopic objective “pursue X” is simpler than the myopic objective “pursue X until time T”.
So my understanding is there are two arguments:
A myopic objective requires an extra distinction to say “don’t continue past the end of the episode”
Something about online learning
The online learning argument is actually super complicated and dependent on a bunch of factors, so I’m not going to summarize that one here. So I’ve just added the other one: