A myopic objective requires an extra distinction to say “don’t continue past the end of the episode”
Something about online learning
The online learning argument is actually super complicated and dependent on a bunch of factors, so I’m not going to summarize that one here. So I’ve just added the other one:
Even if training on a myopic base objective, we might expect the mesa objective to be non-myopic, as the non-myopic objective “pursue X” is simpler than the myopic objective “pursue X until time T”.
Also my biggest take-away was the argument for why we shouldn’t expect myopia by default. But perhaps this was already obvious to others.
So my understanding is there are two arguments:
A myopic objective requires an extra distinction to say “don’t continue past the end of the episode”
Something about online learning
The online learning argument is actually super complicated and dependent on a bunch of factors, so I’m not going to summarize that one here. So I’ve just added the other one: