It’s not obvious that it doesn’t, until you prove that these algorithms converge to optimizing per-episode rewards.
So when you wrote “When I talk about an episodic learning algorithm, I usually mean one that actually optimizes performance within an episode (like most of the algorithms in common use today, e.g. empirical risk minimization treating episode initial conditions as fixed).” earlier, did you have in mind that most of the algorithms in common use today have already been proven to converge to optimizing per-episode rewards? If so, I didn’t know that background fact and misinterpreted you as a result. Could you or someone else please explicitly confirm or disconfirm this for me?
Yes, most of the algorithms in use today are known to converge, at least approximately, to optimizing per-episode rewards. In most cases it’s relatively clear that the outer optimizer does no optimization across episode boundaries.
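To make the “no optimization across episode boundaries” point concrete, here is a minimal sketch of a standard episodic policy-gradient (REINFORCE) loop on a toy tabular environment. The environment, its `step` dynamics, and all constants are hypothetical, invented purely for illustration; the structural point is that the return used in each gradient step is summed only over the current episode, so rewards from other episodes never enter the update.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS, HORIZON = 3, 2, 5
theta = np.zeros((N_STATES, N_ACTIONS))  # tabular policy logits

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(state, action):
    # Hypothetical toy dynamics: action 1 moves "right"; reward only
    # arrives for pressing 1 in the last state.
    reward = 1.0 if (state == N_STATES - 1 and action == 1) else 0.0
    next_state = min(state + action, N_STATES - 1)
    return next_state, reward

for episode in range(2000):
    # The initial condition is resampled fresh each episode and treated
    # as fixed by the update: nothing ties this episode's gradient to
    # the next episode's initial state.
    state = rng.integers(N_STATES)
    traj = []
    for t in range(HORIZON):
        probs = softmax(theta[state])
        action = rng.choice(N_ACTIONS, p=probs)
        next_state, reward = step(state, action)
        traj.append((state, action, reward))
        state = next_state

    # REINFORCE update: the return G sums rewards from THIS episode only.
    # The objective being ascended is E[per-episode return], so the outer
    # optimizer has no gradient signal that would favor sacrificing reward
    # now to gain reward in a later episode.
    G = sum(r for _, _, r in traj)
    for s, a, _ in traj:
        probs = softmax(theta[s])
        grad_logp = -probs          # d/d(logits) of log pi(a|s) ...
        grad_logp[a] += 1.0         # ... is one_hot(a) - softmax(logits)
        theta[s] += 0.05 * G * grad_logp
```

Under these (toy) assumptions, the episode boundary is visible directly in the code: the gradient estimator is a function of a single episode’s trajectory, which is the sense in which such an algorithm optimizes per-episode rewards.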