It’s not obvious that it doesn’t, until you prove that these algorithms converge to optimizing per-episode rewards.
So when you wrote “When I talk about an episodic learning algorithm, I usually mean one that actually optimizes performance within an episode (like most of the algorithms in common use today, e.g. empirical risk minimization treating episode initial conditions as fixed).” earlier, did you have in mind that most of the algorithms in common use today have already been proven to converge to optimizing per-episode rewards? If so, I didn’t know that background fact and misinterpreted you as a result. Could you or someone else please explicitly confirm or disconfirm this for me?
Yes, most of the algorithms in use today are known to converge, at least approximately, to optimizing per-episode rewards. In most cases it’s relatively clear that the outer optimizer does no optimization across episode boundaries.
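To make the “no optimization across episode boundaries” point concrete, here is a minimal sketch of a standard episodic policy-gradient (REINFORCE) loop on a toy tabular environment. The environment, its `step` dynamics, and all constants are hypothetical, invented purely for illustration; the structural point is that the return used in each gradient step is summed only over the current episode, so rewards from other episodes never enter the update.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS, HORIZON = 3, 2, 5
theta = np.zeros((N_STATES, N_ACTIONS))  # tabular policy logits

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(state, action):
    # Hypothetical toy dynamics: action 1 moves "right"; reward only
    # arrives for pressing 1 in the last state.
    reward = 1.0 if (state == N_STATES - 1 and action == 1) else 0.0
    next_state = min(state + action, N_STATES - 1)
    return next_state, reward

for episode in range(2000):
    # The initial condition is resampled fresh each episode and treated
    # as fixed by the update: nothing ties this episode's gradient to
    # the next episode's initial state.
    state = rng.integers(N_STATES)
    traj = []
    for t in range(HORIZON):
        probs = softmax(theta[state])
        action = rng.choice(N_ACTIONS, p=probs)
        next_state, reward = step(state, action)
        traj.append((state, action, reward))
        state = next_state

    # REINFORCE update: the return G sums rewards from THIS episode only.
    # The objective being ascended is E[per-episode return], so the outer
    # optimizer has no gradient signal that would favor sacrificing reward
    # now to gain reward in a later episode.
    G = sum(r for _, _, r in traj)
    for s, a, _ in traj:
        probs = softmax(theta[s])
        grad_logp = -probs          # d/d(logits) of log pi(a|s) ...
        grad_logp[a] += 1.0         # ... is one_hot(a) - softmax(logits)
        theta[s] += 0.05 * G * grad_logp
```

Under these (toy) assumptions, the episode boundary is visible directly in the code: the gradient estimator is a function of a single episode’s trajectory, which is the sense in which such an algorithm optimizes per-episode rewards.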