For example, consider the following episodic learning algorithm
When I talk about an episodic learning algorithm, I usually mean one that actually optimizes performance within an episode (like most of the algorithms in common use today, e.g. empirical risk minimization treating episode initial conditions as fixed). The algorithm you described doesn’t seem like an “episodic” learning algorithm, given that it optimizes total performance (and essentially ignores episode boundaries).
(This comment has been heavily edited after posting.)
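To make “optimizes performance within an episode” concrete, here is a minimal sketch (my own illustration, not something from this thread): a REINFORCE-style update whose gradient only sees rewards collected between one reset and the next termination, with the reset state treated as a fixed initial condition.

```python
import numpy as np

def reinforce_episodic_update(theta, episode, learning_rate=0.01):
    """One REINFORCE-style policy-gradient update from a single episode.

    `episode` is a list of (reward, grad_log_pi) pairs collected between one
    environment reset and the next termination; `grad_log_pi` is the gradient
    of log pi_theta(a_t | s_t), shaped like `theta`.  Rewards from other
    episodes never enter this gradient, and the reset state is treated as a
    fixed initial condition rather than something the update tries to shape.
    """
    rewards = np.array([r for r, _ in episode])
    # Return-to-go within this episode only: G_t = r_t + r_{t+1} + ... + r_{T-1}
    returns = np.cumsum(rewards[::-1])[::-1]
    grad = sum(g * G for (_, g), G in zip(episode, returns))
    return theta + learning_rate * grad
```

The only point of the sketch is that nothing outside the reset-to-termination window can influence the update; whether a particular real algorithm actually has that property is the question discussed below.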
What’s an algorithm, or instructions for a human, for determining whether a learning algorithm is “episodic” or not? For example, it wasn’t obvious to me that Ofer’s algorithm isn’t episodic, and I had to think for a while (mentally simulate his algorithm) to see that what he said is correct. Is there a shortcut to figuring out whether a learning algorithm is episodic without having to run or simulate the algorithm? You mention “ignores episode boundaries”, but I don’t see how to tell that Ofer’s algorithm ignores episode boundaries, since it seems to be looking only at the current episode’s performance when making a decision.
How do you even tell that an algorithm is optimizing something?
In most cases we have some argument that an algorithm is optimizing the episodic reward, and it just comes down to the details of that argument.
If you are concerned with optimization that isn’t necessarily intended and wondering how to more effectively look out for it, it seems like you should ask “would a policy that has property P be more likely to be produced under this algorithm?” For P = “takes actions that lead to high rewards in future episodes”, the answer is clearly yes, since any policy that persists for a long time necessarily has property P (though of course it’s unclear if the algorithm works at all). For normal RL algorithms there’s not any obvious mechanism by which this would happen. It’s not obvious that it doesn’t, until you prove that these algorithms converge to optimizing per-episode rewards. I don’t see any mechanical way to test that (just like I don’t see any mechanical way to test almost any property that we talk about in almost any argument about anything).
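As a toy illustration of the second half of that point (the environment and learner below are my own construction, not Ofer’s algorithm): give a bandit’s actions a cross-episode side effect and train a learner that only tracks average per-episode reward. The learner settles on the myopic action, because the carried-over bonus is independent of the current action and therefore provides no signal across the episode boundary.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_episode(action, carried_bonus):
    """One-step 'episode' with a deliberate cross-episode side effect.

    Action 0: reward 1 this episode, leaves nothing behind.
    Action 1: reward 0 this episode, but adds 2 to the next episode's reward.
    """
    reward = (1.0 if action == 0 else 0.0) + carried_bonus
    next_bonus = 2.0 if action == 1 else 0.0
    return reward, next_bonus

# Epsilon-greedy bandit learner that only tracks average per-episode reward.
q = np.zeros(2)        # running average of per-episode reward for each action
counts = np.zeros(2)
bonus, epsilon = 0.0, 0.1
for _ in range(20_000):
    a = rng.integers(2) if rng.random() < epsilon else int(np.argmax(q))
    r, bonus = run_episode(a, bonus)
    counts[a] += 1
    q[a] += (r - q[a]) / counts[a]

print(q, np.argmax(q))
# q[0] ends up highest, so the learner prefers the myopic action 0, even though
# a policy that always picks action 1 would earn more total reward across episodes.
```

A selection rule that simply keeps whichever policy accumulates the most reward over many episodes would, by contrast, favor policies with property P in this environment.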
It’s not obvious that it doesn’t, until you prove that these algorithms converge to optimizing per-episode rewards.
So when you wrote “When I talk about an episodic learning algorithm, I usually mean one that actually optimizes performance within an episode (like most of the algorithms in common use today, e.g. empirical risk minimization treating episode initial conditions as fixed).” earlier, you had in mind that most of the algorithms in common use today have already been proven to converge to optimizing per-episode rewards? If so, I didn’t know that background fact and misinterpreted you as a result. Can you or someone else please explicitly confirm or disconfirm this for me?
Yes, most of the algorithms in use today are known to converge or roughly converge to optimizing per-episode rewards. In most cases it’s relatively clear that there is no optimization across episode boundaries (by the outer optimizer).
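One way to pin down the distinction being confirmed here (my own notation, not from the thread): write the episodic objective with the reset distribution held fixed, and the cross-episode objective with later episodes’ initial states depending on earlier behavior.

```latex
% Episodic objective: initial state drawn from a fixed reset distribution d_0;
% only within-episode reward is maximized.
J_{\mathrm{episodic}}(\theta)
  = \mathbb{E}_{s_0 \sim d_0}\,
    \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T-1} r_t\right]

% Cross-episode objective: episode k+1's initial state may depend on what the
% policy did in episode k, and that dependence is part of what gets optimized.
J_{\mathrm{total}}(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[\sum_{k=1}^{K}\sum_{t=0}^{T-1} r^{(k)}_t\right],
\qquad
s^{(k+1)}_0 \sim p\!\left(\cdot \mid \text{trajectory of episode } k\right)
```

“No optimization across episode boundaries (by the outer optimizer)” is then the claim that the update rule ascends something like the first objective rather than the second.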