paulfchristiano comments on Counterfactual Oracles = online supervised learning with random selection of training episodes

paulfchristiano 11 Sep 2019 19:06 UTC
LW: 4 AF: 3
Yes, most of the algorithms in use today are known to converge, or roughly converge, to optimizing per-episode rewards. In most cases it’s relatively clear that the outer optimizer does no optimization across episode boundaries.
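
To make "no optimization across episode boundaries" concrete, here is a minimal sketch, assuming a policy-gradient style learner that credits each action with its return-to-go. The function name `per_episode_returns` and the `dones` convention (True on the last step of an episode) are hypothetical choices for illustration, not taken from any particular library. The point is only that the running return is reset at each boundary, so reward from a later episode never enters the training signal for an earlier one.

```python
def per_episode_returns(rewards, dones, gamma=1.0):
    """Return-to-go for every step, reset at each episode boundary.

    rewards: list of per-step rewards, episodes concatenated in order.
    dones:   list of bools, True on the final step of an episode (assumed convention).
    gamma:   discount factor applied within an episode.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can accumulate the rewards that follow it.
    for t in reversed(range(len(rewards))):
        if dones[t]:
            # Final step of an episode: drop anything accumulated from later episodes.
            running = 0.0
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Two episodes of two steps each.
dones = [False, True, False, True]
small = per_episode_returns([1, 2, 3, 10], dones)
large = per_episode_returns([1, 2, 3, 100], dones)
```

Changing the reward in the second episode changes only the second episode's returns; the first two entries of `small` and `large` are identical, so a learner trained on these targets has no gradient pushing earlier-episode actions toward later-episode reward.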