Hmm, I guess I was mostly thinking about non-myopia in the context of using SL to train a Counterfactual Oracle, which wouldn’t necessarily have steps or a non-zero discount factor within an episode. It seems like the easiest way for non-myopia to arise in this context is if the Oracle tries to optimize across episodes, using either a between-episode discount factor or just a fixed horizon. But as I argued, this doesn’t seem to be a local minimum with regard to current-episode loss, so it seems like OGD wouldn’t stop here but would keep optimizing the Oracle until it is no longer non-myopic.
I’m pretty confused about the context you’re talking about, but to ensure myopia, why not also use a zero per-step discount factor, to rule out the scenario you’re describing?
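To make the two discount factors being discussed concrete, here is a minimal sketch (the function name and parameters are my own illustration, not anything from the thread): an objective with a per-step discount `gamma` within each episode and a between-episode discount `big_gamma` across episodes. Non-myopia corresponds to `big_gamma > 0` (or an explicit multi-episode horizon); setting both discounts to zero collapses the objective to the first reward of the current episode.

```python
def discounted_objective(episodes, gamma, big_gamma):
    """Sum of rewards discounted within and across episodes.

    episodes: list of reward lists, current episode first.
    gamma: per-step discount within an episode.
    big_gamma: between-episode discount across episodes.
    """
    total = 0.0
    for e, rewards in enumerate(episodes):
        # Within-episode discounted return (gamma = 0 keeps only the
        # first reward, since gamma ** 0 == 1).
        episode_return = sum(gamma ** t * r for t, r in enumerate(rewards))
        # big_gamma = 0 keeps only the current episode's term.
        total += big_gamma ** e * episode_return
    return total

episodes = [[1.0, 2.0], [10.0, 20.0]]
discounted_objective(episodes, 0.0, 0.0)  # fully myopic: 1.0
discounted_objective(episodes, 0.0, 1.0)  # cross-episode, hence non-myopic: 11.0
```

Of course, this only describes the training objective; the worry in the thread is that the learned model's internal objective might include a nonzero `big_gamma`-like term even when the training setup does not.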
ETA: On the other hand, unless we have a general solution to inner alignment, there are so many different ways that inner alignment could fail to be achieved (see here for another example) that we should probably just try to solve inner alignment in general and not try to prevent specific failure modes like this.