Thinking about this more, this doesn’t actually seem very likely for OGD, since there are presumably model parameters controlling how farsighted the agent is (e.g., its discount rate or planning horizon). If so, non-myopic agents are not local optima, and OGD would keep going downhill (to more and more myopic agents) until it reaches a fully myopic agent. Does this seem right to you?
I don’t think that’s quite right. Current RL, at least, relies on the existence of a strict episode boundary past which the agent isn’t supposed to optimize at all: the discount factor is only applied per-step within an episode, and there is no between-episode discount factor. So if you think simple agents are likely to care about things beyond just the episode they’re given, you get non-myopia. In particular, if you put an agent in an environment with a messy episode boundary (e.g. it’s acting in the real world, where its actions in one episode can influence future episodes), I think the natural generalization is for it to keep using something like its discount factor past the artificial episode boundary created by the training process, which gives you non-myopia.
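To spell out what I mean (just a rough sketch, writing $T$ for the episode length and $\gamma$ for the per-step discount): the training process only ever scores the within-episode return
$$G_t = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k},$$
with the sum truncated at the boundary $T$. The non-myopic generalization I’m pointing at is an agent that instead behaves as if it were optimizing
$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k},$$
i.e. it just keeps applying its per-step discount $\gamma$ past the artificial boundary.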
Hmm, I guess I was mostly thinking about non-myopia in the context of using supervised learning (SL) to train a Counterfactual Oracle, which wouldn’t necessarily have steps or a non-zero discount factor within an episode. The easiest way for non-myopia to arise in that context seems to be the Oracle trying to optimize across episodes, using either a between-episode discount factor or just a fixed horizon. But, as I argued, that doesn’t seem to be a local minimum with respect to the current-episode loss, so OGD wouldn’t stop there; it would keep optimizing the Oracle until it’s no longer non-myopic.
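To make the local-minimum claim a bit more precise (under the simplifying assumption that the Oracle’s farsightedness is captured by a single effective between-episode discount parameter $\beta$): if the Oracle behaves as though it were minimizing $\ell_0 + \sum_{k \ge 1} \beta^k \ell_k$ over current and future episode losses, then whenever $\beta > 0$ actually changes its current-episode behavior, it must be sacrificing some current-episode loss $\ell_0$ for influence over future episodes. Since OGD only sees $\ell_0$, it can always improve by nudging $\beta$ downward, so no $\beta > 0$ is a local minimum of the training loss, and the descent shouldn’t stop until $\beta = 0$.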
I’m pretty confused about the context you’re talking about, but why not also use a zero per-step discount factor to rule out the scenario you’re describing and ensure myopia?
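(Concretely, with $\gamma = 0$ the within-episode return above collapses to just the immediate reward $r_t$, so there’s no discounted lookahead left for the agent to generalize past the episode boundary in the first place.)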
ETA: On the other hand, unless we have a general solution to inner alignment, there are so many different ways inner alignment could fail (see here for another example) that we should probably just try to solve inner alignment in general rather than trying to prevent specific failure modes like this one.