Interesting… it seems that this doesn’t necessarily happen if we use online gradient descent instead, because the loss gradient (computed for a single episode) ought to lead away from model parameters that would increase the loss for the current episode and reduce it for future episodes. Is that right, and if so, how can we think more generally about what kinds of learning algorithms will produce episodic optimizers vs cross-episodic optimizers?
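To make that concrete, here is a minimal runnable sketch of the kind of training loop I have in mind (the linear model, squared-error loss, and all names are placeholders chosen purely for illustration, not any particular system): each gradient step is computed from the current episode’s loss alone.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=3)  # toy model parameters

def episode_loss_and_grad(theta, x, y):
    """Squared-error loss on a single episode's data, and its gradient w.r.t. theta."""
    err = theta @ x - y
    return err ** 2, 2 * err * x

lr = 0.01
for episode in range(1000):
    # Each episode supplies its own data; the update below never sees any
    # other episode's loss, past or future.
    x = rng.normal(size=3)
    y = x.sum()  # toy target
    loss, grad = episode_loss_and_grad(theta, x, y)
    theta -= lr * grad  # online gradient descent: one step per episode
```

Since the update direction is just the negative gradient of the current episode’s loss, I’d naively expect it not to favor parameters that sacrifice this episode for later ones.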
Also, what name would you suggest for this problem, if not “inner alignment”? (“Inner alignment” actually seems fine to me, but maybe I can be persuaded that it should be called something else instead.)
I call this problem “non-myopia,” which I think interestingly has both an outer alignment component and an inner alignment component:
If you train using something like population-based training that explicitly incentivizes cross-episode performance, then the resulting non-myopia was an outer alignment failure.
Alternatively, if you train using standard RL/SL/etc. without any PBT, but still get non-myopia, then that’s an inner alignment failure. And I find this failure mode quite plausible: even if your training process isn’t explicitly incentivizing non-myopia, it might be that non-myopic agents are simpler/more natural/easier to find/etc. such that your inductive biases still incentivize them.
even if your training process isn’t explicitly incentivizing non-myopia, it might be that non-myopic agents are simpler/more natural/easier to find/etc. such that your inductive biases still incentivize them.
Oh, so even online gradient descent could generate non-myopic agents with large (or non-negligible) probability because non-myopic agents could be local optima for “current episode performance” and their basins of attraction collectively could be large (or non-negligible) compared to the basins of attraction for myopic agents. So starting with random model parameters one might well end up at a non-myopic agent through online gradient descent. Is this an example of what you mean?
Thinking about this more, this doesn’t actually seem very likely for OGD, since there are likely to be model parameters controlling how farsighted the agent is (e.g., its discount rate or planning horizon). So it seems like non-myopic agents are not local optima, and OGD would keep going downhill (to more and more myopic agents) until it reaches a fully myopic agent. Does this seem right to you?
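To put that argument in symbols (with the assumption, which is doing real work here, that farsightedness is controlled by something like a single trainable parameter):

```latex
% Split the parameters as (\phi, \gamma), where \gamma controls how farsighted
% the agent is, and let L_t(\phi, \gamma) be the loss on the current episode t.
% Assumption: holding \phi fixed, more farsightedness hurts the current episode,
% i.e. \partial L_t / \partial \gamma > 0 whenever \gamma > 0. Then each online
% update
\gamma \;\leftarrow\; \gamma \,-\, \eta\, \frac{\partial L_t}{\partial \gamma}
% decreases \gamma, so no non-myopic setting of \gamma is a local optimum of the
% current-episode loss, and OGD keeps descending toward a fully myopic agent.
```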
Thinking about this more, this doesn’t actually seem very likely for OGD, since there are likely to be model parameters controlling how farsighted the agent is (e.g., its discount rate or planning horizon). So it seems like non-myopic agents are not local optima, and OGD would keep going downhill (to more and more myopic agents) until it reaches a fully myopic agent. Does this seem right to you?
I don’t think that’s quite right. At least if you look at current RL, it relies on the existence of a strict episode boundary past which the agent isn’t supposed to optimize at all. The discount factor is only per-step within an episode; there isn’t any between-episode discount factor. Thus, if you think that simple agents are likely to care about things beyond just the episode that they’re given, then you get non-myopia. In particular, if you put an agent in an environment with a messy episode boundary (e.g. it’s in the real world such that its actions in one episode have the ability to influence its actions in future episodes), I think the natural generalization for an agent in that situation is to keep using something like its discount factor past the artificial episode boundary created by the training process, which gives you non-myopia.
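For reference, here is the objective I mean in standard episodic-RL notation (nothing here is specific to any particular algorithm): discounting is per-step but the return is truncated at the episode boundary, and the natural generalization I’m pointing at simply drops the truncation.

```latex
% Standard episodic objective: per-step discount \gamma, return truncated at
% the episode boundary T.
G_t \;=\; \sum_{k=0}^{T - t} \gamma^{k}\, r_{t+k}
% The generalization an agent facing a messy boundary might land on: keep
% applying the same discount factor past T, i.e. non-myopia.
G_t^{\text{non-myopic}} \;=\; \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}
```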
Hmm, I guess I was mostly thinking about non-myopia in the context of using SL to train a Counterfactual Oracle, which wouldn’t necessarily have steps or a non-zero discount factor within an episode. It seems like the easiest way for non-myopia to arise in this context is if the Oracle tries to optimize across episodes using a between-episode discount factor or just a fixed horizon. But, as I argued above, this doesn’t seem to be a local minimum with respect to current-episode loss, so it seems like OGD wouldn’t stop here but would keep optimizing the Oracle until it is no longer non-myopic.
I’m pretty confused about the context that you’re talking about, but why not also have a zero per-step discount factor to try to rule out the scenario you’re describing, in order to ensure myopia?
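(Spelled out: with a zero per-step discount factor, the within-episode return above collapses to the immediate reward alone, so the intended objective gives the agent nothing beyond the current step, let alone the current episode, to optimize for.)

```latex
% With \gamma = 0 (taking 0^0 = 1 for the k = 0 term), the return reduces to
G_t\big|_{\gamma = 0} \;=\; r_t
```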
ETA: On the other hand, unless we have a general solution to inner alignment, there are so many different ways that inner alignment could fail to be achieved (see here for another example) that we should probably just try to solve inner alignment in general and not try to prevent specific failure modes like this.
Interesting… it seems that this doesn’t necessarily happen if we use online gradient descent instead, because the loss gradient (computed for a single episode) ought to lead away from model parameters that would increase the loss for the current episode and reduce it for future episodes.
I agree; my reasoning above does not apply to gradient descent (I misunderstood this point before reading your comment).
I think it still applies to evolutionary algorithms (which might end up being relevant).
how can we think more generally about what kinds of learning algorithms will produce episodic optimizers vs cross-episodic optimizers?
Maybe learning algorithms that have the following property are more likely to yield models with “cross-episodic behavior”:
During training, a parameter’s value is more likely to persist (i.e. end up in the final model) if it causes behavior that is beneficial for future episodes.
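A toy sketch of what that selection pressure can look like (the environment, fitness function, and every name here are made up purely for illustration): in the evolutionary-style loop below, whether a parameter vector survives depends on reward accumulated across several consecutive episodes of its “lifetime,” so a parameter value that sacrifices one episode to set up later ones can still persist, which is not true of a per-episode gradient update.

```python
import numpy as np

rng = np.random.default_rng(0)

def episode_reward(theta, state):
    """Hypothetical per-episode reward; `state` carries influence across episodes."""
    reward = -np.sum((theta - state) ** 2)   # toy objective for this episode
    next_state = 0.9 * state + 0.1 * theta   # this episode shapes the next one
    return reward, next_state

def lifetime_fitness(theta, n_episodes=5):
    """Fitness summed over several consecutive episodes: cross-episode selection pressure."""
    state, total = np.zeros_like(theta), 0.0
    for _ in range(n_episodes):
        reward, state = episode_reward(theta, state)
        total += reward
    return total

# Simple evolutionary loop: a parameter vector persists iff it scores well
# across episodes, not merely on each episode taken in isolation.
population = [rng.normal(size=4) for _ in range(20)]
for generation in range(50):
    ranked = sorted(population, key=lifetime_fitness, reverse=True)
    survivors = ranked[:10]                                        # selection
    children = [s + 0.05 * rng.normal(size=4) for s in survivors]  # mutation
    population = survivors + children
```

This is also, as I understand it, why population-based training explicitly incentivizes cross-episode performance: its selection step keeps whichever population members did best over a stretch of training that spans many episodes.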
Also, what name would you suggest for this problem, if not “inner alignment”?
Maybe “non-myopia,” as Evan suggested.