[EDIT (2019-11-09): I no longer think that the argument I made here (about a theoretical learning algorithm) applies to common practical learning algorithms; see here (H/T Abram for showing me that my reasoning was wrong).]
Inner alignment—The ML training process may not produce a model that actually optimizes for what we intend for it to optimize for (namely minimizing loss for just the current episode, conditional on the current episode being selected as a training episode).
If the trained model tries to minimize loss in future episodes, it definitely seems dangerous, but I’m not sure that we should consider this an inner-alignment failure. In some sense we got the behavior that our episodic learning algorithm was optimizing for.
For example, consider the following episodic learning algorithm: At the end of each episode, if the model fails to achieve the episode’s goal, its network parameters are completely randomized (and if it achieves the goal, the model is left unchanged). If we run this learning algorithm for an arbitrarily long time, we should expect to end up with a model that behaves in a way that results in achieving the goal in every future episode (if such a model exists).
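To make the algorithm above concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the "model" has a single parameter (a choice between two hypothetical actions, `leave_mess` and `tidy_up`), and success in the current episode depends only on the state left behind by the previous episode. Under those assumptions, the only model that can persist is the one that sets up future episodes for success, even though its action never affects the outcome of the episode it is taken in.

```python
import random

# Hypothetical actions for a one-parameter toy "model".
ACTIONS = ("leave_mess", "tidy_up")


def run_randomize_on_failure(num_episodes, seed=0):
    """Simulate the randomize-on-failure algorithm in a toy environment.

    Assumption: the goal of an episode is achieved iff the state inherited
    from the previous episode is OK; the current action only affects the
    state handed to the next episode.
    """
    rng = random.Random(seed)
    policy = rng.choice(ACTIONS)  # random initial parameters
    state_ok = True               # environment state carried across episodes
    for _ in range(num_episodes):
        succeeded = state_ok                # outcome depends on inherited state only
        state_ok = (policy == "tidy_up")    # action shapes the next episode's state
        if not succeeded:
            policy = rng.choice(ACTIONS)    # failure: completely re-randomize
    return policy


if __name__ == "__main__":
    print(run_randomize_on_failure(1000, seed=0))
```

In this toy setup `leave_mess` cannot persist, because it guarantees a failure (and hence a re-randomization) in the following episode, while `tidy_up` becomes permanent once it inherits an OK state. A criterion that looked only at the current episode would have no reason to prefer either action.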
Interesting… it seems that this doesn’t necessarily happen if we use online gradient descent instead, because the loss gradient (computed for a single episode) ought to lead away from model parameters that would increase the loss for the current episode and reduce it for future episodes. Is that right, and if so, how can we think more generally about what kinds of learning algorithms will produce episodic optimizers vs cross-episodic optimizers?
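One way to picture the gradient descent point is a hypothetical one-parameter problem, sketched below. The assumptions are all invented for illustration: the current-episode loss is `theta**2` plus a cost inherited from the previous episode, and the cost handed to the next episode is `c * (1 - theta)**2`. Because the inherited cost is a constant from the point of view of the current episode, the single-episode gradient is just `2 * theta`, and it ignores the effect of `theta` on future episodes.

```python
def current_episode_grad(theta):
    """Gradient of the current-episode loss theta**2 + inherited_cost.

    The inherited cost was fixed by the previous episode, so it contributes
    nothing to the gradient with respect to the current theta.
    """
    return 2.0 * theta


def steady_state_loss(theta, c=4.0):
    """Per-episode loss if the same theta is used in every episode
    (own cost theta**2 plus the cost inherited from the previous episode)."""
    return theta ** 2 + c * (1.0 - theta) ** 2


def online_gradient_descent(theta=0.8, lr=0.1, num_episodes=200):
    """Take one gradient step per episode on the current-episode loss only."""
    for _ in range(num_episodes):
        theta -= lr * current_episode_grad(theta)
    return theta


if __name__ == "__main__":
    theta = online_gradient_descent()
    print(theta, steady_state_loss(theta), steady_state_loss(0.8))
```

With `c = 4`, the cross-episode optimum is `theta = c / (1 + c) = 0.8`, which is where the sketch starts. The per-episode gradient still drives `theta` toward 0, the choice that is best for the current episode alone, even though the long-run per-episode loss there is higher.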
Also, what name would you suggest for this problem, if not “inner alignment”? (“Inner alignment” actually seems fine to me, but maybe I can be persuaded that it should be called something else instead.)
I call this problem “non-myopia,” which I think interestingly has both an outer alignment component and an inner alignment component:
If you train using something like population-based training that explicitly incentivizes cross-episode performance, then the resulting non-myopia was an outer alignment failure.
Alternatively, if you train using standard RL/SL/etc. without any PBT, but still get non-myopia, then that’s an inner alignment failure. And I find this failure mode quite plausible: even if your training process isn’t explicitly incentivizing non-myopia, it might be that non-myopic agents are simpler/more natural/easier to find/etc. such that your inductive biases still incentivize them.
Oh, so even online gradient descent could generate non-myopic agents with large (or non-negligible) probability because non-myopic agents could be local optima for “current episode performance” and their basins of attraction collectively could be large (or non-negligible) compared to the basins of attraction for myopic agents. So starting with random model parameters one might well end up at a non-myopic agent through online gradient descent. Is this an example of what you mean?
Thinking about this more, this doesn’t actually seem very likely for OGD since there are likely to be model parameters controlling how farsighted the agent is (e.g., its discount rate or planning horizon) so it seems like non-myopic agents are not local optima and OGD would keep going downhill (to more and more myopic agents) until it gets to a fully myopic agent. Does this seem right to you?
I don’t think that’s quite right. At least if you look at current RL, it relies on the existence of a strict episode boundary past which the agent isn’t supposed to optimize at all. The discount factor is only per-step within an episode; there isn’t any between-episode discount factor. Thus, if you think that simple agents are likely to care about things beyond just the episode that they’re given, then you get non-myopia. In particular, if you put an agent in an environment with a messy episode boundary (e.g. it’s in the real world such that its actions in one episode have the ability to influence its actions in future episodes), I think the natural generalization for an agent in that situation is to keep using something like its discount factor past the artificial episode boundary created by the training process, which gives you non-myopia.
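The role of the episode boundary can be illustrated with a generic discounted-return computation. This is a sketch and not any particular library's implementation; the function name and the `dones` convention (True on the last step of an episode) are assumptions for illustration. The discount factor `gamma` applies per step, and the running return is reset at each boundary, so nothing after the boundary counts toward earlier steps.

```python
def episodic_returns(rewards, dones, gamma):
    """Discounted return at each step, reset at episode boundaries.

    rewards: list of per-step rewards over several concatenated episodes.
    dones:   list of bools, True if that step is the last of its episode.
    gamma:   per-step discount factor used within an episode.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if dones[t]:
            running = 0.0  # rewards after the boundary are not credited
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0, 1.0]
    # Two episodes of two steps each.
    print(episodic_returns(rewards, [False, True, False, True], 0.5))
    # Same rewards with the boundaries ignored.
    print(episodic_returns(rewards, [False, False, False, False], 0.5))
```

With the boundaries respected, the returns are `[1.5, 1.0, 1.5, 1.0]`; with them ignored, the same discount factor gives `[1.875, 1.75, 1.5, 1.0]`. The second case is roughly what it would look like for an agent to keep applying its within-episode discount factor past the artificial boundary.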
Hmm, I guess I was mostly thinking about non-myopia in the context of using SL to train a Counterfactual Oracle, which wouldn’t necessarily have steps or a non-zero discount factor within an episode. It seems like the easiest way for non-myopia to arise in this context is if the Oracle tries to optimize across episodes using a between-episode discount factor or just a fixed horizon. But as I argued this doesn’t seem to be a local minimum with regard to current episode loss so it seems like OGD wouldn’t stop here but would keep optimizing the Oracle until it’s not non-myopic anymore.
I’m pretty confused about the context that you’re talking about, but why not also have a zero per-step discount factor to try to rule out the scenario you’re describing, in order to ensure myopia?
ETA: On the other hand, unless we have a general solution to inner alignment, there are so many different ways that inner alignment could fail to be achieved (see here for another example) that we should probably just try to solve inner alignment in general and not try to prevent specific failure modes like this.
Interesting… it seems that this doesn’t necessarily happen if we use online gradient descent instead, because the loss gradient (computed for a single episode) ought to lead away from model parameters that would increase the loss for the current episode and reduce it for future episodes.
I agree, my reasoning above does not apply to gradient descent (I misunderstood this point before reading your comment).
I think it still applies to evolutionary algorithms (which might end up being relevant).
how can we think more generally about what kinds of learning algorithms will produce episodic optimizers vs cross-episodic optimizers?
Maybe learning algorithms that have the following property are more likely to yield models with “cross-episodic behavior”:
During training, a parameter’s value is more likely to persist (i.e. end up in the final model) if it causes behavior that is beneficial for future episodes.
Also, what name would you suggest for this problem, if not “inner alignment”?
For example, consider the following episodic learning algorithm
When I talk about an episodic learning algorithm, I usually mean one that actually optimizes performance within an episode (like most of the algorithms in common use today, e.g. empirical risk minimization treating episode initial conditions as fixed). The algorithm you described doesn’t seem like an “episodic” learning algorithm, given that it optimizes total performance (and essentially ignores episode boundaries).
(This comment has been heavily edited after posting.)
What’s an algorithm, or instructions for a human, for determining whether a learning algorithm is “episodic” or not? For example it wasn’t obvious to me that Ofer’s algorithm isn’t episodic and I had to think for a while (mentally simulate his algorithm) to see that what he said is correct. Is there a shortcut to figuring out whether a learning algorithm is episodic without having to run or simulate the algorithm? You mention “ignores episode boundaries” but I don’t see how to tell that Ofer’s algorithm ignores episode boundaries since it seems to be just looking at the current episode’s performance when making a decision.
How do you even tell that an algorithm is optimizing something?
In most cases we have some argument that an algorithm is optimizing the episodic reward, and it just comes down to the details of that argument.
If you are concerned with optimization that isn’t necessarily intended and wondering how to more effectively look out for it, it seems like you should ask “would a policy that has property P be more likely to be produced under this algorithm?” For P = “takes actions that lead to high rewards in future episodes” the answer is clearly yes, since any policy that persists for a long time necessarily has property P (though of course it’s unclear if the algorithm works at all). For normal RL algorithms there’s not any obvious mechanism by which this would happen. It’s not obvious that it doesn’t, until you prove that these algorithms converge to optimizing per-episode rewards. I don’t see any mechanical way to test that (just like I don’t see any mechanical way to test almost any property that we talk about in almost any argument about anything).
It’s not obvious that it doesn’t, until you prove that these algorithms converge to optimizing per-episode rewards.
So when you wrote “When I talk about an episodic learning algorithm, I usually mean one that actually optimizes performance within an episode (like most of the algorithms in common use today, e.g. empirical risk minimization treating episode initial conditions as fixed).” earlier, you had in mind that most of the algorithms in common use today have already been proven to converge to optimizing per-episode rewards? If so, I didn’t know that background fact and misinterpreted you as a result. Can you or someone else please explicitly confirm or disconfirm this for me?
Yes, most of the algorithms in use today are known to converge or roughly converge to optimizing per-episode rewards. In most cases it’s relatively clear that there is no optimization across episode boundaries (by the outer optimizer).
As for a name for this problem: maybe “non-myopia,” as Evan suggested.