So, if a distribution shift causes a problem but there’s no inner optimizer, would that count as an inner alignment problem?
Note that I’m only defining outer alignment here, not inner alignment. I think inner alignment is actually more difficult to define than outer alignment. In “Risks from Learned Optimization,” we define it as whether the mesa-optimizer’s mesa-objective is aligned with the base objective, but that definition only makes sense if you actually have a mesa-optimizer. If you want a more general definition, you could just define inner alignment as whether the model actually produced by the training process has good behavior according to the base objective.
Also, infinite data does not at all imply that the training distribution is the same as the deployment distribution. Even if an inner optimizer defects on some data point, that point can be rare in the training distribution but more common in deployment, resulting in a high reward in training but bad behavior in deployment.
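To make that concrete, here is a toy numeric sketch (the setup and numbers are entirely made up for illustration, not anything from the discussion): a model that misbehaves only on a rare "trigger" input can look essentially optimal under the training distribution while performing much worse under the deployment distribution.

```python
# Toy illustration (made-up numbers): a model that misbehaves only on a rare
# "trigger" input. Reward is 1 for good behavior, 0 for bad behavior.

def reward(is_trigger: bool) -> float:
    # Hypothetical model: behaves badly (reward 0) exactly when the trigger appears.
    return 0.0 if is_trigger else 1.0

p_trigger_train = 1e-6    # trigger is vanishingly rare in training
p_trigger_deploy = 0.3    # but common in deployment

expected_train = (1 - p_trigger_train) * reward(False) + p_trigger_train * reward(True)
expected_deploy = (1 - p_trigger_deploy) * reward(False) + p_trigger_deploy * reward(True)

print(expected_train)   # ~0.999999 -- looks essentially optimal in training
print(expected_deploy)  # 0.7       -- substantially worse behavior in deployment
```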
I mean “infinite data” to include both all training data and all deployment data. In terms of what distribution, at least if we’re doing supervised learning (e.g. imitative amplification) it shouldn’t matter, since I’m asking for perfectly optimal performance, so it needs to be optimal for every data point. It gets a bit more complicated for RL, but you can just say that it needs to be optimal relative to the deployment MDP.
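A rough way to write this out (the notation here is mine, not anything from the thread): in the supervised case, "perfectly optimal performance" means the model is optimal pointwise on every input in the support of either distribution, so the particular mixture of training versus deployment data drops out; in the RL case, optimality is taken with respect to the deployment MDP.

```latex
% Supervised case: pointwise optimality on every input in either support,
% so the train/deployment mixture is irrelevant. (f_theta is the model,
% y^*(x) the target under the base objective, L the loss.)
\[
\forall x \in \operatorname{supp}(D_{\mathrm{train}}) \cup \operatorname{supp}(D_{\mathrm{deploy}}):
\qquad f_\theta(x) \in \arg\min_{y}\; L\bigl(y,\, y^{*}(x)\bigr)
\]

% RL case: the learned policy must be optimal for the deployment MDP M_deploy.
\[
\pi_\theta \in \arg\max_{\pi}\; \mathbb{E}_{\tau \sim (\pi,\, M_{\mathrm{deploy}})}\bigl[R(\tau)\bigr]
\]
```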
Ok, I buy this as a valid definition. Can you give some intuition for why it’s useful? Once we divorce the notion of “inner alignment” from inner optimizers, I do not see the point of having a name for it—the language overload is just going to lead to an unfounded intuition that inner optimizers are the main problem.
It’s worth noting that “inner optimizer” isn’t a term we used in “Risks from Learned Optimization,” nor is it a term I use myself or encourage others to use—rather, we used mesa-optimizer, which doesn’t have the same parallel with inner alignment. In hindsight, we probably should have separated things out and used “inner alignment” exclusively for the general definition and “mesa-alignment” or some similar term for the more specific one (perhaps I should try to write a post attempting to fix this).
I agree with John that we should define inner alignment only in contexts where mesa-optimisers exist, as you do in Risks from Learned Optimization (and as I do in AGI safety from first principles); that is, inner alignment = mesa-alignment. Under the “more general” definition you propose here, inner alignment includes any capabilities problem and any robustness problem, so it’s not really about “alignment”.
I think it’s possible to formulate definitions of inner alignment that don’t rely on mesa-optimizers but still exclude capabilities (e.g. by using concepts like 2-D robustness), though I agree that it gets a lot trickier when you try to do that, which was a major reason we just went with the very restrictive mechanistic definition that we ended up using in Risks.
UPDATE: after discussion in the comments, I think the root of the disagreements I had with Evan and Richard is that they’re thinking of “inner alignment” in a way which does not necessarily involve any inner optimizer at all. They’re thinking of generalization error as “inner alignment failure” essentially by definition, regardless of whether there’s any inner optimizer involved. Conversely, they think of “outer alignment” in a way which ignores generalization errors.
I don’t think this is true. I never said that inner alignment didn’t involve mesa-optimizers—in fact, I explicitly said previously that you could define it in the more specific way or the more general way. My point here is just that I want to define outer alignment as being specifically about the objective function in the abstract and what it incentivizes in the limit of infinite data.
Do you agree that (a) you’re thinking of “outer alignment” in a way which excludes generalization error by definition, and (b) generalization error can occur regardless of whether any inner optimizer is present?
Yes—I agree with both (a) and (b). I just don’t think that outer and inner alignment cover the full space of alignment problems. See this post I just published for more detail.
Ok, I just updated the end of the OP to account for these definitions. Let me know if the reworded argument makes sense.
Oh excellent, glad to see a fresh post on it.