Ok, I buy this as a valid definition. Can you give some intuition for why it’s useful? Once we divorce the notion of “inner alignment” from inner optimizers, I do not see the point of having a name for it—the language overload is just going to lead to an unfounded intuition that inner optimizers are the main problem.
It’s worth noting that “inner optimizer” isn’t a term we used in “Risks from Learned Optimization” nor is it a term I use myself or encourage others to use—rather, we used mesa-optimizer, which doesn’t have the same parallel with inner alignment. In hindsight, we probably should have separated things out and used inner alignment exclusively for the general definition and “mesa-alignment” or some other similar term for the more specific one (perhaps I should try to write a post attempting to fix this).
I agree with John that we should define inner alignment only in contexts where mesa-optimisers exist, as you do in Risks from Learned Optimization (and as I do in AGI safety from first principles); that is, inner alignment = mesa-alignment. Under the “more general” definition you propose here, inner alignment includes any capabilities problem and any robustness problem, so it’s not really about “alignment”.
I think it’s possible to formulate definitions of inner alignment that don’t rely on mesa-optimizers but exclude capabilities (e.g. by using concepts like 2-D robustness), though I agree that it gets a lot trickier when you try to do that, which was a major motivator for why we just went with the very restrictive mechanistic definition that we ended up using in Risks.
Ok, I just updated the end of the OP to account for these definitions. Let me know if the reworded argument makes sense.
UPDATE: after discussion in the comments, I think the root of the disagreements I had with Evan and Richard is that they’re thinking of “inner alignment” in a way which does not necessarily involve any inner optimizer at all. They’re thinking of generalization error as “inner alignment failure” essentially by definition, regardless of whether there’s any inner optimizer involved. Conversely, they think of “outer alignment” in a way which ignores generalization errors.
I don’t think this is true. I never said that inner alignment didn’t involve mesa-optimizers—in fact, I explicitly said previously that you could define it in the more specific way or the more general way. My point here is just that I want to define outer alignment as being specifically about the objective function in the abstract and what it incentivizes in the limit of infinite data.
Do you agree that (a) you’re thinking of “outer alignment” in a way which excludes generalization error by definition, and (b) generalization error can occur regardless of whether any inner optimizer is present?
Yes—I agree with both (a) and (b). I just don’t think that outer and inner alignment cover the full space of alignment problems. See this post I just published for more detail.
Oh excellent, glad to see a fresh post on it.