UPDATE: after discussion in the comments, I think the root of the disagreements I had with Evan and Richard is that they’re thinking of “inner alignment” in a way which does not necessarily involve any inner optimizer at all. They’re thinking of generalization error as “inner alignment failure” essentially by definition, regardless of whether there’s any inner optimizer involved. Conversely, they think of “outer alignment” in a way which ignores generalization errors.
I don’t think this is true. I never said that inner alignment didn’t involve mesa-optimizers—in fact, I explicitly said previously that you could define it in the more specific way or the more general way. My point here is just that I want to define outer alignment as being specifically about the objective function in the abstract and what it incentivizes in the limit of infinite data.
Do you agree that (a) you’re thinking of “outer alignment” in a way which excludes generalization error by definition, and (b) generalization error can occur regardless of whether any inner optimizer is present?
Yes—I agree with both (a) and (b). I just don’t think that outer and inner alignment cover the full space of alignment problems. See this post I just published for more detail.
Oh excellent, glad to see a fresh post on it.