johnswentworth comments on “Inner Alignment Failures” Which Are Actually Outer Alignment Failures

johnswentworth 2 Nov 2020 22:42 UTC
LW: 4 AF: 3
Ok, I just updated the end of the OP to account for these definitions. Let me know if the reworded argument makes sense.
- evhub 9 Nov 2020 20:17 UTC
  LW: 4 AF: 3
  
  > UPDATE: after discussion in the comments, I think the root of the disagreements I had with Evan and Richard is that they’re thinking of “inner alignment” in a way which does not necessarily involve any inner optimizer at all. They’re thinking of generalization error as “inner alignment failure” essentially by definition, regardless of whether there’s any inner optimizer involved. Conversely, they think of “outer alignment” in a way which ignores generalization errors.
  
  I don’t think this is true. I never said that inner alignment didn’t involve mesa-optimizers—in fact, I explicitly said previously that you could define it in the more specific way or the more general way. My point here is just that I want to define outer alignment as being specifically about the objective function in the abstract and what it incentivizes in the limit of infinite data.
  - johnswentworth 9 Nov 2020 20:32 UTC
    LW: 5 AF: 4
    Do you agree that (a) you’re thinking of “outer alignment” in a way which excludes generalization error by definition, and (b) generalization error can occur regardless of whether any inner optimizer is present?
    - evhub 9 Nov 2020 20:41 UTC
      LW: 6 AF: 4
      Yes—I agree with both (a) and (b). I just don’t think that outer and inner alignment cover the full space of alignment problems. See this post I just published for more detail.
      - johnswentworth 9 Nov 2020 21:03 UTC
        LW: 6 AF: 4
        Oh excellent, glad to see a fresh post on it.