My mental shorthand for this has been that outer alignment is getting the AI to know what we want it to do, and inner alignment is getting it to care. Like the difference between knowing how to pass a math test, and wanting to become a mathematician. Is that understanding different from what you’re describing here?
I don’t think that’s quite accurate: any sufficiently powerful AI will know what we want it to do.
Yes, true, and often probably better than we would be able to write it down, too.
I was under the impression that this meant that a sufficiently powerful AI would be outer-aligned by default, and that this is what enables several of the kinds of deceptions and other dangers we’re worried about.
Is the difference whether the goal is specified by humans vs. learned and adopted by the AI itself?
This would be the case if inner alignment meant what you think it does, but it doesn’t.
Yeah, outer alignment is focused on whether we can define what we want the AI to learn (i.e., write down a reward function). Inner alignment is focused on what the learned artifact (the AI) ends up learning to pursue.
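To make that concrete, here's a toy sketch of my own (the gridworld setup and all the names in it are made up for illustration): the reward function we write down is the outer-alignment object, while the objective the trained policy actually ends up pursuing is the inner-alignment object, and the two can agree perfectly on the training distribution yet come apart at deployment.

```python
# Toy illustration (hypothetical setup, not from any real training run): the
# *specified* reward is "pick up the coin", but if the coin always sits at the
# right edge during training, the *learned* objective might just be "go right".

def specified_reward(state):
    """Outer alignment question: does this reward capture what we actually want?"""
    return 1.0 if state["agent_pos"] == state["coin_pos"] else 0.0

def learned_objective(state):
    """Inner alignment question: is THIS what the trained policy ends up pursuing?
    Here it's a proxy that happened to correlate with reward during training."""
    return 1.0 if state["agent_pos"] == state["level_width"] - 1 else 0.0

# During training the coin is always at the right edge, so the two agree...
train_state = {"agent_pos": 9, "coin_pos": 9, "level_width": 10}
assert specified_reward(train_state) == learned_objective(train_state)

# ...but at deployment the coin moves, and the proxy comes apart from the
# reward we wrote down.
test_state = {"agent_pos": 9, "coin_pos": 3, "level_width": 10}
print(specified_reward(test_state))   # 0.0 -- no coin collected
print(learned_objective(test_state))  # 1.0 -- the proxy is still "satisfied"
```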