Yes, true, and often probably better than we would be able to write it down, too.
I was under the impression that this meant that a sufficiently powerful AI would be outer-aligned by default, and that this is what enables several of the kinds of deceptions and other dangers we’re worried about.
Is the difference between the goal being specified by humans vs being learned and assumed by the AI itself?
> I was under the impression that this meant that a sufficiently powerful AI would be outer-aligned by default, and that this is what enables several of the kinds of deceptions and other dangers we’re worried about.
This would be the case if inner alignment meant what you think it does, but it doesn’t.
> Is the difference between the goal being specified by humans vs being learned and assumed by the AI itself?
Yeah, outer alignment is focused on whether we can define what we want the AI to learn (i.e. write down a reward function). Inner alignment is focused on what the learned artifact (the AI) ends up learning to pursue.
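If it helps to see the distinction concretely, here is a toy sketch in Python. Everything in it (the one-dimensional gridworld, the names `outer_reward` and `learned_policy`, and the "move toward the brightest cell" proxy story) is a made-up illustration, not anything from the thread: the reward function stands for the objective we wrote down, and the hand-written policy stands in for whatever objective a trained model might actually end up pursuing.

```python
# Toy sketch: the outer/inner distinction in a 1-D gridworld.
# The environment and the "learned" policy are hand-written purely to
# illustrate the two questions; no real training is happening here.

def outer_reward(state: dict) -> float:
    """The objective we wrote down: +1 for standing on the goal cell.
    Outer alignment asks whether this reward captures what we actually want."""
    return 1.0 if state["agent"] == state["goal"] else 0.0


def learned_policy(state: dict) -> int:
    """Stand-in for the learned artifact. Suppose that during training the goal
    cell was always the brightest cell, so the policy internalised the proxy
    'move toward the brightest cell'. Inner alignment asks whether this learned
    objective matches the reward we specified. Returns a step of -1, 0, or +1."""
    target = max(range(len(state["brightness"])),
                 key=lambda i: state["brightness"][i])
    if state["agent"] < target:
        return 1
    if state["agent"] > target:
        return -1
    return 0


def rollout(state: dict, steps: int = 10) -> float:
    """Run the policy and score it with the reward we actually specified."""
    total = 0.0
    for _ in range(steps):
        state["agent"] += learned_policy(state)
        total += outer_reward(state)
    return total


# Training distribution: the goal cell is the brightest cell, so the proxy works.
train_state = {"agent": 0, "goal": 4, "brightness": [0, 0, 0, 0, 9]}
print("on-distribution reward:", rollout(dict(train_state)))

# Off distribution: a bright decoy appears at cell 0. The policy still pursues
# brightness, never reaches the goal, and the specified reward stays at 0.
test_state = {"agent": 2, "goal": 4, "brightness": [9, 0, 0, 0, 5]}
print("off-distribution reward:", rollout(dict(test_state)))
```

On the training distribution the two objectives coincide, so the policy looks aligned; once the correlation breaks, what the policy pursues and the reward we specified come apart, which is the inner alignment worry.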