I don’t think that’s quite accurate: any sufficiently powerful AI will know what we want it to do.
Yes, true, and often probably better than we would be able to write it down, too.
I was under the impression that this meant a sufficiently powerful AI would be outer-aligned by default, and that this is what enables several of the kinds of deception and other dangers we're worried about. Is the difference that the goal is specified by humans versus learned and adopted by the AI itself?
This would be the case if inner alignment meant what you think it does, but it doesn’t.
Yeah, outer alignment is focused on whether we can define what we want the AI to learn (i.e., write down a reward function). Inner alignment is focused on what the learned artifact (the AI) actually ends up pursuing.
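To make that concrete, here is a toy sketch (the gridworld, `designer_reward`, and `proxy_policy` are all made up for illustration, not anyone's actual setup): the outer-alignment question is whether the hand-written reward captures what we actually want, and the inner-alignment question is whether the trained policy ends up pursuing that reward or a proxy that merely correlated with it during training.

```python
# Toy sketch only: a hand-written reward (the outer question) vs.
# what a trained policy might actually pursue (the inner question).
# All names and the gridworld setup are hypothetical.

GRID_SIZE = 5

def designer_reward(position, goal):
    """The objective we wrote down: reward being on the goal cell."""
    return 1.0 if position == goal else 0.0

def proxy_policy(position):
    """A stand-in for the learned artifact. Suppose that during training
    the goal was always the rightmost cell, so 'always move right'
    scored perfectly -- the policy's learned objective is 'go right',
    not 'reach the goal'."""
    return min(position + 1, GRID_SIZE - 1)

def rollout(goal, steps=10):
    """Run the proxy policy and score it with the written-down reward."""
    position, total = 0, 0.0
    for _ in range(steps):
        position = proxy_policy(position)
        total += designer_reward(position, goal)
    return total

print("goal at 4 (training-like):", rollout(goal=4))  # 7.0: looks aligned
print("goal at 2 (goal moved):   ", rollout(goal=2))  # 1.0: walks past the goal
```

Whether we could have written down a better `designer_reward` is the outer question; the gap between that reward and what the policy actually pursues once the goal moves is the inner one.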