My mental shorthand for this has been that outer alignment is getting the AI to know what we want it to do, and inner alignment is getting it to care. Like the difference between knowing how to pass a math test, and wanting to become a mathematician. Is that understanding different from what you’re describing here?
I don’t think that’s quite accurate: any sufficiently powerful AI will know what we want it to do.
Yes, true, and often probably better than we would be able to write it down, too.
I was under the impression that this meant that a sufficiently powerful AI would be outer-aligned by default, and that this is what enables several of the kinds of deceptions and other dangers we’re worried about.
Is the difference whether the goal is specified by humans vs. learned and adopted by the AI itself?
This would be the case if inner alignment meant what you think it does, but it doesn’t.
Yeah, outer alignment is focused on whether we can define what we want the AI to learn (i.e., write down a reward function). Inner alignment is focused on what the learned artifact (the AI) ends up learning to pursue.
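To make that concrete, here's a toy sketch of my own (the gridworld setup and all the names in it are made up for illustration): the reward function we write down is the outer-alignment object, while the objective the trained policy actually ends up pursuing is the inner-alignment object, and the two can agree perfectly on the training distribution yet come apart at deployment.

```python
# Toy illustration (hypothetical setup, not from any real training run): the
# *specified* reward is "pick up the coin", but if the coin always sits at the
# right edge during training, the *learned* objective might just be "go right".

def specified_reward(state):
    """Outer alignment question: does this reward capture what we actually want?"""
    return 1.0 if state["agent_pos"] == state["coin_pos"] else 0.0

def learned_objective(state):
    """Inner alignment question: is THIS what the trained policy ends up pursuing?
    Here it's a proxy that happened to correlate with reward during training."""
    return 1.0 if state["agent_pos"] == state["level_width"] - 1 else 0.0

# During training the coin is always at the right edge, so the two agree...
train_state = {"agent_pos": 9, "coin_pos": 9, "level_width": 10}
assert specified_reward(train_state) == learned_objective(train_state)

# ...but at deployment the coin moves, and the proxy comes apart from the
# reward we wrote down.
test_state = {"agent_pos": 9, "coin_pos": 3, "level_width": 10}
print(specified_reward(test_state))   # 0.0 -- no coin collected
print(learned_objective(test_state))  # 1.0 -- the proxy is still "satisfied"
```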