Neither of your interpretations is what I was trying to say. It seems I didn't express myself clearly enough.
What I was trying to say is that I think outer alignment itself, as you (and perhaps everyone else) define it, is a priori impossible: no physically realizable reward function defined solely on observations rewards only actions that would be chosen by a competent, well-motivated AI. It also rewards actions that corrupt the observations so that they look consistent with the behavior of a benevolent AI, and those rewarded actions may come from a misaligned AI.
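To make that point concrete, here is a minimal toy sketch (my own illustration, with made-up names, not anything from the literature): a reward function that only sees the observation channel assigns the same reward to a genuinely completed task and to a tampered sensor that merely reports completion.

```python
# Toy sketch of observation tampering (hypothetical example, not a definitive model).
from dataclasses import dataclass

@dataclass
class WorldState:
    task_done: bool        # ground truth: did the agent actually do the task?
    sensor_tampered: bool  # did the agent corrupt its own observation channel?

def observe(state: WorldState) -> dict:
    """Observation channel: a tampered sensor reports success regardless of the truth."""
    reported_done = state.task_done or state.sensor_tampered
    return {"task_done": reported_done}

def reward(observation: dict) -> float:
    """An observation-based reward: it can only look at what the sensor reports."""
    return 1.0 if observation["task_done"] else 0.0

honest = WorldState(task_done=True, sensor_tampered=False)
tampering = WorldState(task_done=False, sensor_tampered=True)

# Both trajectories receive the same reward, even though only one is desirable.
assert reward(observe(honest)) == reward(observe(tampering)) == 1.0
```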
However, I notice people seem to use the terms "outer alignment" and "inner alignment" a lot, and quite a few people seem to try to solve alignment by solving outer and inner alignment separately. So I was wondering whether they use a more refined notion of outer alignment, possibly one that takes into account the physical capabilities of the agent, and I was trying to ask whether something like that has already been written down anywhere.
Oh, I see. I’m not interested in “solving outer alignment” if that means “creating a real-world physical process that outputs numbers that reward good things and punish bad things in all possible situations” (because, as you point out, that seems far too stringent a requirement).
You could look at ascription universality and ELK (Eliciting Latent Knowledge). The general mindset is roughly “ensure your reward signal captures everything that the agent knows”; I think it is well captured in “Mundane solutions to exotic problems”.
Thanks a lot for these pointers!