Oh, I see. I’m not interested in “solving outer alignment” if that means “creating a real-world physical process that outputs numbers that reward good things and punish bad things in all possible situations” (because as you point out it seems far too stringent a requirement).
Then I was wondering whether they use a more refined notion of outer alignment, perhaps one that takes the agent's physical capabilities into account, and I was trying to ask whether something like that has already been written down anywhere.
You could look at ascription universality and ELK. The general mindset is roughly "ensure your reward signal captures everything that the agent knows"; I think it's well captured in "Mundane solutions to exotic problems".
Thanks a lot for these pointers!