Sure, just my quick reactions:
FAI via golden rule: Done right, this would end up looking like Inverse Reinforcement Learning, which we can’t make work because it doesn’t learn values we would be happy optimizing, only some set of values that would cause the human to act as they do in the current context (a toy sketch after this list makes that concrete). I think there’s just no way to avoid the hard work of figuring out, ourselves, a good way for the AI to learn human values. This is definitely something people have thought about in the past and are still thinking about, trying to get it to work.
FAI via multiple competing agents: One agent will probably find a loophole, and then the whole scheme has no effect. If your scheme really works, it should work even better with just one agent.
Whitelisting: Either produces a dumb agent or is too computationally difficult, and it may require solving the hard problems just to generate the whitelist in the first place.
Evolution: Will produce AIs that do the equivalent of using condoms. They don’t want what evolution “wants”; the two merely correlated in the ancestral environment.
Fragility/robustness: Helps maintain value alignment once you have it, but doesn’t help get there in the first place.
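To make the IRL point concrete, here is a minimal toy sketch (every number and the MDP itself are made up purely for illustration): two quite different reward functions induce exactly the same optimal behavior, so demonstrations alone cannot tell them apart, even though optimizing one rather than the other could come apart badly outside the demonstrated context.

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9

# P[a, s, s'] = probability of landing in s' after taking action a in state s
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[1.0, 0.0], [1.0, 0.0]]   # action 0 always moves to state 0
P[1] = [[0.0, 1.0], [0.0, 1.0]]   # action 1 always moves to state 1

def optimal_policy(R):
    """Plain value iteration; R[s] is the reward for being in state s."""
    V = np.zeros(n_states)
    for _ in range(500):
        # Q[s, a] = R[s] + gamma * sum_s' P[a, s, s'] * V[s']
        Q = R[:, None] + gamma * np.einsum("ast,t->sa", P, V)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

R1 = np.array([0.0, 1.0])   # "true" values: only state 1 matters
R2 = np.array([5.0, 5.5])   # very different values, same preference ordering

print(optimal_policy(R1))   # -> [1 1]
print(optimal_policy(R2))   # -> [1 1]: identical behavior, so demonstrations
                            #    alone cannot distinguish these value functions
```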
Thanks! I knew people had essentially devised these ideas before (and if they had instantly worked we would have solved FAI already), but I think there is something to be gained from reinterpreting them in the RRM frame. For example, if the human value function derives from discoverable symmetries of neural structure and the external environment, then we can do the work to discover those symmetries and impose them directly in the agent architecture. And I don’t think that statement is trivially equivalent to telling people “find human rewards and put them in the agent” (which is literally the whole FAI problem all over again). Symmetry is an empirically discoverable property and also a strong constraint for optimization purposes: under symmetric constraints the agent still needs to learn human values, but it may have an easier time of it. Anyway, clearly I haven’t done a great job communicating, and the ideas are all at the intuition stage. Maybe in the future I’ll actually try to prove a reinforcement-learning theorem using the RRM philosophy.
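For a rough sense of what “impose them directly in the agent architecture” could look like, here is a minimal sketch with an entirely hypothetical symmetry (invariance of a toy reward model under swapping “self” and “other” features). It is only meant to show how an architecturally imposed symmetry shrinks the space of reward functions the agent has to search, not to claim this is the actual RRM construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "reward network": 4-dim observation = (self features, other features).
# Parameters are random because this is only about the constraint, not training.
W = rng.normal(size=(4, 8))
b = rng.normal(size=8)

def raw_reward(x):
    # unconstrained scalar reward; no symmetry is respected here
    return float(np.tanh(x @ W + b).sum())

def swap(x):
    # the (hypothetical) group action: exchange "self" and "other" features
    return np.concatenate([x[2:], x[:2]])

def symmetric_reward(x):
    # impose the symmetry architecturally: average the raw output over the
    # group {identity, swap}, so invariance holds by construction and the
    # learner only ever searches the invariant subspace of reward functions
    return 0.5 * (raw_reward(x) + raw_reward(swap(x)))

x = rng.normal(size=4)
print(symmetric_reward(x), symmetric_reward(swap(x)))  # equal by construction
print(raw_reward(x), raw_reward(swap(x)))              # generally not equal
```

The point of averaging over the group, rather than hoping the learner picks the invariance up from data, is that the constraint holds exactly from the start, which is what would make an empirically discovered symmetry a strong constraint for optimization.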