I wonder whether you may be conflating two somewhat distinct (perhaps even orthogonal) challenges that the CIDR model does not capture:
1. Human actions may reflect human values very imperfectly (or worse, may be an imperfect reflection of inconsistent, conflicting values).
2. Some actions by the AI may damage the human, at which point the human's actions may stop being meaningfully correlated with their value function. This problem would still be relevant even if we somehow found an ideal human capable of acting on their values in a perfectly rational manner.
The first challenge “only” requires the AI to be better at deducing the “real” values. (“Only” is in quotes because it’s obviously still a major unsolved problem, and “real” is in quotes because it’s not a given what that actually means.) The second challenge is about the AI needing to be constrained in its actions even before it knows the value function, but there is at least a whole field of Safe RL on how to do this for much simpler tasks, like learning to move a robotic arm without breaking anything in the process.
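For concreteness, here is a minimal Python sketch of the flavour of constraint I have in mind from Safe RL: the agent only ever selects among actions that a conservative, pre-specified safety filter certifies, so safety does not depend on its still-unknown value estimates. All names here (`is_certified_safe`, `estimate_value`, the action strings) are hypothetical illustrations, not any particular library's API.

```python
import random

# Hypothetical action set for a robotic-arm-like task.
ACTIONS = ["nudge_left", "nudge_right", "fast_swing"]

def is_certified_safe(action):
    """Conservative constraint specified *before* any values are learned."""
    return action != "fast_swing"  # assume fast swings risk breaking something

def estimate_value(action, observations):
    """Placeholder for whatever value-learning method the agent uses."""
    return observations.get(action, 0.0)

def choose_action(observations, epsilon=0.1):
    # Both exploration and exploitation are restricted to the certified-safe
    # set, so the agent stays constrained even while its value estimates
    # are wrong or missing.
    safe_actions = [a for a in ACTIONS if is_certified_safe(a)]
    if random.random() < epsilon:
        return random.choice(safe_actions)
    return max(safe_actions, key=lambda a: estimate_value(a, observations))

if __name__ == "__main__":
    observations = {"nudge_left": 0.2, "nudge_right": 0.5}
    print(choose_action(observations))
```

The point of the sketch is only that the constraint lives outside the value estimate; the hard part, of course, is specifying such a filter for anything richer than a robotic arm.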