I wonder whether you may be conflating two somewhat distinct (perhaps even orthogonal) challenges that the CIDR model does not capture:
1. Human actions may reflect human values very imperfectly (or worse, may be an imperfect reflection of inconsistent, conflicting values).
2. Some actions by the AI may damage the human, at which point the human's actions may stop being meaningfully correlated with their value function. This problem would still be relevant even if we somehow found an ideal human capable of acting on their values in a perfectly rational manner.
The first challenge “only” requires the AI to be better at deducing the “real” values. (“Only” is in quotes because it’s obviously still a major unsolved problem, and “real” is in quotes because it’s not a given what that actually means.) The second challenge is about the AI needing to be constrained in its actions even before it knows the value function, but there is at least a whole field of Safe RL on how to do this for much simpler tasks, like learning to move a robotic arm without breaking anything in the process.
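For concreteness, here is a minimal Python sketch of the flavour of constraint I have in mind from Safe RL: the agent only ever selects among actions that a conservative, pre-specified safety filter certifies, so safety does not depend on its still-unknown value estimates. All names here (`is_certified_safe`, `estimate_value`, the action strings) are hypothetical illustrations, not any particular library's API.

```python
import random

# Hypothetical action set for a robotic-arm-like task.
ACTIONS = ["nudge_left", "nudge_right", "fast_swing"]

def is_certified_safe(action):
    """Conservative constraint specified *before* any values are learned."""
    return action != "fast_swing"  # assume fast swings risk breaking something

def estimate_value(action, observations):
    """Placeholder for whatever value-learning method the agent uses."""
    return observations.get(action, 0.0)

def choose_action(observations, epsilon=0.1):
    # Both exploration and exploitation are restricted to the certified-safe
    # set, so the agent stays constrained even while its value estimates
    # are wrong or missing.
    safe_actions = [a for a in ACTIONS if is_certified_safe(a)]
    if random.random() < epsilon:
        return random.choice(safe_actions)
    return max(safe_actions, key=lambda a: estimate_value(a, observations))

if __name__ == "__main__":
    observations = {"nudge_left": 0.2, "nudge_right": 0.5}
    print(choose_action(observations))
```

The point of the sketch is only that the constraint lives outside the value estimate; the hard part, of course, is specifying such a filter for anything richer than a robotic arm.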