Yes, plan A is for the AI to be corrigible because of its uncertainty about human values and about the accuracy of its own reasoning (and to actively seek feedback for the same reason). The question is how to set things up so that this happens. We have some rough idea, but concrete existing proposals don't quite work.
I think plan B is for the AI to understand and satisfy human short-term preferences (including the preference for the AI to follow direct instructions, not to kill anyone or do anything drastic and irreversible, and to gather information relevant to understanding our preferences...). Realistically I think this will be the most robust measure, and we would use it even if we expected plan A to work.
The kind of utility-function surgery from this post is at best plan C.
The better our understanding, the less need there is for utility-function surgery (and vice versa).