If you can come up with a prior that can learn human preferences, why put that prior into a superintelligent agent instead of first updating it to match human preferences? It seems like the latter could be safer as one could then investigate the learned preferences directly, and as one then doesn’t have to deal with it making mistakes before it has learned much.
My immediate reaction is: you should definitely update as far as you can and do this investigation! But no matter how much you investigate the learned preferences, you should still deploy your AI with some residual uncertainty, because it is unlikely you can update it "all the way". Two reasons why this might be the case:
1. Some of the data you would need to update all the way will require the superintelligent agent's help to collect. For example, collecting human preferences about the specifics of far-future interstellar colonization seems impossible right now, because we don't yet know what is technologically feasible.
2. You might decide that the human preferences we really care about are the outcomes of some very long-running process like the Long Reflection. In that case you can't investigate the learned preferences ahead of time, but in the meantime you still want to create superintelligences that safeguard the Long Reflection until it completes.
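To make the "residual uncertainty" point concrete, here is a minimal illustrative sketch (not from the original discussion; the hypothesis space, utilities, and choice model are all invented for illustration). A discrete prior over candidate preference hypotheses is Bayes-updated on observed human choices; even after many consistent observations, hypotheses the available data cannot distinguish keep nonzero posterior mass:

```python
import math

# Hypothetical preference hypotheses: each maps an option to a utility.
hypotheses = {
    "likes_A": {"A": 1.0, "B": 0.0},
    "likes_B": {"A": 0.0, "B": 1.0},
    "indifferent": {"A": 0.5, "B": 0.5},
}

def likelihood(hyp, choice, options=("A", "B"), beta=2.0):
    """Boltzmann-rational choice model: P(choice | hypothesis)."""
    utils = hypotheses[hyp]
    exps = {o: math.exp(beta * utils[o]) for o in options}
    return exps[choice] / sum(exps.values())

def update(prior, observations):
    """Bayes-update a prior over hypotheses on a list of observed choices."""
    posterior = dict(prior)
    for choice in observations:
        posterior = {h: p * likelihood(h, choice) for h, p in posterior.items()}
        z = sum(posterior.values())
        posterior = {h: p / z for h, p in posterior.items()}
    return posterior

prior = {h: 1 / 3 for h in hypotheses}
posterior = update(prior, ["A"] * 10)  # ten observed choices of option A

# "likes_A" now dominates, but "indifferent" retains nonzero mass:
# the data we can gather today never fully rules it out.
```

The point of the sketch is that no finite set of answerable queries drives the posterior to a point mass, which is why deployment with residual uncertainty (rather than a fully-updated prior) seems unavoidable.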