If you can come up with a prior that can learn human preferences, why put that prior into a superintelligent agent instead of first updating it to match human preferences? It seems like the latter could be safer as one could then investigate the learned preferences directly, and as one then doesn’t have to deal with it making mistakes before it has learned much.
My immediate reaction is: you should definitely update as far as you can and do this investigation! But no matter how much you investigate the learned preferences, you should still deploy your AI with some residual uncertainty, because it is unlikely you can update it "all the way". Two reasons why this might be the case:
1. Some of the data you would need to update all the way will require the superintelligent agent's help to collect. For example, collecting human preferences about the specifics of far-future interstellar colonization seems impossible right now, because we don't yet know what is technologically feasible.
2. You might decide that the human preferences we really care about are the outcomes of some very long-running process like the Long Reflection. In that case you can't investigate the learned preferences ahead of time, but in the meantime you still want to create superintelligences that safeguard the Long Reflection until it completes.
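To make the "residual uncertainty" point concrete, here is a minimal illustrative sketch (not from the original discussion; the hypothesis space, utilities, and choice model are all invented for illustration). A discrete prior over candidate preference hypotheses is Bayes-updated on observed human choices; even after many consistent observations, hypotheses the available data cannot distinguish keep nonzero posterior mass:

```python
import math

# Hypothetical preference hypotheses: each maps an option to a utility.
hypotheses = {
    "likes_A": {"A": 1.0, "B": 0.0},
    "likes_B": {"A": 0.0, "B": 1.0},
    "indifferent": {"A": 0.5, "B": 0.5},
}

def likelihood(hyp, choice, options=("A", "B"), beta=2.0):
    """Boltzmann-rational choice model: P(choice | hypothesis)."""
    utils = hypotheses[hyp]
    exps = {o: math.exp(beta * utils[o]) for o in options}
    return exps[choice] / sum(exps.values())

def update(prior, observations):
    """Bayes-update a prior over hypotheses on a list of observed choices."""
    posterior = dict(prior)
    for choice in observations:
        posterior = {h: p * likelihood(h, choice) for h, p in posterior.items()}
        z = sum(posterior.values())
        posterior = {h: p / z for h, p in posterior.items()}
    return posterior

prior = {h: 1 / 3 for h in hypotheses}
posterior = update(prior, ["A"] * 10)  # ten observed choices of option A

# "likes_A" now dominates, but "indifferent" retains nonzero mass:
# the data we can gather today never fully rules it out.
```

The point of the sketch is that no finite set of answerable queries drives the posterior to a point mass, which is why deployment with residual uncertainty (rather than a fully-updated prior) seems unavoidable.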