I think humans and aligned AGIs are only ever very indirect pointers to preference (value, utility function). It doesn't make sense to talk of authoritative/normative utility functions that directly relate to their behavior, or that describe it other than through a very indirect extrapolation process, one that takes ages and probably can't be fully completed anyway.
The utility functions/values that do describe/guide behavior are approximations that are knowably and desirably reflectively unstable: they should keep changing on reflection. As such, optimizing according to them too strongly destroys value, and it also makes them progressively worse approximations via Goodhart's Law. An AGI that holds these approximations (proxy goals) as reflectively stable goals is catastrophically misaligned: it will destroy value by optimizing for the proxy goals past the point where they stop being good approximations of the (unknown) intended goals.
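To make the over-optimization point concrete, here is a toy sketch (my own illustration, with made-up functions, not a model of anything claimed above): a proxy that agrees with the intended goal over ordinary actions but keeps rewarding more extreme actions, where the intended goal collapses. Screening more candidates by the proxy stands in for stronger optimization pressure.

```python
# Toy Goodhart sketch: all functions and numbers here are invented for
# illustration only.
import numpy as np

rng = np.random.default_rng(0)

def true_value(x):
    # Stand-in for the unknown intended goal: improves up to about x = 10,
    # then collapses when the action is pushed further.
    return x - 0.05 * x ** 2

def proxy_value(x):
    # Stand-in for the learned approximation: agrees with the true goal for
    # modest x, but keeps rewarding ever-larger x.
    return x

for n_candidates in (10, 100, 10_000):
    # Screening more candidates by the proxy = stronger optimization pressure.
    pool = rng.exponential(scale=5.0, size=n_candidates)
    chosen = max(pool, key=proxy_value)
    print(f"candidates={n_candidates:>6}  chosen x={chosen:5.1f}"
          f"  true value={true_value(chosen):7.1f}")
```

Under mild pressure the proxy-picked action lands near the true optimum; under strong pressure it lands deep in the region where the approximation has broken down. The mild-pressure rows are also roughly what the "soft" optimization in the next paragraph looks like in this toy setting.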
So AI alignment is not about aligning utility functions that relate to current behavior in any straightforward/useful way. It's about making sure that optimization is soft and corrigible, that it stops before Goodhart's Curse starts destroying value, and that it follows the redefinition of value as it grows.