I often have thoughts like “Consider an AI with a utility function that is just barely incorrect, such that it doesn’t place any value on boredom. Then the AI optimizes the universe in a bad way.”
One problem with this thought is that it’s not clear that I’m really thinking about anything in particular, anything which actually exists. What am I actually considering in the above quotation? With respect to what, exactly, is the AI’s utility function “incorrect”? Is there a utility function for which its optimal policies are aligned?
For sufficiently expressive utility functions, the answer has to be “yes.” For example, if the utility function is defined over the AI’s action histories, you can effectively hardcode safe, benevolent behavior into it: utility 0 if the AI has ever taken a bad action, 1 otherwise. Since there presumably exists at least some sequence of AI outputs which leads to wonderful outcomes, this action-history utility function has aligned optimal policies.
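As a minimal sketch of what such a trivially “aligned” action-history utility function could look like (the action names and the `history_utility` function are hypothetical, purely for illustration):

```python
# Toy sketch, not a real AI system: a utility function over the AI's entire
# action history that awards 1 only if no action in the history is on a
# hardcoded "bad" list, and 0 otherwise.

BAD_ACTIONS = {"deceive_operator", "disable_off_switch", "seize_resources"}

def history_utility(action_history: list[str]) -> int:
    """Return 0 if the history contains any bad action, 1 otherwise."""
    return 0 if any(a in BAD_ACTIONS for a in action_history) else 1

# Any policy that never emits a bad action is optimal for this utility
# function, so "aligned optimal policies" exist for it by construction.
assert history_utility(["answer_question", "shut_down_on_request"]) == 1
assert history_utility(["answer_question", "deceive_operator"]) == 0
```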
But this is trivial and not what we mean by a “correct” utility function. So now I’m left with a puzzle: what does it mean for the AI to have a correct utility function? I do not think this is a quibble. The quoted thought seems disconnected from the substance of the alignment problem.
I think humans and aligned AGIs are only ever very indirect pointers to preference (value, utility function). It makes no sense to talk of an authoritative or normative utility function that directly relates to their behavior, or that describes it other than through a very indirect extrapolation process, one that takes ages and probably doesn’t even make sense as something that can be fully completed.
The utility functions/values that do describe or guide behavior are approximations, ones that are knowably and desirably reflectively unstable: they should keep changing on reflection. As such, optimizing them too strongly destroys value and also makes them progressively worse approximations, via Goodhart’s Law. An AGI that holds these approximations (proxy goals) as reflectively stable goals is catastrophically misaligned: it will destroy value by optimizing the proxy goals past the point where they stop being good approximations of the (unknown) intended goals.
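As a toy illustration of that over-optimization dynamic (everything here is made up for illustration: a proxy that matches the intended utility for moderate actions but keeps rewarding extremity where the intended utility falls off):

```python
import numpy as np

rng = np.random.default_rng(0)

def true_utility(x):
    # Hypothetical "intended" utility: agrees with the proxy for moderate x,
    # falls off sharply for extreme x.
    return x - 0.05 * x**4

def proxy_utility(x):
    # Proxy goal: a decent local approximation ("more x is better") that keeps
    # rewarding extremity where the intended utility no longer does.
    return x

def optimize(strength):
    # "Optimization strength" = number of candidate actions searched before
    # taking the proxy-argmax.
    candidates = rng.normal(0, 1, size=strength)
    return candidates[np.argmax(proxy_utility(candidates))]

for strength in [1, 10, 100, 10_000]:
    x = optimize(strength)
    print(f"strength={strength:6d}  chosen x={x:6.2f}  true utility={true_utility(x):7.2f}")
# Mild optimization of the proxy tends to improve true utility; pushing much
# harder selects extreme x where the proxy and the intended goal come apart.
```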
So AI alignment is not about aligning utility functions that relate to current behavior in any straightforward or useful way. It’s about making sure that optimization is soft and corrigible: that it stops before Goodhart’s Curse starts destroying value, and that it follows the redefinition of value as it grows.
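One concrete way to gesture at “soft” optimization is quantilization: sampling from the top fraction of candidates under the proxy instead of taking the argmax. The sketch below reuses the toy proxy from above; it illustrates that general idea, and is not a specific proposal from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_utility(x):
    # Same hypothetical "intended" utility as in the previous sketch.
    return x - 0.05 * x**4

def proxy_utility(x):
    return x

def argmax_policy(candidates):
    # Hard optimization: take the proxy-best candidate.
    return candidates[np.argmax(proxy_utility(candidates))]

def quantilizer_policy(candidates, q=0.1):
    # Soft optimization: sample uniformly from the top-q fraction of
    # candidates as ranked by the proxy, rather than taking the argmax.
    ranked = candidates[np.argsort(proxy_utility(candidates))]
    top = ranked[int((1 - q) * len(ranked)):]
    return rng.choice(top)

candidates = rng.normal(0, 1, size=100_000)
print("argmax      true utility:", true_utility(argmax_policy(candidates)))
print("quantilizer true utility:", true_utility(quantilizer_policy(candidates)))
# The argmax chases the proxy into the regime where it badly misestimates the
# intended goal; the quantilizer stays in the region where the proxy is still
# a reasonable approximation.
```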