I’m way out of my depth here, but my thought is that it’s very common for humans to want to modify their utility functions. For example, a struggling alcoholic would probably love to not value alcohol anymore. There are lots of other examples, too, of people wanting to modify their personalities or bodies.
It depends on the type of AGI too, I would think. If superhuman AI ends up being like a paperclip maximizer that’s just really good at following its utility function, then yeah, maybe it wouldn’t mess with its utility function. If superintelligence means it has emergent characteristics like opinions and self-reflection or whatever, it seems plausible it could want to modify its utility function, say after thinking about philosophy for a while.
Like I said, I’m way out of my depth though, so maybe that’s all total nonsense.
I’m not convinced “want to modify their utility functions” is the most useful perspective. I think it might be more helpful to say that we each have multiple utility functions, which conflict to varying degrees and have voting power in different areas of the mind. I’ve had first-hand experience with such conflicts (as essentially everyone probably has, knowingly or not), and it feels like fighting yourself. Consider a hypothetical example: “Do I eat that extra donut?” Part of you wants the donut; that part feels more like an instinct, a visceral urge. Part of you knows you’ll be ill afterwards and will feel guilty about cheating on your diet; this part feels more like “you”, since it’s the part that thinks in words. You stand there and struggle, trying to make yourself walk away, as your hand reaches out for the donut. I’ve been in similar situations where (though I balked at the possible philosophical ramifications) I felt like if I had a button to make me stop wanting the thing, I’d push it, yet often it was the other function that won.
I feel like if you gave an agent the ability to modify its utility functions, the one that would win depends on which one had access to the mechanism (do you merely think the thought? push a button?), and whether it understands what the mechanism means. (The word “donut” doesn’t evoke nearly as strong a reaction as a picture of a donut, for instance; your donut-craving subsystem doesn’t inherently understand the word.)
Conversely, one might argue that a craving for donuts is more hardwired instinct than part of the “mind”, and so doesn’t count… but I feel like (1) finding a true dividing line is gonna be real hard, and (2) even setting that aside, I expect many or most people have goals localized in the same part of the mind that nevertheless are not internally consistent, and in some cases there may be reasonable-sounding goals that turn out to be completely incompatible with more important goals. In such a case I could imagine an agent deciding it’s better to stop wanting the thing it can’t have.
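To make the “which part has access to the mechanism” point a bit more concrete, here’s a toy sketch in Python. It’s not a claim about how minds (or AGIs) actually work; the classes, weights, and the remove_craving “button” are all invented purely for illustration.

```python
# Toy sketch of the idea above: two sub-utility functions "vote" on actions,
# but only one of them has access to the self-modification mechanism.
# All names and numbers here are invented for illustration.

class SubUtility:
    def __init__(self, name, weights, can_self_modify):
        self.name = name
        self.weights = weights              # maps action -> how much this part wants it
        self.can_self_modify = can_self_modify

    def vote(self, action):
        return self.weights.get(action, 0.0)


class Agent:
    def __init__(self, parts):
        self.parts = parts

    def choose(self, actions):
        # Crude "voting power" model: the action with the highest summed vote wins.
        return max(actions, key=lambda a: sum(p.vote(a) for p in self.parts))

    def remove_craving(self, requester, action):
        # The hypothetical "stop wanting it" button: only a part with access
        # to the mechanism can zero out everyone's weight on that action.
        if requester.can_self_modify:
            for p in self.parts:
                p.weights[action] = 0.0
            return True
        return False


craving = SubUtility("visceral", {"eat_donut": 5.0, "walk_away": 0.0}, can_self_modify=False)
verbal = SubUtility("verbal", {"eat_donut": -1.0, "walk_away": 2.0}, can_self_modify=True)
agent = Agent([craving, verbal])

print(agent.choose(["eat_donut", "walk_away"]))  # eat_donut: the craving outvotes the verbal part
agent.remove_craving(verbal, "eat_donut")        # the verbal part pushes the button
print(agent.choose(["eat_donut", "walk_away"]))  # walk_away, once the craving is zeroed out
```

If you instead give only the visceral part access to the mechanism, the craving survives even though the verbal part would happily erase it, which is roughly the situation I was describing.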
If you literally have multiple UFs, you literally are multiple agents. Or you could use a term with less formal baggage, like “preferences”.
In the formal sense, having a utility function at all requires you to be consistent, so if you have inconsistent preferences, you don’t have a utility function at all, just preferences.
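To spell out the standard argument (nothing original here): a real-valued utility function can only represent preferences that are at least transitive. If your preferences cycle,

$$A \succ B,\quad B \succ C,\quad C \succ A \;\;\Longrightarrow\;\; u(A) > u(B) > u(C) > u(A),$$

and no real-valued $u$ can satisfy that chain. So an agent whose preferences conflict in this way doesn’t have a single utility function to modify in the first place, just preferences.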