Other people have given good answers to the main question, but I want to add just a little more context about self-modifying code.
A bunch of MIRI’s early work explored the difficulties that arise when “rationality” (including utility functions induced by consistent preferences) interacts with “self-modification” or “self-improvement”; a good example is this paper. They pointed out some major challenges that come up when an agent tries to reason about what future versions of itself will do. This is particularly important because one failure mode of AI alignment is to build an aligned AI that accidentally self-modifies into an unaligned AI (note that continuous learning is a restricted form of self-modification and suffers related problems). There are reasons to expect that powerful AI agents will be self-modifying (ideally self-improving), so this is an important question to have an answer to (relevant keywords include “stable values” and “value drift”).
There’s also some thinking about self-modification in the human-rationality sphere; two things that come to mind are here and here. This is relevant because ways in which humans deviate from having (approximate, implicit) utility functions may be irrational, though the other responses point out limitations of this perspective.
This is relevant because ways in which humans deviate from having (approximate, implicit) utility functions may be irrational
I disagree; simple utility functions are fundamentally incapable of capturing the complexity/subtleties/nuance of human preferences.
I agree with shard theory that “human values are contextual influences on human decision making”.
If you claim that deviations from a utility function are irrational, by what standard do you make that judgment? John Wentworth showed in “Why Subagents?” that inexploitable agents exist that do not have preferences representable by simple/unitary utility functions.
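To make that concrete, here is a toy sketch of the kind of agent Wentworth describes (my own construction, not code from “Why Subagents?”): a “committee” that only accepts trades every subagent weakly endorses. Its preferences are incomplete, so no single utility function represents them, yet it cannot be money-pumped.

```python
# A minimal toy sketch (assumed construction, not code from the post):
# an agent that accepts a trade only when every subagent weakly prefers it
# and at least one strictly prefers it. Its preferences are incomplete
# (some pairs of states go unranked), so no unitary utility function
# represents them, yet it never accepts a cycle of trades that leaves it
# strictly worse off, i.e. it cannot be money-pumped.

from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, float]  # e.g. {"apples": 3, "oranges": 1}


@dataclass
class Subagent:
    name: str
    utility: Callable[[State], float]


class Committee:
    def __init__(self, subagents: List[Subagent]):
        self.subagents = subagents

    def accepts(self, current: State, proposed: State) -> bool:
        """Accept a trade only if it is a Pareto improvement for the subagents."""
        gains = [s.utility(proposed) - s.utility(current) for s in self.subagents]
        return all(g >= 0 for g in gains) and any(g > 0 for g in gains)


# Two subagents that care about different goods.
agent = Committee([
    Subagent("apple-lover", lambda s: s["apples"]),
    Subagent("orange-lover", lambda s: s["oranges"]),
])

a = {"apples": 2, "oranges": 1}
b = {"apples": 1, "oranges": 2}

# Neither direction of this trade is accepted: the preference between a and b
# is simply unranked, so the agent stays put instead of paying to cycle.
print(agent.accepts(a, b), agent.accepts(b, a))   # False False
print(agent.accepts(a, {"apples": 3, "oranges": 1}))  # True: strict Pareto gain
```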
Going further, I think utility functions are anti-natural to generally capable optimisers in the real world.
I suspect that desires for novelty/a sense of boredom (which contribute to the path dependence of human values), or similar mechanisms, are necessary to promote sufficient exploration in the real world. Some RL algorithms do explore purely in order to maximise their expected return, so I’m not claiming that EU maximisation rules out exploration; rather, embedded agents in the real world are limited in how effectively they can explore without inherent drives for it.
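To illustrate the distinction I’m gesturing at (a toy sketch with made-up numbers, not a claim about what real embedded agents do): in a small bandit problem, exploration can come either from the return-maximising policy itself (epsilon-greedy) or from an inherent novelty drive, here a count-based bonus that rewards visiting rarely-tried options.

```python
# Toy multi-armed bandit contrasting two sources of exploration:
# (1) epsilon-greedy exploration in the service of expected return,
# (2) an inherent novelty bonus that favours under-visited arms.
# Arm payoffs and weights are made-up illustration values.

import random
from collections import defaultdict

random.seed(0)

TRUE_MEANS = [0.2, 0.5, 0.8]  # hypothetical arm payoffs


def pull(arm: int) -> float:
    return random.gauss(TRUE_MEANS[arm], 0.1)


def run(novelty_weight: float, epsilon: float, steps: int = 500) -> list:
    counts = defaultdict(int)
    values = defaultdict(float)  # running mean of observed reward per arm
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(len(TRUE_MEANS))
        else:
            # Score = estimated value + novelty bonus for rarely-pulled arms.
            arm = max(
                range(len(TRUE_MEANS)),
                key=lambda a: values[a] + novelty_weight / (1 + counts[a]) ** 0.5,
            )
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]
    return [counts[a] for a in range(len(TRUE_MEANS))]


# Pure return-driven exploration vs. an added "boredom"/novelty drive.
print("epsilon-greedy only:", run(novelty_weight=0.0, epsilon=0.1))
print("with novelty bonus: ", run(novelty_weight=1.0, epsilon=0.0))
```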
simple utility functions are fundamentally incapable of capturing the complexity/subtleties/nuance of human preferences.
No objections there.
that inexploitable agents exist that do not have preferences representable by simple/unitary utility functions.
Yep.
Going further, I think utility functions are anti-natural to generally capable optimisers in the real world.
I tentatively agree.
That said, the existence of a utility function is a sometimes-useful simplifying assumption, in a way similar to how logical omniscience is (or should we be doing all math with logical inductors?), and the formalism naturally generalizes to something like “the set of utility functions consistent with revealed preferences”.
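As a toy illustration of what I mean by “the set of utility functions consistent with revealed preferences” (the candidate family and the observed choices below are made up): given some observed choices, keep only the candidate utility functions that rank each chosen option at least as high as the rejected one.

```python
# Minimal sketch: filter a (made-up) family of candidate utility functions
# down to the subset consistent with some observed pairwise choices.

from itertools import product

OUTCOMES = ["salad", "burger", "fast"]  # toy outcome labels

# Candidate family: every assignment of utilities in {0, 1, 2} to the outcomes.
candidates = [
    dict(zip(OUTCOMES, utils)) for utils in product(range(3), repeat=len(OUTCOMES))
]

# Revealed preferences: (chosen, rejected) pairs observed in past decisions.
observed_choices = [("salad", "fast"), ("burger", "fast")]


def consistent(u: dict) -> bool:
    # A candidate survives if it never ranks a rejected option above a chosen one.
    return all(u[chosen] >= u[rejected] for chosen, rejected in observed_choices)


consistent_set = [u for u in candidates if consistent(u)]
print(len(consistent_set), "of", len(candidates), "candidates remain consistent")
```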
In the context of human rationality, I have found a local utility function perspective to be sometimes useful, especially as a probe into personal reasons; that is, if you say “this is my utility function” and then you notice “huh… my action does not reflect that”, this can prompt useful contemplation, some possible outcomes of which are:
You neglected a relevant term in the utility function, e.g. happiness of your immediate family
You neglected a relevant utility cost of the action you didn’t take, e.g. the aversiveness of being sweaty
You neglected a constraint, e.g. you cannot actually productively work for 12 hours a day, 6 days a week
The circumstances in which you acted are outside the region of validity of your approximation of the utility function, e.g. you don’t actually know how valuable having $100B would be to you
You made a mistake with your action
Of course, a utility function framing is neither necessary nor sufficient for this kind of reflection, but for me, and I suspect some others, it is helpful.
If shard theory is right, the utility functions of the different shards are weighted differently in different contexts.
The relevant criterion is not Pareto optimality with respect to a set of utility functions/a vector-valued utility function.
Or rather, Pareto optimality will still be a constraint, but the utility function needs to be defined over agent/environment state in order to account for the context sensitivity.
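Here is a toy sketch of the context-weighting picture (my own formalisation for illustration, not shard theory’s): each shard carries its own utility over agent/environment state, and the context sets how strongly each shard bears on the decision, so which action looks best shifts with context.

```python
# Toy sketch (assumed formalisation): shard utilities over state, combined
# with context-dependent weights, so the effective ranking of options changes
# with context. Shard names, weights, and states are made up for illustration.

from typing import Callable, Dict

State = Dict[str, float]


def social_shard(s: State) -> float:
    return s["time_with_friends"]


def work_shard(s: State) -> float:
    return s["tasks_done"]


SHARDS: Dict[str, Callable[[State], float]] = {
    "social": social_shard,
    "work": work_shard,
}

# Hypothetical context-dependent weights on each shard's influence.
CONTEXT_WEIGHTS = {
    "weekend": {"social": 0.8, "work": 0.2},
    "deadline_week": {"social": 0.2, "work": 0.8},
}


def effective_value(context: str, state: State) -> float:
    w = CONTEXT_WEIGHTS[context]
    return sum(w[name] * shard(state) for name, shard in SHARDS.items())


relax = {"time_with_friends": 4.0, "tasks_done": 1.0}
grind = {"time_with_friends": 1.0, "tasks_done": 4.0}

# The same two options get ranked differently depending on context.
for ctx in CONTEXT_WEIGHTS:
    best = max([relax, grind], key=lambda s: effective_value(ctx, s))
    print(ctx, "->", "relax" if best is relax else "grind")
```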