Hello all,
I am new to alignment theory and was hoping to get a better understanding of utility functions. In particular I’m wondering why we talk about programs as always optimizing some utility function. Is this a known property of computer programs? Is there a theorem or something that says every computer program optimizes some function?
I’m also wondering: does this apply equally well to programs that can change their own code, or to programs running on a quantum computer (or other settings that aren’t as deterministic as a typical Python script, if that makes sense)?
Thanks!
I don’t think it’s actually true that “we” talk about programs always optimizing some utility function. Many programs don’t. (Well, I guess you can describe pretty much anything in terms of optimizing a sufficiently artificially-defined utility function, but that’s not a helpful thing to do.)
But
There are theorems that kinda-sorta say that perfectly rational agents have to have some utility function they’re optimizing the expectation of: https://en.wikipedia.org/wiki/Von_Neumann%E2%80%93Morgenstern_utility_theorem.
“Optimizing a utility function” seems like a pretty good approximate description of what anything acting with purpose in the world is doing.
But but
Nothing made out of actual physical matter in the actual physical universe is going to be a perfectly rational agent in the relevant sense.
Human beings definitely don’t behave exactly as if optimizing well-defined utility functions.
It’s easy to envisage nightmare scenarios, should some AI gain a great deal of power, where the AI is singlemindedly optimizing some utility function that produces very bad results when something very powerful optimizes it.
Today’s most impressive AI systems aren’t (so far as we know) trying to optimize anything. (You can loosely describe what they do as “trying to maximize the plausibility of the text being generated”, but there isn’t actually any optimization process going on when the system is running, only when it’s being trained.)
But but but
Some people worry that an AI system that isn’t overtly trying to optimize anything might have things in its inner workings that effectively are performing optimization processes, which could be bad on account of those nightmare scenarios. (Especially as in such a case the optimization target will not have been carefully designed by anyone, it’ll just be something whose optimization produced good results during the training process.)
Anyway: I don’t think anyone thinks it’s helpful to think of all programs as optimizing anything. But some programs, particularly ones that are in some sense trying to get things done in a complicated world, might helpfully be thought of that way, either because they literally are optimizing something or because they’re doing something like optimizing something.
See “Why The Focus on Expected Utility Maximisers”.
Short response: I think unitary utility functions are a distraction and don’t describe the decision making of real world intelligent systems very well.
Longer response: In an environment with a common, scarce and fungible resource that an agent has monotonically nondecreasing preferences over, an agent that is inexploitable with respect to that resource behaves as an expected utility maximiser.
However, the relevant theorems assume that:
- The agent has complete preferences: for any pair of options (lotteries in VNM), the agent prefers one of them or is indifferent.
  - Alternatively: there exists a total order over their preferences.
- The agent’s preferences are path independent/don’t have any internal state: the option the agent prefers in a particular scenario does not depend on how the agent got there.
  - Consider that human preferences are inherently contextual; which option we prefer depends on our history (previous choices, previous experiences, etc.) and the context of the situation.
- The agent’s preferences are static: they do not change over time.
These preconditions are pretty unrealistic and do not describe humans or financial markets well. I do not expect them to describe any generally capable systems in the real world well either. I.e. I suspect that expected utility maximising is anti-natural to general capabilities.
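To make the path-independence point concrete, here is a minimal sketch (the agent and its options are invented for illustration) of an agent whose choice depends on its own history, so that no single static total order over the options reproduces its behaviour:

```python
# Hypothetical sketch: an agent with internal state ("boredom") whose
# preferences depend on its own history, violating the path-independence
# and static-preferences assumptions of the VNM-style theorems.

class BoredAgent:
    """Prefers whichever of two options it did not choose last time."""
    def __init__(self):
        self.last_choice = None

    def choose(self, a, b):
        # The preference flips based on internal state: no fixed total
        # order over {a, b} predicts this sequence of choices.
        choice = b if self.last_choice == a else a
        self.last_choice = choice
        return choice

agent = BoredAgent()
print([agent.choose("tea", "coffee") for _ in range(4)])
# alternates: ['tea', 'coffee', 'tea', 'coffee']
```

Asked the same question four times, the agent gives alternating answers, which is exactly the kind of contextual, history-dependent preference the theorems rule out by assumption.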
Shard theory presents a compelling rebuttal to expected utility maximisation.
The idea of a utility function comes from various theorems (originating independently of computers and programming) that attempt to codify the concept of “rational choice”. These theorems demonstrate that if someone has a preference relation over the possible outcomes of their actions, and this preference relation satisfies certain reasonable-sounding conditions, then there must exist a numerical function of those outcomes (called the “utility function”) such that their preference relation over actions is equivalent to comparing the expected utilities arising from those actions. Their most preferred action is therefore the one that maximises expected utility.
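The upshot of these theorems can be sketched in a few lines of code (the outcomes, probabilities, and utility values here are all made up for illustration): once a utility function over outcomes exists, ranking lotteries reduces to comparing expected utilities.

```python
# Illustrative sketch of the VNM conclusion: given a utility function over
# outcomes, an agent ranks lotteries (probability distributions over
# outcomes) by their expected utility.

utility = {"nothing": 0.0, "umbrella": 1.0, "car": 100.0}

def expected_utility(lottery):
    """lottery: dict mapping outcome -> probability (summing to 1)."""
    return sum(p * utility[outcome] for outcome, p in lottery.items())

safe   = {"umbrella": 1.0}                # an umbrella for certain
gamble = {"car": 0.05, "nothing": 0.95}   # 5% chance of a car

# The VNM-rational agent prefers the lottery with higher expected utility.
print(expected_utility(safe), expected_utility(gamble))   # 1.0 5.0
```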
Here is Eliezer’s exposition of the concept in the context of LessWrong.
The theorem most commonly mentioned is the VNM (Von Neumann–Morgenstern) theorem, but there are several other derivations of similar results.
The foundations of utility theory are entangled with the foundations of probability. For example, Leonard Savage (The Foundations of Statistics, 1954 and 1972) derives both together from the agent’s preferences.
The theorems are normative: they say that a rational agent must have preferences that can be described by a utility function, or they are liable to, for example, pay to get B instead of A, but then pay again to get A instead of B (without ever having had B before switching back). Actual agents do whatever they do, regardless of the theorems.
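The pay-to-switch failure mode is the classic "money pump", which can be sketched in a toy simulation (the items and fee are invented for illustration): an agent with cyclic preferences A < B < C < A happily pays for each trade it prefers and can be walked around the cycle indefinitely.

```python
# Toy "money pump" sketch: an agent with cyclic preferences pays a small
# fee for every trade it prefers, and ends up back where it started,
# strictly poorer. Preferences that admit a utility function cannot cycle
# like this.

prefers = {("A", "B"), ("B", "C"), ("C", "A")}  # right item preferred over left

def pump(item, rounds, fee=1):
    cycle = {"A": "B", "B": "C", "C": "A"}
    paid = 0
    for _ in range(rounds):
        offer = cycle[item]
        if (item, offer) in prefers:  # the agent happily pays to "trade up"
            item = offer
            paid += fee
    return item, paid

print(pump("A", rounds=6))   # ('A', 6): back at A, 6 units poorer
```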
One occasionally sees statements to the effect that “everything has a utility function, because we can just attach utility 1 to what it does and 0 to what it doesn’t do.” I call this the Texas Sharpshooter Utility Function, by analogy with the Texas Sharpshooter, who shoots at a barn door and then draws a target around the bullet hole. Such a supposed utility function is exactly as useful as a stopped clock is for telling the time.
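For concreteness, the Texas Sharpshooter construction looks like this in code (a deliberately trivial sketch): you observe the behaviour first, then define the "utility function" around it, so it explains everything and predicts nothing.

```python
# The "Texas Sharpshooter utility function": after observing what a system
# did, assign utility 1 to exactly that behaviour and 0 to everything else.
# Every possible behaviour gets such a function, so it has zero predictive
# or explanatory power.

def sharpshooter_utility(observed_action):
    return lambda action: 1 if action == observed_action else 0

u = sharpshooter_utility("whatever the program happened to do")
print(u("whatever the program happened to do"), u("anything else"))  # 1 0
```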
The term “utility function” can mean:
1. The mathematical sense: a mathematical function with certain consistency properties, allowing an agent that uses it to make optimal decisions.
2. The engineering senses:
   2.1) An actual code module, something that is distinct and separable in the source code.
   2.2) An aspect of the system’s operation that is thoroughly entangled with its other aspects.
   2.3) Something external to an AI system, relevant only while it is being trained.
3. The “stance”: a way of describing or thinking about a system, not necessarily describing anything in the territory, in the spirit of Dennett’s “intentional stance”.
4. A technical-sounding but actually vague way of talking about preferences or goals, with no implication of any particular mathematical or engineering implementation.
Note how (1) is entirely in the territory, whereas (3) is entirely a map or stance. Note also that only the “engineering” senses are relevant to practical AI safety.
Other people have given good answers to the main question, but I want to add just a little more context about self-modifying code.
A bunch of MIRI’s early work explored the difficulties of the interaction of “rationality” (including utility functions induced by consistent preferences) with “self-modification” or “self-improvement”; a good example is this paper. They pointed out some major challenges that come up when an agent tries to reason about what future versions of itself will do; this is particularly important because one failure mode of AI alignment is to build an aligned AI that accidentally self-modifies into an unaligned AI (note that continuous learning is a restricted form of self-modification and suffers related problems). There are reasons to expect that powerful AI agents will be self-modifying (ideally self-improving), so this is an important question to have an answer to (relevant keywords include “stable values” and “value drift”).
There’s also some thinking about self-modification in the human-rationality sphere; two things that come to mind are here and here. This is relevant because ways in which humans deviate from having (approximate, implicit) utility functions may be irrational, though the other responses point out limitations of this perspective.
I disagree; simple utility functions are fundamentally incapable of capturing the complexity/subtleties/nuance of human preferences.
I agree with shard theory that “human values are contextual influences on human decision making”.
If you claim that deviations from a utility function are irrational, by what standard do you make that judgment? John Wentworth showed in “Why Subagents?” that inexploitable agents exist that do not have preferences representable by simple/unitary utility functions.
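One way to see the "Why Subagents?" point is a minimal committee sketch (the subagents and their utilities are invented for illustration): an agent made of two subagents that only trades when neither objects has incomplete preferences, yet cannot be money-pumped.

```python
# Hedged sketch of the subagents idea: a committee of two subagents, each
# with its own utility function, accepts a trade only if no member is made
# worse off. Its preferences are incomplete (it refuses trades in both
# directions between A and B), yet it is inexploitable: it never pays to
# go around a cycle.

u1 = {"A": 0, "B": 1}   # subagent 1 prefers B
u2 = {"A": 1, "B": 0}   # subagent 2 prefers A

def committee_accepts(current, offered):
    # Trade only on a (weak) Pareto improvement for the committee.
    return u1[offered] >= u1[current] and u2[offered] >= u2[current]

print(committee_accepts("A", "B"), committee_accepts("B", "A"))  # False False
```

No single/unitary utility function over {A, B} reproduces this "refuse both directions" behaviour, yet there is no sequence of trades that pumps money out of the committee.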
Going further, I think utility functions are anti-natural to generally capable optimisers in the real world.
I suspect that desires for novelty/a sense of boredom (which contribute to the path dependence of human values), or similar mechanisms, are necessary to promote sufficient exploration in the real world. (Some RL algorithms do explore in order to maximise their expected return, so I’m not claiming that EU maximisation rules out exploration; rather, embedded agents in the real world are limited in how effectively they can explore without inherent drives for it.)
No objections there.
Yep.
I tentatively agree.
That said
The existence of a utility function is a sometimes useful simplifying assumption, in a way similar to how logical omniscience is (or should we be doing all math with logical inductors?), and naturally generalizes as a formalism to something like “the set of utility functions consistent with revealed preferences”.
In the context of human rationality, I have found a local utility function perspective to be sometimes useful, especially as a probe into personal reasons; that is, if you say “this is my utility function” and then you notice “huh… my action does not reflect that”, this can prompt useful contemplation, some possible outcomes of which are:
- You neglected a relevant term in the utility function, e.g. the happiness of your immediate family.
- You neglected a relevant utility cost of the action you didn’t take, e.g. the aversiveness of being sweaty.
- You neglected a constraint, e.g. you cannot actually productively work for 12 hours a day, 6 days a week.
- The circumstances in which you acted are outside the region of validity of your approximation of the utility function, e.g. you don’t actually know how valuable having $100B would be to you.
- You made a mistake with your action.
Of course, a utility function framing is neither necessary nor sufficient for this kind of reflection, but for me, and I suspect some others, it is helpful.
If shard theory is right, the utility functions of the different shards are weighted differently in different contexts.
The relevant criterion is not Pareto optimality with respect to a set of utility functions/a vector-valued utility function. Or rather, Pareto optimality will still be a constraint, but the utility function needs to be defined over agent/environment state in order to account for the context sensitivity.
No, utility functions are not a property of computer programs in general. They are a property of (a certain class of) agents.
A utility function is just a way for an agent to evaluate states, where positive values are good (for states the agent wants to achieve), negative values are bad (for states the agent wants to avoid), and neutral values are neutral (for states the agent doesn’t care about one way or the other). This mapping from states to utilities can be anything in principle: a measure of how close to homeostasis the agent’s internal state is, a measure of how many smiles exist on human faces, a measure of the number of paperclips in the universe, etc. It all depends on how you program the agent (or how our genes and culture program us).
Utility functions drive decision-making. Behavioral policies and actions that tend to lead to states of high utility will get positively reinforced, such that the agent will learn to do those things more often. And policies/actions that tend to lead to states of low (or negative) utility will get negatively reinforced, such that the agent learns to do them less often. Eventually, the agent learns to steer the world toward states of maximum utility.
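The decision loop just described can be sketched in a few lines (the world model, states, and utility values are all invented for illustration): the agent evaluates the state each available action leads to and picks the action whose resulting state scores highest.

```python
# Minimal sketch of utility-driven action selection: evaluate the state
# each action would lead to, then pick the action maximising utility.

states_after = {"stay": "cold_room",
                "close_window": "warm_room",
                "open_fridge": "very_cold_room"}

utility = {"cold_room": -1.0, "warm_room": 2.0, "very_cold_room": -5.0}

def choose_action(actions):
    # Greedy one-step maximisation over predicted next states.
    return max(actions, key=lambda a: utility[states_after[a]])

print(choose_action(["stay", "close_window", "open_fridge"]))  # close_window
```

A learning agent replaces the hand-written `states_after` model and the greedy one-step lookahead with learned predictions and longer horizons, but the evaluate-and-steer structure is the same.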
Depending on how aligned an AI’s utility function is with humanity’s, this could be good or bad. It turns out that for highly capable agents, this tends to be bad far more often than good (e.g., maximizing smiles or paperclips will lead to a universe devoid of value for humans).
Nondeterminism really has nothing to do with this. Agents that can modify their own code could in principle optimize for their utility functions even more strongly than if they were stuck at a certain level of capability, but a utility function still needs to be specified in some way regardless.