An Orthodox Case Against Utility Functions
This post has benefitted from discussion with Sam Eisenstat, Scott Garrabrant, Tsvi Benson-Tilsen, Daniel Demski, Daniel Kokotajlo, and Stuart Armstrong. It started out as a thought about Stuart Armstrong’s research agenda.
In this post, I hope to say something about what it means for a rational agent to have preferences. The view I am putting forward is relatively new to me, but it is not very radical. It is, dare I say, a conservative view—I hold close to Bayesian expected utility theory. However, my impression is that it differs greatly from common impressions of Bayesian expected utility theory.
I will argue against a particular view of expected utility theory—a view which I’ll call reductive utility. I do not recall seeing this view explicitly laid out and defended (except in in-person conversations). However, I expect at least a good chunk of the assumptions are commonly made.
Reductive Utility
The core tenets of reductive utility are as follows:
The sample space of a rational agent’s beliefs is, more or less, the set of possible ways the world could be—which is to say, the set of possible physical configurations of the universe. Hence, each world is one such configuration.
The preferences of a rational agent are represented by a utility function from worlds to real numbers.
Furthermore, the utility function should be a computable function of worlds.
Since I’m setting up the view which I’m knocking down, there is a risk I’m striking at a straw man. However, I think there are some good reasons to find the view appealing. The following subsections will expand on the three tenets, and attempt to provide some motivation for them.
If the three points seem obvious to you, you might just skip to the next section.
Worlds Are Basically Physical
What I mean here resembles the standard physical-reductionist view. However, my emphasis is on certain features of this view:
There is some “basic stuff”—like quarks or vibrating strings or what-have-you.
What there is to know about the world is some set of statements about this basic stuff—particle locations and momenta, or wave-function values, or what-have-you.
These special atomic statements should be logically independent from each other (though they may of course be probabilistically related), and together, fully determine the world.
These should (more or less) be what beliefs are about, such that we can (more or less) talk about beliefs in terms of a sample space whose elements are worlds understood in this way.
This is the so-called “view from nowhere”, as Thomas Nagel puts it.
I don’t intend to construe this position as ruling out certain non-physical facts which we may have beliefs about. For example, we may believe indexical facts on top of the physical facts—there might be (1) beliefs about the universe, and (2) beliefs about where we are in the universe. Exceptions like this violate an extreme reductive view, but are still close enough to count as reductive thinking for my purposes.
Utility Is a Function of Worlds
So we’ve got the “basically physical” sample space $\Omega$. Now we write down a utility function $U: \Omega \to \mathbb{R}$. In other words, utility is a random variable on our event space.
What’s the big deal?
One thing this is saying is that preferences are a function of the whole world, not just the part the agent sees. In particular, preferences need not depend only on what is observed. This is incompatible with standard RL in a way that matters: there, reward is a function of observations (and actions), so it cannot distinguish world-states that look the same to the agent.
But, in addition to saying that utility can depend on more than just observations, we are restricting utility to only depend on things that are in the world. After we consider all the information in $\omega$, there cannot be any extra uncertainty about utility—no extra “moral facts” which we may be uncertain of. If there are such moral facts, they have to be present somewhere in the universe (at least, derivable from facts about the universe).
One implication of this: if utility is about high-level entities, the utility function is responsible for deriving them from low-level stuff. For example, if the universe is made of quarks, but utility is a function of beauty, consciousness, and such, then $U$ needs to contain the beauty-detector and consciousness-detector and so on—otherwise how can it compute utility given all the information about the world?
Utility Is Computable
Finally, and most critically for the discussion here, $U$ should be a computable function.
To clarify what I mean by this: a world $\omega$ should have some sort of representation which allows us to feed it into a Turing machine—let’s say it’s an infinite bit-string which assigns true or false to each of the “atomic sentences” which describe the world. $U$ should be a computable function of this representation; that is, there should be a Turing machine which takes a rational number $\epsilon > 0$ and takes $\omega$ on an input tape, prints a rational number within $\epsilon$ of $U(\omega)$, and halts. (In other words, we can compute $U(\omega)$ to any desired degree of approximation.)
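As a concrete illustration of this definition, here is a minimal sketch of my own (not from the post; the particular discounted-sum utility is made up): an approximation procedure that reads only finitely many bits of a world before halting.

```python
from typing import Callable

# A "world" is an infinite bit-string, represented lazily as a function
# from day-index to bit.  This particular (made-up) U is a discounted sum:
# U(w) = sum_t w(t) * 0.5**(t+1), which always lies in [0, 1].

def utility_approx(world: Callable[[int], int], epsilon: float) -> float:
    """Return an approximation of U(world) within epsilon.

    The tail sum beyond day T is bounded by 0.5**T, so the procedure only
    reads finitely many bits of the world before halting -- this is the
    kind of continuity that computability imposes on U.
    """
    total = 0.0
    t = 0
    while 0.5 ** t > epsilon:   # the unread tail could still matter
        total += world(t) * 0.5 ** (t + 1)
        t += 1
    return total

# Example: the world where the button is pressed only on day 3.
print(utility_approx(lambda t: 1 if t == 3 else 0, 1e-6))  # ~0.0625
```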
Why should $U$ be computable?
One argument is that $U$ should be computable because the agent has to be able to use it in computations. This perspective is especially appealing if you think of $U$ as a black-box function which you can only optimize through search. If you can’t evaluate $U$, how are you supposed to use it? If $U$ exists as an actual module somewhere in the brain, how is it supposed to be implemented? (If you don’t think this sounds very convincing, great!)
Requiring $U$ to be computable may also seem like an easy concession. What is there to lose? Are there preference structures we really care about being able to represent, which are fundamentally not computable?
And what would it even mean for a computable agent to have non-computable preferences?
However, the computability requirement is more restrictive than it may seem.
There is a sort of continuity implied by computability: $U$ must not depend too much on “small” differences between worlds. To approximate $U(\omega)$ to within $\epsilon$, the computation accesses only finitely many bits of $\omega$ before it halts. All the rest of the bits in $\omega$ must not make more than $\epsilon$ difference to the value of $U$.
This means some seemingly simple utility functions are not computable.
As an example, consider the procrastination paradox. Your task is to push a button. You get 10 utility for pushing the button. You can push it any time you like. However, if you never press the button, you get −10. On any day, you are fine with putting the button-pressing off for one more day. Yet, if you put it off forever, you lose!
We can think of $\omega$ as a string like 000000100..., where the “1” marks the day you push the button. To compute the utility, we might look for the “1”, outputting 10 if we find it.
But what about the all-zero universe, 0000000...? The program must loop forever. We can’t tell we’re in the all-zero universe by examining any finite number of bits. You don’t know whether you will eventually push the button. (Even if the description of the universe also includes your source code, you can’t necessarily tell from that—the logical difficulty of determining this about yourself is, of course, the original point of the procrastination paradox.)
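A minimal sketch of the naive evaluation procedure just described (my own illustration, not code from the post):

```python
from itertools import count
from typing import Callable

def procrastination_utility(world: Callable[[int], int]) -> int:
    """Naive evaluation: search for the day the button is pushed.

    Halts with 10 on any world where some bit is 1.  On the all-zero
    world the loop never terminates -- and no finite prefix can ever
    confirm that we should output -10 instead.  That is why this U is
    not computable in the sense defined above.
    """
    for day in count():
        if world(day) == 1:
            return 10
    return -10  # unreachable: "never pushed" can't be confirmed in finite time

print(procrastination_utility(lambda t: 1 if t == 6 else 0))  # prints 10
# procrastination_utility(lambda t: 0)  # would loop forever
```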
Hence, a preference structure like this is not computable, and is not allowed according to the reductive utility doctrine.
The advocate of reductive utility might take this as a victory: the procrastination paradox has been avoided, along with other paradoxes of a similar structure. (The St. Petersburg paradox is another example.)
On the other hand, if you think this is a legitimate preference structure, dealing with such ‘problematic’ preferences motivates abandonment of reductive utility.
Subjective Utility: The Real Thing
We can strongly oppose all three points without leaving orthodox Bayesianism. Specifically, I’ll sketch how the Jeffrey-Bolker axioms enable non-reductive utility. (The title of this section is a reference to Jeffrey’s book Subjective Probability: The Real Thing.)
However, the real position I’m advocating is more grounded in logical induction rather than the Jeffrey-Bolker axioms; I’ll sketch that version at the end.
The View From Somewhere
The reductive-utility view approached things from the starting-point of the universe. Beliefs are for what is real, and what is real is basically physical.
The non-reductive view starts from the standpoint of the agent. Beliefs are for things you can think about. This doesn’t rule out a physicalist approach. What it does do is give high-level objects like tables and chairs an equal footing with low-level objects like quarks: both are inferred from sensory experience by the agent.
Rather than assuming an underlying set of worlds, Jeffrey-Bolker assume only a set of events. For two events $A$ and $B$, the conjunction $A \wedge B$ exists, as does the disjunction $A \vee B$, and the negations $\neg A$ and $\neg B$. However, unlike in the Kolmogorov axioms, these are not assumed to be intersection, union, and complement of an underlying set of worlds.
Let me emphasize that: we need not assume there are “worlds” at all.
In philosophy, this is called situation semantics—an alternative to the more common possible-world semantics. In mathematics, it brings to mind pointless topology.
In the Jeffrey-Bolker treatment, a world is just a maximally specific event: an event which describes everything completely. But there is no requirement that maximally-specific events exist. Perhaps any event, no matter how detailed, can be further extended by specifying some yet-unmentioned stuff. (Indeed, the Jeffrey-Bolker axioms assume this! Although, Jeffrey does not seem philosophically committed to that assumption, from what I have read.)
Thus, there need not be any “view from nowhere”—no semantic vantage point from which we see the whole universe.
This, of course, deprives us of the objects which utility was a function of, in the reductive view.
Utility Is a Function of Events
The reductive-utility view makes a distinction between utility—the random variable $U$ itself—and expected utility, the subjective estimate of that random variable which we use for making decisions.
The Jeffrey-Bolker framework does not make this distinction. Everything is a subjective preference evaluation.
A reductive-utility advocate sees the expected utility of an event as derived from the utility of the worlds within the event. They start by defining $U(\omega)$; then, the expected utility of an event $E$ is defined as $\mathbb{E}[U \mid E] = \sum_{\omega \in E} U(\omega)\,P(\omega \mid E)$—or, more generally, the corresponding integral.
In the Jeffrey-Bolker framework, we instead define expected utility $\mathbb{E}[U \mid E]$ directly on events. These preferences are required to be coherent with breaking things up into sums: for mutually exclusive events $A$ and $B$, $\mathbb{E}[U \mid A \vee B] = \mathbb{E}[U \mid A]\,P(A \mid A \vee B) + \mathbb{E}[U \mid B]\,P(B \mid A \vee B)$—but we do not define one from the other.
We don’t have to know how to evaluate entire worlds in order to evaluate events. All we have to know is how to evaluate events!
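As a sanity check on what “coherent with breaking things up into sums” means, here is a small toy sketch of my own (not from the post). It uses a finite world-level model only to generate event-level expectations that are guaranteed coherent, and then verifies the mixture identity stated above:

```python
from itertools import product

# Toy model: worlds are 3-bit strings with a made-up uniform prior, and a
# made-up utility U(w) = number of 1s.  The world-level U is used only to
# *generate* a coherent example of event-level expectations.
worlds = list(product([0, 1], repeat=3))
prob = {w: 1 / 8 for w in worlds}
util = {w: sum(w) for w in worlds}

def P(event):
    """Probability of an event (a set of worlds)."""
    return sum(prob[w] for w in event)

def V(event):
    """Expected utility ('desirability') of an event."""
    return sum(prob[w] * util[w] for w in event) / P(event)

# Two mutually exclusive events: "first bit is 0" and "first bit is 1".
A = {w for w in worlds if w[0] == 0}
B = {w for w in worlds if w[0] == 1}

# Coherence: V(A or B) is the probability-weighted mixture of V(A), V(B).
mixture = (V(A) * P(A) + V(B) * P(B)) / (P(A) + P(B))
print(V(A | B), mixture)   # both print 1.5
```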
I find it difficult to really believe “humans have a utility function”, even approximately—but I find it much easier to believe “humans have expectations on propositions”. Something like that could even be true at the neural level (although of course we would not obey the Jeffrey-Bolker axioms in our neural expectations).
Updates Are Computable
Jeffrey-Bolker doesn’t say anything about computability. However, if we do want to address this sort of issue, it leaves us in a different position.
Because subjective expectation is primary, it is now more natural to require that the agent can evaluate events, without any requirement about a function on worlds. (Of course, we could do that in the Kolmogorov framework.)
Agents don’t need to be able to compute the utility of a whole world. All they need to know is how to update expected utilities as they go along.
Of course, the subjective expectations can’t be updated in just any way as you go along. The updates need to be coherent, in the sense of the Jeffrey-Bolker axioms. And maintaining coherence can be very difficult. But it can be quite easy even in cases where the random-variable treatment of the utility function is not computable.
Let’s go back to the procrastination example. In this case, to evaluate the expected utility of each action at a given time-step, the agent does not need to figure out whether it ever pushes the button. It just needs to have some probability, which it updates over time.
For example, an agent might initially assign some probability $p_t$ to pressing the button at time $t$, and some remaining probability $p_\infty > 0$ to never pressing the button. Its probability that it will ever press the button, and thus its utility estimate, would decrease with each observed time-step in which it didn’t press the button. (Of course, such an agent would press the button immediately.)
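A minimal sketch of this updating process, with made-up prior values of my own (the post does not commit to particular numbers): the estimate of “will the button ever be pressed?”, and hence the expected utility, is recomputed from the prior after each button-free day, with no halting-problem-style reasoning needed.

```python
# Made-up prior: probability 0.05 * 0.9**t of pressing the button on day t
# (t = 0, 1, 2, ...), leaving roughly 0.5 probability of never pressing it.
HORIZON = 500                      # truncate the infinite sum for the sketch
p_press_on_day = [0.05 * 0.9 ** t for t in range(HORIZON)]
p_never = 1.0 - sum(p_press_on_day)

def expected_utility_after(n: int) -> float:
    """E[U | button not pressed on days 0..n-1], under the prior above.

    U is +10 if the button is ever pressed and -10 otherwise.  Each
    button-free day shifts posterior mass toward "never pressed", so the
    estimate falls over time.
    """
    p_ever = sum(p_press_on_day[n:])   # mass on "pressed on some day >= n"
    total = p_ever + p_never           # mass consistent with the evidence
    return 10 * (p_ever / total) - 10 * (p_never / total)

for n in (0, 5, 20, 100):
    print(n, round(expected_utility_after(n), 3))   # ~0.0, then increasingly negative
```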
Of course, this “solution” doesn’t touch on any of the tricky logical issues which the procrastination paradox was originally introduced to illustrate. This isn’t meant as a solution to the procrastination paradox—only as an illustration of how to coherently update discontinuous preferences. This simple $U$ is uncomputable by the definition of the previous section.
It also doesn’t address computational tractability in a very real way, since if the prior is very complicated, computing the subjective expectations can get extremely difficult.
We can come closer to addressing logical issues and computational tractability by considering things in a logical induction framework.
Utility Is Not a Function
In a logical induction (LI) framework, the central idea becomes “update your subjective expectations in any way you like, so long as those expectations aren’t (too easily) exploitable to Dutch-book.” This clarifies what it means for the updates to be “coherent”—it is somewhat more elegant than saying ”… any way you like, so long as they follow the Jeffrey-Bolker axioms.”
This replaces the idea of a “utility function” entirely—there isn’t any need for a function $U$ on worlds any more, just a logically-uncertain-variable (LUV, in the terminology of the LI paper).
Actually, there are different ways one might want to set things up. I hope to get more technical in a later post. For now, here are some bullet points:
In the simple procrastination-paradox example, you push the button right away if you have any uncertainty at all about whether you will ever push it. So things are not that interesting. But, at least we’ve solved the problem.
In more complicated examples—where there is some real benefit to procrastinating—a LI-based agent could totally procrastinate forever. This is because LI doesn’t give any guarantee about converging to correct beliefs for uncomputable propositions like whether Turing machines halt or whether people stop procrastinating.
Believing you’ll stop procrastinating even though you won’t is perfectly coherent—in the same way that believing in nonstandard numbers is perfectly logically consistent. Putting ourselves in the shoes of such an agent, this just means we’ve examined our own decision-making to the best of our ability, and have put significant probability on “we don’t procrastinate forever”. This kind of reasoning is necessarily fallible.
Yet, if a system we built were to do this, we might have strong objections. So, this can count as an alignment problem. How can we give feedback to a system to avoid this kind of mistake? I hope to work on this question in future posts.
In this post, the author presents a case for replacing expected utility theory with some other structure which has no explicit utility function, but only quantities that correspond to conditional expectations of utility.
To provide motivation, the author starts from what he calls the “reductive utility view”, which is the thesis he sets out to overthrow. He then identifies two problems with the view.
The first problem is about the ontology in which preferences are defined. In the reductive utility view, the domain of the utility function is the set of possible universes, according to the best available understanding of physics. This is objectionable, because then the agent needs to somehow change the domain as its understanding of physics grows (the ontological crisis problem). It seems more natural to allow the agent’s preferences to be specified in terms of the high-level concepts it cares about (e.g. human welfare or paperclips), not in terms of the microscopic degrees of freedom (e.g. quantum fields or strings). There are also additional complications related to the unobservability of rewards, and to “moral uncertainty”.
The second problem is that the reductive utility view requires the utility function to be computable. The author considers this an overly restrictive requirement, since it rules out utility functions such as in the procrastination paradox (1 if the button is ever pushed, 0 if the button is never pushed). More generally, computable utility functions have to be continuous (in the sense of the topology on the space of infinite histories which is obtained from regarding it as an infinite cartesian product over time).
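Stated in symbols (my formalization of the reviewer’s parenthetical; this is the standard fact that type-2 computable functions on Cantor space are continuous):

```latex
U : \{0,1\}^{\mathbb{N}} \to \mathbb{R} \ \text{computable}
\;\Longrightarrow\;
U \ \text{continuous w.r.t. the product topology on } \{0,1\}^{\mathbb{N}}
```

where the product topology is generated by constraints on finitely many coordinates, i.e., finitely many time-steps of the history.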
The alternative suggested by the author is using the Jeffrey-Bolker framework. Alas, the author does not write down the precise mathematical definition of the framework, which I find frustrating. The linked article in the Stanford Encyclopedia of Philosophy is long and difficult, and I wish the post had a succinct distillation of the relevant part.
The gist of Jeffrey-Bolker is that there are some propositions we can make about the world, and each such proposition is assigned a number (its “desirability”). This corresponds to the conditional expected value of the utility function, with the proposition serving as a condition. However, there need not truly be a probability space and a utility function which realize this correspondence; instead, we can work directly with the assignment of numbers to propositions (as long as it satisfies some axioms).
In my opinion, the Jeffrey-Bolker framework seems interesting, but the case presented in the post for using it is weak. To see why, let’s return to our motivating problems.
The problem of ontology is a real problem; in this I agree with the author completely. However, Jeffrey-Bolker only offers some hint of a solution at best. To have a complete solution, one would need to explain in what language propositions are constructed and how the agent updates the desirability of propositions according to observations, and then prove some properties of the resulting framework which give it prescriptive power. I think that the author believes this can be achieved using Logical Induction, but the burden of proof is not met.
Hence, Jeffrey-Bolker is not sufficient to solve the problem. Moreover, I believe it is also not necessary! Indeed, infra-Bayesian physicalism offers a solution to the ontology problem which doesn’t require abandoning the concept of a utility function (although one has to replace the ordinary probabilistic expectations with infra-Bayesian expectations). That solution certainly has caveats (primarily, the monotonicity principle), but at the least it shows that utility functions are not entirely incompatible with solving the ontology problem.
On the other hand, with the problem of computability, I am not convinced by the author’s motivation. Do we truly need uncomputable utility functions? I am skeptical of inquiries which are grounded in generalization for the sake of generalization. I think it is often more useful to thoroughly understand the simplest non-trivial special case before we can confidently assert which generalizations are possible or desirable. And in rational agent theory, the special case of computable utility functions is not yet so thoroughly understood.
Moreover, I am not convinced that Jeffrey-Bolker allows us to handle uncomputable utility functions as easily as the author suggests. The author’s argument goes: the utility function might be uncomputable, but as long as its conditional expectations w.r.t. “valid” propositions are computable, there is no obstacle to rational behavior being computable. But how often does it happen that the utility function is uncomputable while all the relevant conditional expectations are computable?
The author suggests the following example: take the procrastination utility function and take some computable distribution over the first time when the button is pushed, plus a probability for the button to never be pushed. Then, we can compute the probability that the button is pushed conditional on it not having been pushed in the first n rounds. Alright, but now let’s consider a different distribution. Suppose a random Turing machine M is chosen[1] at the beginning of time, and on round n the button is pushed iff M halts after n steps. Notice that this distribution on sequences is perfectly computable[2]. But now, computing the probability that the button is pushed is impossible, since it’s the (in)famous Chaitin constant.
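A sketch of the sampling procedure described here (my own illustration; `utm_running_time` is a hypothetical stand-in for an actual prefix-free universal Turing machine simulator, and drawing a fixed-length random program is a simplification of the coin-flipping construction in footnote [1]):

```python
import random

def utm_running_time(program: bytes, max_steps: int):
    """Hypothetical helper: simulate a fixed prefix-free universal Turing
    machine on `program` and return the step at which it halts, or None
    if it has not halted within `max_steps` steps."""
    raise NotImplementedError   # stand-in; not implemented here

def sample_history(program_bits: int = 512, horizon: int = 10_000):
    """Sample one button-pressing history from the reviewer's distribution.

    A random program M is drawn by coin-flipping at the beginning of time;
    the button is pushed on round n iff M halts after n steps.  Sampling
    is cheap (simulate M one step per round), but the marginal probability
    that the button is *ever* pushed is Chaitin's Omega, which no
    algorithm can compute.
    """
    program = bytes(random.getrandbits(8) for _ in range(program_bits // 8))
    halt_step = utm_running_time(program, horizon)
    return halt_step            # None means: not pushed within the horizon
```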
Here too, the author seems to believe that Logical Induction should solve the procrastination paradox, and issues with uncomputable utility functions more generally, as a special case of Jeffrey-Bolker. But, so far I remain unconvinced.
[1] That is, we compose a random program for a prefix-free UTM by repeatedly flipping a fair coin, as usual in algorithmic information theory.
[2] It’s even polynomial-time sampleable.
An Orthodox Case Against Utility Functions was a shocking piece to me. Abram spends the first half of the post laying out a view he suspects people hold, but he thinks is clearly wrong, which is a perspective that approaches things “from the starting-point of the universe”. I felt dread reading it, because it was a view I held at the time, and I used as a key background perspective when I discussed bayesian reasoning. The rest of the post lays out an alternative perspective that “starts from the standpoint of the agent”. Instead of my beliefs being about the universe, my beliefs are about my experiences and thoughts.
I generally nod along to a lot of the ‘scientific’ discussion in the 21st century about how the universe works and how reasonable the whole thing is. But I don’t feel I knew in-advance to expect the world around me to operate on simple mathematical principles and be so reasonable. I could’ve woken up in the Harry Potter universe of magic wands and spells. I know I didn’t, but if I did, I think I would be able to act in it? I wouldn’t constantly be falling over myself because I don’t understand how 1 + 1 = 2 anymore? There’s some place I’m starting from that builds up to an understanding of the universe, and doesn’t sneak it in as an ‘assumption’.
And this is what this new perspective does that Abram lays out in technical detail. (I don’t follow it all, for instance I don’t recall why it’s important that the former view assumes that utility is computable.) In conclusion, this piece is a key step from the existing philosophy of agents to the philosophy of embedded agents, or at least it was for me, and it changes my background perspective on rationality. It’s the only post in the early vote that I gave +9.
(This review is taken from my post Ben Pace’s Controversial Picks for the 2020 Review.)
Partly because the “reductive utility” view is made a bit more extreme than it absolutely had to be. Partly because I think it’s extremely natural, in the “LessWrong circa 2014 view”, to say sentences like “I don’t even know what it would mean for humans to have uncomputable utility functions—unless you think the brain is uncomputable”. (I think there is, or at least was, a big overlap between the LW crowd and the set of people who like to assume things are computable.) Partly because the post was directly inspired by another alignment researcher saying words similar to those, around 2019.
Without this assumption, the core of the “reductive utility” view would be that it treats utility functions as actual functions from actual world-states to real numbers. These functions wouldn’t have to be computable, but since they’re a basic part of the ontology of agency, it’s natural to suppose they are—in exactly the same way it’s natural to suppose that an agent’s beliefs should be computable, and in a similar way to how it seems natural to suppose that physical laws should be computable.
Ah, I guess you could say that I shoved the computability assumption into the reductive view because I secretly wanted to make 3 different points:
We can define beliefs directly on events, rather than needing “worlds”, and this view seems more general and flexible (and closer to actual reasoning).
We can define utility directly on events, rather than “worlds”, too, and there seem to be similar advantages here.
In particular, uncomputable utility functions seem pretty strange if you think utility is a function on worlds; but if you think it’s defined as a coherent expectation on events, then it’s more natural to suppose that the underlying function on worlds (that would justify the event expectations) isn’t computable.
Rather than make these three points separately, I set up a false dichotomy for illustration.
Also worth highlighting that, like my post Radical Probabilism, this post is mostly communicating insights that it seems Richard Jeffrey had several decades ago.