I’m no longer sure that I buy Dutch book arguments, in full generality, and this makes me skeptical of the “utility function” abstraction
Thesis: I now think that utility functions might be a pretty bad abstraction for thinking about the behavior of agents in general, including highly capable agents.
[Epistemic status: half-baked, elucidating an intuition. Possibly what I’m saying here is just wrong, and someone will helpfully explain why.]
Over the past years, in thinking about agency and AI, I’ve taken the concept of a “utility function” for granted as the natural way to express an entity’s goals or preferences.
Of course, we know that humans don’t have well-defined utility functions (they’re inconsistent, and subject to all kinds of framing effects), but that’s only because humans are irrational. To the extent that a thing acts like an agent, its behavior corresponds to some utility function. That utility function might not be explicitly represented, but if an agent is rational, there’s some utility function that reflects its preferences.
Given this, I might be inclined to scoff at people who scoff at “blindly maximizing” AGIs. “They just don’t get it”, I might think. “They don’t understand why agency has to conform to some utility function, and an AI would try to maximize expected utility.”
Currently, I’m not so sure. I think that talking in terms of utility functions is biting a philosophical bullet, and importing some unacknowledged assumptions. Rather than being the natural way to conceive of preferences and agency, I think utility functions might be only one possible abstraction, and one that emphasizes the wrong features, giving a distorted impression of what agents, in general, are actually like.
I want to explore that possibility in this post.
Before I begin, I want to make two notes.
First, all of this is going to be hand-wavy intuition. I don’t have crisp knock-down arguments, only a vague discontent. But it seems like more progress will follow if I write up my current, tentative stance, even without formal arguments.
Second, I don’t think that utility functions being a poor abstraction for agency in the real world has much bearing on whether there is AI risk. As I’ll discuss, it might change the shape and tenor of the problem, but highly capable agents with alien seed preferences are still likely to be catastrophic to human civilization and human values. I mention this because the sentiments expressed in this essay are causally downstream of conversations that I’ve had with skeptics about whether there is AI risk at all. So I want to highlight: I think I was mistakenly overlooking some philosophical assumptions, but that is not a crux.
Is coherence overrated?
The tagline of the “utility” page on Arbital is “The only coherent way of wanting things is to assign consistent relative scores to outcomes.”
This is true as far as it goes, but to me, at least, that sentence implies a sort of dominance of utility functions. “Coherent” is a technical term, with a precise meaning, but it also has connotations of “the correct way to do things”. If someone’s theory of agency is incoherent, that seems like a mark against it.
But it is possible to ask, “What’s so good about coherence anyway? Maybe being incoherent isn’t actually so bad.”
The standard reply, of course, is that if your preferences are incoherent, you’re Dutch-bookable, and someone will pump you for money.
But I’m not satisfied with this argument. It isn’t obvious that being Dutch booked is a bad thing.
In Coherent Decisions Imply Consistent Utilities, Eliezer says:
Suppose I tell you that I prefer pineapple to mushrooms on my pizza. Suppose you’re about to give me a slice of mushroom pizza; but by paying one penny ($0.01) I can instead get a slice of pineapple pizza (which is just as fresh from the oven). It seems realistic to say that most people with a pineapple pizza preference would probably pay the penny, if they happened to have a penny in their pocket.
After I pay the penny, though, and just before I’m about to get the pineapple pizza, you offer me a slice of onion pizza instead—no charge for the change! If I was telling the truth about preferring onion pizza to pineapple, I should certainly accept the substitution if it’s free.
And then to round out the day, you offer me a mushroom pizza instead of the onion pizza, and again, since I prefer mushrooms to onions, I accept the swap.
I end up with exactly the same slice of mushroom pizza I started with… and one penny poorer, because I previously paid $0.01 to swap mushrooms for pineapple.
This seems like a qualitatively bad behavior on my part.
Eliezer asserts that this is “qualitatively bad behavior.” But I think that this is biting a philosophical bullet.
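To make the mechanics of the quoted money pump concrete, here is a minimal sketch in Python (my own toy illustration; the topping names and penny fee just mirror the example above). An agent with cyclic preferences over toppings accepts every “upgrade” it is offered and ends up back where it started, a few cents poorer:

```python
# Money-pump sketch: cyclic preferences (pineapple > mushroom, onion > pineapple,
# mushroom > onion) let a trader drain the agent a penny per cycle.

# prefers[offered] is the set of toppings the agent would give up to get `offered`.
prefers = {
    "pineapple": {"mushroom"},
    "onion": {"pineapple"},
    "mushroom": {"onion"},
}

def accepts_trade(holding: str, offered: str) -> bool:
    """The agent accepts any swap to a topping it strictly prefers to what it holds."""
    return holding in prefers[offered]

wallet = 1.00          # dollars
holding = "mushroom"   # starting slice
fee = 0.01             # charged only on the mushroom-to-pineapple swap

for offered in ["pineapple", "onion", "mushroom"] * 3:  # three trips around the cycle
    if accepts_trade(holding, offered):
        if offered == "pineapple":
            wallet -= fee
        holding = offered

print(holding, round(wallet, 2))  # "mushroom 0.97": same slice, three cents poorer
```

The sketch only counts as exploitation, though, if we have already decided that preferences must attach to states of the world rather than to the swaps themselves, which is where the next few paragraphs push back.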
As an intuition pump: In the actual case of humans, we seem to get utility not from states of the world, but from changes in states of the world. So it isn’t unusual for a human to pay to cycle between states of the world.
For instance, I could imagine a human being hungry, eating a really good meal, feeling full, and then happily paying a fee to be instantly returned to their hungry state, so that they can enjoy eating a good meal again.
This is technically a Dutch booking (which do they prefer, being hungry or being full?), but from the perspective of the agent’s values, there’s nothing qualitatively bad about it. Instead of the Dutch booker pumping money from the agent, he’s offering a useful and appreciated service.
Of course, we can still back out a utility function from this dynamic: instead of assigning scores to world states, we can assign them to changes from one world state to another.
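As a toy illustration of that move (again my own sketch, with made-up numbers), the same “pay to cycle” behavior stops looking like exploitation once the utility function is defined over transitions between states rather than over the states themselves:

```python
# Sketch: utility attached to *transitions* between states rather than to states.
# An agent that pays to go from "full" back to "hungry" is not being pumped;
# by its own transition-based lights, every cycle is a net gain.

transition_utility = {
    ("hungry", "full"): 10.0,   # enjoying a good meal (illustrative value)
    ("full", "hungry"): -1.0,   # paying the fee to be made hungry again
}

def cycle_value(n_cycles: int) -> float:
    """Total transition-utility from cycling hungry -> full -> hungry n times."""
    per_cycle = (transition_utility[("hungry", "full")]
                 + transition_utility[("full", "hungry")])
    return n_cycles * per_cycle

print(cycle_value(3))  # 27.0: the "money pump" is a service the agent gladly pays for
```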
But that just passes the buck up one level. I see no reason in principle why an agent couldn’t prefer to rotate between different changes in the world, just as well as rotating between different states of the world.
But this also misses the central point. I think you can always construct a utility function that represents some behavior. But if one is no longer compelled by Dutch book arguments, this raises the question of why we would want to do that. If coherence is no longer a desideratum, it’s no longer clear that a utility function is the natural way to express preferences.
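One way to see how cheap “you can always construct one” is (a deliberately trivial construction of my own, purely to make the claim concrete): score entire behavioral histories, and give the history the agent actually produces the top score. The resulting utility function “represents” the behavior while doing no predictive work at all:

```python
# Trivial construction: any behavior is "represented" by a utility function over
# complete action histories that ranks the observed history above everything else.

from typing import Sequence, Callable

def make_trivial_utility(observed_history: Sequence[str]) -> Callable[[Sequence[str]], float]:
    """Return a utility function that the observed behavior trivially 'maximizes'."""
    observed = tuple(observed_history)
    def utility(history: Sequence[str]) -> float:
        return 1.0 if tuple(history) == observed else 0.0
    return utility

u = make_trivial_utility(["swap", "swap", "swap"])     # e.g. the pizza-cycling agent
print(u(["swap", "swap", "swap"]), u(["refuse"] * 3))  # 1.0 0.0
```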
And I wonder, maybe this also applies to agents in general, or at least the kind of learned agents that humans are likely to build via gradient descent.
Maximization behavior
I think this matters, because many of the classic AI risk arguments go through a claim that maximization behavior is convergent. If you try to build a satisficer, there are a number of pressures for it to become a maximizer of some kind. (See this Rob Miles video, for instance)
I think that most arguments of that sort depend on an agent acting according to an expected utility maximization framework. And if utility maximization turns out not to be a good abstraction for agents in the real world, I don’t know whether these arguments still go through.
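To pin down the distinction those arguments trade on, here is a minimal, purely illustrative sketch of the two decision rules (the toy utility and threshold are mine):

```python
# Sketch of the two decision rules at issue: a maximizer always takes an argmax;
# a satisficer takes the first option that clears a "good enough" bar.

def maximize(options, utility):
    """Expected-utility maximizer: return an option with the highest utility."""
    return max(options, key=utility)

def satisfice(options, utility, threshold):
    """Satisficer: return the first option that is good enough, else the best seen."""
    for option in options:
        if utility(option) >= threshold:
            return option
    return max(options, key=utility)

options = list(range(10))
utility = lambda x: x          # toy utility: bigger is better
print(maximize(options, utility))        # 9
print(satisfice(options, utility, 5))    # 5
```

The classic worry is that optimization pressure pushes the second rule toward the first; whether that worry survives without the expected-utility framing is exactly what’s in question here.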
I posit that straightforward maximizers are rare in the multiverse, and that most evolved or learned agents are better described by some other abstraction.
If not utility functions, then what?
If we accept for the time being that utility functions are a warped abstraction for most agents, what might a better abstraction be?
I don’t know. I’m writing this post in the hopes that others will think about this question and perhaps come up with productive alternative formulations.
I’ll post some of my half-baked thoughts on this question shortly.
I’ve long been somewhat skeptical that utility functions are the right abstraction.
My argument is also rather handwavy, being something like “this is the wrong abstraction for how agents actually function, so even if you can always construct a utility function and say some interesting things about its properties, it doesn’t tell you the thing you need to know to understand and predict how an agent will behave”. In my mind, I liken it to trying to code in functional programming languages on modern computers: you can do it, but you’re also fighting an uphill battle against the way the computer is physically implemented, so don’t be surprised if things get confusing.
And much like in the utility function case, people still program in functional languages because of the benefits they confer. I think the same is true of utility functions: they confer some big benefits when trying to reason about certain problems, so we accept the tradeoffs of using them. I think that’s fine so long as we have a morphism to other abstractions that will work better for understanding the things that utility functions obscure.
Utility functions are especially problematic for modelling the behaviour of agents with bounded rationality, or agents for whom reasoning itself has costs. That includes every physically realizable agent.
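A minimal sketch of the kind of thing this points at (my own toy model, with made-up costs): once evaluating options itself costs something, the behavior that falls out need not look like maximizing any fixed utility over outcomes:

```python
# Toy model of costly reasoning: evaluating each option costs something, so the
# agent stops deliberating early and often settles for a non-maximal choice.
import random

def deliberate(options, utility, eval_cost, budget):
    """Evaluate options until the thinking budget runs out, then take the best seen."""
    best, best_value, spent = None, float("-inf"), 0.0
    for option in options:
        if spent + eval_cost > budget:
            break  # further deliberation costs more than the agent will spend
        spent += eval_cost
        value = utility(option)
        if value > best_value:
            best, best_value = option, value
    return best, spent

random.seed(0)
options = [random.random() for _ in range(100)]
choice, _ = deliberate(options, utility=lambda x: x, eval_cost=1.0, budget=5.0)
print(round(choice, 3), round(max(options), 3))  # the chosen option usually isn't the true maximum
```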
For modelling human behaviour, even considering the ideals of what we would like human behaviour to achieve, there are even worse problems. We can hope that there is some utility function consistent with the behaviour we’re modelling and just ignore cases where there isn’t, but that doesn’t seem satisfactory either.