The human utility hypothesis is much more vague than the others, and seems ultimately context-dependent. To my knowledge, the main argument in its favor is the fact that most of economics is founded on it.
I would say, rather, that the arguments in its favor are the same ones which convinced economists.
Humans aren’t well-modeled as perfect utility maximizers, but utility theory is a theory of what we can reflectively/coherently value. Economists might have been wrong to focus only on rational preferences, and have moved toward prospect theory and the like to remedy this. But it may make sense to think of alignment in these terms nonetheless.
I am not saying that it does make sense—I’m just saying that there’s a much better argument for it than “the economists did it”, and I really don’t think prospect theory addresses issues which are of great interest to alignment.
If a system is trying to align with idealized reflectively-endorsed values (similar to CEV), then one might expect such values to be coherent. The argument for this position is the combination of the various arguments for expected utility theory: VNM; money-pump arguments; the various Dutch-book arguments; Savage’s theorem; the Jeffrey-Bolker theorem; the complete class theorem. One can take these various arguments and judge them on their own terms (perhaps finding them lacking).
Arguably, you can’t fully align with inconsistent preferences; if so, one might argue that there is no great loss in making a utility-theoretic approximation of human preferences: it would be impossible to perfectly satisfy inconsistent preferences anyway, so representing them by a utility function is a reasonable compromise.
In aligning with inconsistent preferences, the question seems to be what standards to hold a system to in attempting to do so. One might argue that the standards of utility theory are among the important ones; and thus, that the system should attempt to be consistent even if humans are inconsistent.
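To make the money-pump argument mentioned above concrete, here is a minimal sketch (the items, fee, and numbers are all hypothetical): an agent with cyclic preferences A ≻ B ≻ C ≻ A, willing to pay a small fee for any swap to a strictly preferred item, can be cycled back to its starting holding while losing money on every trade.

```python
# Money-pump sketch: the agent's preferences are cyclic
# (A > B, B > C, C > A), so from any holding there is always
# some item it strictly prefers, and it will pay a fee to swap.

FEE = 1.0

# upgrade[x] = the item the agent strictly prefers to x.
# Since C > A, A > B, and B > C in the cycle:
upgrade = {"A": "C", "C": "B", "B": "A"}

def run_money_pump(start_item, money, n_trades, fee=FEE):
    """Repeatedly offer the agent its preferred swap for a fee."""
    item = start_item
    extracted = 0.0
    for _ in range(n_trades):
        # The agent strictly prefers upgrade[item], so it accepts
        # the trade "pay `fee`, receive upgrade[item]".
        item = upgrade[item]
        money -= fee
        extracted += fee
    return item, money, extracted

item, money, extracted = run_money_pump("A", money=10.0, n_trades=9)
# After 9 trades (three full cycles) the agent holds A again,
# exactly where it started, but has paid 9 units in fees.
```

The point of the sketch is only that cyclic strict preferences plus willingness to pay for upgrades yields unbounded exploitation; nothing in it depends on the particular items or fee.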
On the other hand, there are no strong arguments for representing human utility via prospect theory. It holds up better in experiments than utility theory does, but not so well that we would want to make it a bedrock assumption of alignment. The various arguments for expected utility make me somewhat happy for my preferences to be represented utility-theoretically even though they are not really like this; but, there is no similar argument in favor of a prospect-theoretic representation of my preferences. Essentially, I think one should either stick to a more-or-less utility-theoretic framework, or resort to taking a much more empirical approach where human preferences are learned in all their inconsistent detail (without a background assumption such as prospect theory).
That’s still a false dichotomy, but I think it is an appropriate response to many critiques of utility theory.
I don’t think you’re putting enough weight on what REALLY convinced economists, which was the tractability that assuming utility provides, and their enduring physics envy. (But to be fair, who wouldn’t wish that their domain was as tractable as Newtonian physics ended up being.)
But yes, utility is a useful enough first approximation for humans that it’s worth using as a starting point. But only as a starting point. Unfortunately, too many economists are instead busy building castles on their assumptions, without trying to work with better approximations. (Yes, there is prospect theory and related work. But the math is hard, so the microeconomic foundations of macroeconomics mostly just aren’t being rebuilt.)
I certainly agree that this isn’t a good reason to take humans’ inability to approximate a utility function into account when modeling AGI. But it’s absolutely critical when discussing what it means to align with human “values,” and figuring out what that looks like. That’s why I think far more discussion of this is needed.
Yeah, I don’t 100% buy the arguments which I gave in bullet-points in my previous comment.
But I guess I would say the following:
I expect to basically not buy any descriptive theory of human preferences. It doesn’t seem likely that we could find a super-prospect theory which really successfully codified the sorts of inconsistencies we see in human values, and then reap some benefit for AI alignment.
So it seems like what you want to do instead is make very few assumptions at all. Assume that the human can do things like answer questions, but don’t expect responses to be consistent even in the most basic sense of “the same answer to the same question”. Of course, this can’t be the end of the story, since we need to have a criterion—what it means to be aligned with such a human. But hopefully the criterion would also be as agnostic as possible. I don’t want to rely on specific theories of human irrationality.
So, when you say you want to see more discussion of this because it is “absolutely critical”, I am curious about your model of what kind of answers are possible and useful.
My current best understanding is that if we assume people have arbitrary inconsistencies, it will be impossible to do better than satisficing on different human values by creating near-Pareto improvements for intra-human values. But inconsistent values don’t even allow Pareto improvements! Any change makes things incomparable. Given that, I think we do need a super-prospect theory that explains in a systematic way what humans do “wrong,” so that we can pick which human preferences an AI should respect and which can be ignored.
For instance, I love my children, and I like chocolate. I’m also inconsistent in my preferences, in ways that differ between the two: at a given moment, I’m much more likely to be upset with my kids and not want them around than I am to not want chocolate. I want the AI to respect my greater but inconsistent preference for my children over the more consistent preference for candy. I don’t know how to formalize this in a way that would generalize, which seems like a problem. The same issue arises for time preference and other typical inconsistencies: they are either inconsistent, or at least can be exploited by an AI whose model doesn’t try to resolve those inconsistencies.
With a super-prospect theory, I would hope we may be able to define a CEV or similar, which allows large improvements by ignoring the fact that those improvements are bad for some tiny part of my preferences. And perhaps the AI should find the needed super-prospect theory and CEV—but I am deeply unsure about the safety of doing this, or the plausibility of trying to solve it first.
(Beyond this, I think we need to expect that values will differ between humans, and we can keep things safe by insisting on near-Pareto improvements: only changes that are Pareto improvements with respect to a very large portion of people, with relatively minor dis-improvements for the remainder. But that’s a different discussion.)
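The near-Pareto criterion in the parenthetical above can be sketched as a simple predicate (the threshold values here are placeholders for illustration, not proposals): a change passes if at least some large fraction of people weakly gain, and every loss among the remainder stays below some small bound.

```python
def is_near_pareto_improvement(old_utils, new_utils,
                               min_gainer_fraction=0.95,
                               max_loss=0.01):
    """True if moving from old_utils to new_utils is weakly good for
    at least `min_gainer_fraction` of people, and any losses among
    the remainder are smaller than `max_loss`."""
    deltas = [new - old for old, new in zip(old_utils, new_utils)]
    non_losers = sum(1 for d in deltas if d >= 0)
    if non_losers / len(deltas) < min_gainer_fraction:
        return False
    return all(d > -max_loss for d in deltas)

# Everyone gains except one person, whose loss is tiny:
old = [1.0] * 100
new = [1.1] * 99 + [0.995]
# is_near_pareto_improvement(old, new) -> True

# Same gains, but one person loses a lot:
# is_near_pareto_improvement(old, [1.1] * 99 + [0.5]) -> False
```

A strict Pareto check is the special case `min_gainer_fraction=1.0, max_loss=0.0`; relaxing both parameters is what makes the criterion usable with many people whose values conflict.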
If a system is trying to align with idealized reflectively-endorsed values (similar to CEV), then one might expect such values to be coherent.
That’s why I distinguished between the hypotheses of “human utility” and CEV. It is my vague understanding (and I could be wrong) that some alignment researchers see it as their task to align AGI with current humans and their values, thinking the “extrapolation” less important or that it will take care of itself, while others consider extrapolation an important part of the alignment problem. For the former group, human utility is more salient, while the latter probably cares more about the CEV hypothesis (and the arguments you list in favor of it).
Arguably, you can’t fully align with inconsistent preferences
My intuitions tend to agree, but I’m also inclined to ask “why not?” E.g., even if my preferences are absurdly cyclical, if we get AGI to imitate me perfectly (or me + faster thinking + more information), under what sense of the word is it “unaligned” with me? More generally, what is it about these other coherence conditions that prevents meaningful “alignment”? (Maybe this opens a big discursive can of worms, but I actually haven’t seen it discussed on a serious level, so I’m quite happy to just read references.)
Essentially, I think one should either stick to a more-or-less utility-theoretic framework, or resort to taking a much more empirical approach where human preferences are learned in all their inconsistent detail (without a background assumption such as prospect theory).
That’s still a false dichotomy, but I think it is an appropriate response to many critiques of utility theory.
Hadn’t thought about it this way. Partially updated (but still unsure what I think).
I didn’t reply to this originally, probably because I think it’s all pretty reasonable.
That’s why I distinguished between the hypotheses of “human utility” and CEV. It is my vague understanding (and I could be wrong) that some alignment researchers see it as their task to align AGI with current humans and their values, thinking the “extrapolation” less important or that it will take care of itself, while others consider extrapolation an important part of the alignment problem.
My thinking on this is pretty open. In some sense, everything is extrapolation (you don’t exactly “currently” have preferences, because every process is expressed through time...). But OTOH there may be a strong argument for doing as little extrapolation as possible.
My intuitions tend to agree, but I’m also inclined to ask “why not?” e.g. even if my preferences are absurdly cyclical, but we get AGI to imitate me perfectly (or me + faster thinking + more information)
Well, imitating you is not quite right. (EG, the now-classic example introduced with the CIRL framework: you want the AI to help you make coffee, not learn to drink coffee itself.) Of course maybe it is imitating you at some level in its decision-making, like, imitating your way of judging what’s good.
under what sense of the word is it “unaligned” with me?
I’m thinking of things like: will it disobey requests which it understands and is capable of fulfilling? Will it fight you? Not that those things are universally wrong to do, but avoiding them could be part of the kind of alignment we’re shooting for, and inconsistencies do seem to create trouble there. Presumably, if we know that it might fight us, we would want some kind of firm statement about what kind of “better” reasoning would make it do so (e.g., it might temporarily fight us if we were severely deluded in some way, but we want pretty high standards for that).
“Arguably, you can’t fully align with inconsistent preferences”
My intuitions tend to agree, but I’m also inclined to ask “why not?” E.g., even if my preferences are absurdly cyclical, if we get AGI to imitate me perfectly (or me + faster thinking + more information), under what sense of the word is it “unaligned” with me? More generally, what is it about these other coherence conditions that prevents meaningful “alignment”? (Maybe this opens a big discursive can of worms, but I actually haven’t seen it discussed on a serious level, so I’m quite happy to just read references.)
I’ve been thinking about whether you can have an AGI that only aims for Pareto improvements, or some weaker formulation of that, in order to align with inconsistent values among groups of people. This is strongly based on Eric Drexler’s thoughts on what he has called “Paretotopia.” (I haven’t gotten anywhere thinking about this because I’m spending my time on other things.)
Yeah, I think something like this is pretty important. Another reason is that humans inherently don’t like to be told, top-down, that X is the optimal solution. A utilitarian AI might redistribute property forcefully, where a Pareto-improving AI would seek to compensate people.
An even more stringent requirement which seems potentially sensible: only Pareto improvements which both parties understand and endorse. (I.e., there should be something like consent.) This seems very sensible with small numbers of people but, unfortunately, infeasible for large numbers of people (given the way all actions have side-effects for many, many people).
To the extent that human preferences are inconsistent, it may make more sense to treat humans as fragmented multi-agents, and combine the preferences of the sub-agents to get an overall utility function—essentially aligning with one inconsistent human the same way one would align with many humans. This approach might be justified by Harsanyi’s theorem.
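A minimal sketch of this Harsanyi-style aggregation, with entirely hypothetical sub-agents and weights: represent the inconsistent person as sub-agents, each with its own utility function, and evaluate outcomes by a fixed weighted sum of the sub-agents’ utilities.

```python
# Harsanyi-style aggregation: the whole-person utility is a fixed
# weighted sum of sub-agent utilities.

def aggregate(subagent_utils, weights):
    """Combine sub-agent utility functions into one utility function."""
    def total_utility(outcome):
        return sum(w * u(outcome)
                   for u, w in zip(subagent_utils, weights))
    return total_utility

# Two hypothetical sub-agents of one person: one values time with
# the kids, one values chocolate.
kids = lambda outcome: 10.0 if outcome["with_kids"] else 0.0
chocolate = lambda outcome: 1.0 if outcome["chocolate"] else 0.0

u = aggregate([kids, chocolate], weights=[0.9, 0.1])
# u({"with_kids": True, "chocolate": False}) -> 9.0
# u({"with_kids": False, "chocolate": True}) -> 0.1
```

The resulting function is consistent by construction, which is the appeal; the hard part this sketch hides is choosing the weights, which is exactly where the inconsistency gets resolved by fiat.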
That all seems pretty fair.
See my other reply about pseudo-Pareto improvements—but I think the “understood + endorsed” idea is really important, and worth further thought.