What Would a Utopia-Maximizer Look Like?

When we talk of aiming for the good future for humanity – whether by aligning AGI or any other way – it’s implicit that there are some futures that “humanity” as a whole would judge as good. That in some (perhaps very approximate) sense, humanity could be viewed as an agent with preferences, and that our aim is to satisfy said preferences.

But is there a theoretical basis for this? Could there be? What would it look like?

Is there a meaningful frame in which humanity can be viewed as optimizing for its purported preferences across history?

Is it possible or coherent to imagine a wrapper-mind set to the task of maximizing for the utopia, whose activity we’d actually endorse?

This post aims to sketch out answers to these questions. In the process, it also outlines how my current models of basic value reflection and extrapolation work.


Informal Explanation

Basic Case

Is a utopia that’d be perfect for everyone possible?

The short and obvious answer is no. Our civilization contains omnicidal maniacs and true sadists, whose central preferences are directly at odds with the preferences of most other people. Their happiness is diametrically opposed to other people’s.

Less extremely, it’s likely that most individuals’ absolutely perfect world would fail to perfectly satisfy most others. As a safe example, we could imagine someone who loves pizza, yet really, really hates seafood, to such an extent that they’re offended by the mere knowledge that seafood exists somewhere in the world. Their utopia would not have any seafood anywhere – and that would greatly disappoint seafood-lovers. If we now postulate the existence of a pizza-hating seafood-lover… Well, it would seem that their utopias are directly at odds.[1]

Nevertheless, there are worlds that would make both of them happy enough. A world in which everyone is free to eat food that’s tasty according to their preferences, and is never forced to interact with the food they hate. Both people would still dislike the fact that their hated dishes exist somewhere. But as long as food-hating is not their core value that’s dominating their entire personality, they’d end up happy enough.

Similarly, it intuitively feels that worlds which are strictly better according to most people’s entire arrays of preferences are possible. Empowerment is one way to gesture at it – a world in which each individual is simply given more instrumental resources, a greater ability to satisfy whatever preferences they happen to have. (With some limitations on impacting other people, etc.)

But is it possible to arrive at this idea from first principles? By looking at humanity and somehow “eliciting”/“agglomerating” its preferences formally? A process like CEV? A target to hit that’s “objectively correct” according to humanity’s own subjective values, rather than your subjective interpretation of its values?

Paraphrasing, we’re looking for a utility function such that the world-state maximizing it is ranked very highly by the standards of most humans’ preferences; a utility function that’s correlated with the “agglomeration” of most humans’ preferences.

Let’s consider what we did in the foods example. We discovered two disparate preferences, and then we abstracted up from them – from concrete ideas like “seafood” and “pizza”, to an abstraction over them: food-in-general. And we’ve discovered that, although the individuals’ preferences disagreed on the concrete level, they ended up basically the same at the higher level. Trivializing, it turned out that a seafood-optimizer and a pizza-optimizer could both be viewed as tasty-food-optimizers.

The hypothesis, then, would go as follows: at some very high abstraction level, the level of global matters and fundamental philosophy, most humans’ preferences converge to the same utility function over some variable. For example, “maximize eudaimonia” or “human empowerment” or “human flourishing”.

There’s a counting argument that slightly supports this. Higher abstraction levels are less expressive: they include fewer objects/variables (fewer countries than people, fewer stars than atoms, fewer galaxies than stars) and these objects have fewer states (fewer moods than the ways your brain’s atoms could be arranged). So the mapping-up of values to them isn’t injective. Thus, some conflicting low-level preferences would map to the same preference over the same high-level variable.
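The counting argument can be illustrated with a toy sketch. Everything in it is a made-up assumption for illustration: the item names, the `LOW_TO_HIGH` table, and the `lift` helper are not part of any formal model. It only shows that when there are more low-level items than high-level variables, some distinct low-level preferences must land on the same high-level variable.

```python
# Hypothetical toy abstraction: each low-level item is assigned to the
# high-level variable that abstracts over it.
LOW_TO_HIGH = {
    "pizza": "tasty food",
    "seafood": "tasty food",
    "chess": "games",
    "poker": "games",
    "go": "games",
}


def lift(item: str) -> str:
    """Map a preference over a low-level item to its high-level variable."""
    return LOW_TO_HIGH[item]


low_level = sorted(LOW_TO_HIGH)
high_level = sorted(set(LOW_TO_HIGH.values()))

# Pigeonhole: with more low-level items than high-level variables,
# the lifting map cannot be injective, so some pairs must collide.
collisions = [
    (a, b)
    for i, a in enumerate(low_level)
    for b in low_level[i + 1:]
    if lift(a) == lift(b)
]

print(len(low_level), "low-level items ->", len(high_level), "high-level variables")
print("colliding pairs:", collisions)
```

Note that a collision only says two preferences end up concerning the same high-level variable; whether they also agree on what that variable should be is the substantive part of the hypothesis.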

That is, of course, a hypothesis. Nevertheless, the mere fact that we can coherently state it is reassuring regarding our ability to eventually test it.

Is Humanity a Utopia-Maximizer?

Maybe. I don’t strongly believe in this, but here’s a sketch:

If human values indeed converge like this, then perhaps humanity can be viewed as an approximate agent that’s been approximately optimizing for building a utopia for its entire history. But those “approximately”s do a lot of work; there’s plenty of noise involved.

The primary issue is that the distribution of power between its constituents is non-uniform and changes dynamically. At different times, people with different preferences amass disproportionate amounts of resources (often by orders of magnitude), and “deviate” humanity’s path away from the hypothetical averaged-out course, in their individual preferred directions.

But the balance of power frequently changes, and how technologies change it is relatively unpredictable. So potentially these effects actually cancel out on average, and humanity stays roughly on-target? (Pizza-lovers being in charge for 100 years are replaced by seafood-lovers ruling for 100 years; and while they cancel out their specific preferences, both end up advancing humanity towards having tasty food around.)

It would explain why our world, like, actually does mostly get better over time, as well as provide some grounding for the idea of “moral progress”.

Nevertheless, the approximations there may be extremely noisy, to the point that looking at things this way may not be useful.

Is a Utopia-Maximizer Desirable?

Assuming this hypothetical utopian utility function exists and we derive it, would it be possible to then plug it into some idealized agent/​wrapper-mind, and not be horrified at the results?

On my view, the answer is obviously yes. There’s a bunch of confusions around this idea that I’d like to address; mainly around what “a fixed goal” implies.

Consider a paperclip-maximizer. It wants the universe to be full of paperclips. If it gets its way, it’d reassemble all matter, including itself, into them.

Note, however, that it would not necessarily aim to freeze them in time. Intuitively, it would be fine with the paperclips still orbiting each other, impacting each other, and so on. Moreover, by the very definition of “a paperclip”, there’d be all sorts of subatomic processes happening within them. The paperclip-maximizer would want those to run their natural course. Its utility would stay constant as that happens; invariant under these transformations of the world-state.

Similarly, the maximum of a utopia-maximizer would be defined over an enormous equivalence class of world-states. It would not aim to freeze humanity in time, or impose some specific unchanging social order, or tile the universe with copies of specific people that it deemed most optimal for experiencing happiness, etc.

Its utility would be invariant under individual humans changing over time, under them forging new relationships, under societal structures changing and events generally moving forward. As long as those processes don’t wander into some nightmarish outcomes. That’s the main function it’d provide: a sort of “safety net”, lower-bounding how bad things could get. (And currently, they are very, very bad.)

Indeed, being a wrapper-mind doesn’t even disqualify you from being a person (as nostalgebraist’s post claims). Your utility can be invariant and maximal under many possible internal states. You can grow and change as a person, even if you have a fixed hard-wired goal that you ultimately serve.

Similarly, it’s not unreasonable to suggest that most humans are (effectively isomorphic to) wrapper-minds.


Formal Model

Suppose that on your hands, you have an agent with a vast array of disparate preferences. It’s a mess. They’re stored in different formats (explicit vs. implicit, deontological vs. consequentialist, instrumental vs. terminal...), defined on different abstraction levels, and often conflict with each other.

You want to optimize them, straighten them out. Resolve whatever conflicts they have, translate them to whatever domains you’re working in, extrapolate them (to plan for the long-term), concretize them (to figure out what specific actions a philosophy demands of you), agglomerate them...

Why? Performance optimization. Sure, you could just do babble-and-prune search on your world-model, figuring out what would satisfy those preferences by brute force. But that’d be ruinously compute-intensive. You’d like to cache some of them, derive some heuristics from them, resolve conflicts to stop wasting time on those, etc.

How can you sort it out? What target are you even aiming at?

Well, the purpose of utility functions/​preferences is to recommend what actions to take. Indeed, that’s their main contribution: they define a preference ordering over candidate plans/​actions, either directly (deontology), or by way of looking at what worlds a given action would bring about (consequentialism).

Thus, the correct process of value-system performance-optimization would be made up of transformations such that the preference ordering over actions is invariant under them. I. e., the value-optimized agent would always take the same actions in any given situation as the initial agent (if the latter were given sufficient time to think).

Let’s see where that can get us.

Deontological Preferences

To start off, deontological preferences are isomorphic to utility functions, and utility functions are isomorphic to deontological preferences. They’re related by the softmax function:

$$P(a) = \frac{e^{U(a)}}{\sum_{a'} e^{U(a')}}, \qquad U(a) = \log P(a) + \text{const}$$

Take a given deontological rule, like “killing is bad”. Let’s say we view it as a constraint on the allowable actions; or, in other words, a probability distribution over your actions that “predicts” that you’re very likely/unlikely to take specific actions. The above transform would let us translate it into a utility function over actions.

The other way around, a utility function can be viewed as defining some “target distribution” for the variable over which it’s defined. Maximizing expected utility would then be equivalent to minimizing the cross-entropy between that target distribution and the real distribution.

And that’s not simply an overly abstract trick: it’s how human minds are actually hypothesized to work. See Friston’s predictive-processing framework in neuroscience (you can start from these comments).

This also covers shards. They’re self-executing heuristics bidding for specific actions over others. Thus, each could be transformed into a utility function without loss of information.

That’s not at odds with how deontology is usually presented, either. Deontologists reject utility-maximization in the sense that they refuse to engage in utility-maximizing calculations using their conscious intelligence. But similar dynamics can still be at play “under the hood”.
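The softmax correspondence described in this section can be checked numerically. The sketch below is only an illustration under stated assumptions: the action names, the probabilities in `rule` and `real`, and all helper names are hypothetical. It converts a rule-as-distribution into a utility function (log-probability, up to an additive constant), converts it back, and confirms that expected utility equals the negative cross-entropy between the real and target distributions.

```python
import math


def utility_to_distribution(utility: dict[str, float]) -> dict[str, float]:
    """Softmax: P(a) = exp(U(a)) / sum over a' of exp(U(a'))."""
    shift = max(utility.values())  # subtract the max for numerical stability
    weights = {a: math.exp(u - shift) for a, u in utility.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def distribution_to_utility(dist: dict[str, float]) -> dict[str, float]:
    """Inverse direction: U(a) = log P(a), defined up to an additive constant.

    A strictly forbidden action (P = 0) would need utility -infinity;
    this sketch assumes every probability is positive.
    """
    return {a: math.log(p) for a, p in dist.items()}


def ranking(scores: dict[str, float]) -> list[str]:
    """Actions sorted from most to least preferred."""
    return sorted(scores, key=scores.get, reverse=True)


def expected_utility(real: dict[str, float], utility: dict[str, float]) -> float:
    return sum(real[a] * utility[a] for a in real)


def cross_entropy(real: dict[str, float], target: dict[str, float]) -> float:
    return -sum(real[a] * math.log(target[a]) for a in real)


# Hypothetical deontological rule, expressed as a distribution over actions.
rule = {"help": 0.70, "ignore": 0.29, "harm": 0.01}
u = distribution_to_utility(rule)
roundtrip = utility_to_distribution(u)

# Hypothetical "real" distribution over the actions actually taken.
real = {"help": 0.5, "ignore": 0.4, "harm": 0.1}

print(ranking(u))
print(expected_utility(real, u), -cross_entropy(real, rule))
```

With the utility defined as the log of the target distribution, the last two printed numbers coincide, which is the sense in which maximizing expected utility and minimizing cross-entropy are the same operation here.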

Value Conflict Resolution

Imagine an agent having two utility functions, $U_1$ and $U_2$. It’s optimizing for their sum, $U_1 + U_2$. If the values are in conflict, if taking an action that maximizes $U_1$ hurts $U_2$ and vice versa, well, one of them almost surely spits out a higher value, so the maximization of $U_1 + U_2$ is still well-defined.

That’s roughly how humans do work in practice. If we face a value conflict, we hesitate a bit (calculating the sum, the “winner”), but ultimately end up taking some action that we endorse.

… unless we hesitate too long, and time chooses for us. Or if we know we have to take action fast, and so decide to use some very rough approximations – and potentially make a mistake which we later regret.

Thus, there’s purely practical value in reducing the number of internal conflicts. Finding a value $U_3$ such that, for all situations, it has the same preference ordering as $U_1 + U_2$, but its computational complexity is much lower.
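A minimal sketch of what “same preference ordering, lower complexity” could mean in practice. The action set, the numbers, and the names `U1`, `U2`, `U3`, and `same_ordering` are all hypothetical choices for illustration. The idea is that a cached ranking can stand in for the sum of two conflicting utility functions as long as it orders every pair of actions identically.

```python
from itertools import combinations


def same_ordering(u, v, actions) -> bool:
    """True if u and v rank every pair of actions the same way (ties included)."""
    def sign(x: float) -> int:
        return (x > 0) - (x < 0)

    return all(
        sign(u(a) - u(b)) == sign(v(a) - v(b))
        for a, b in combinations(actions, 2)
    )


actions = ["rest", "work", "volunteer", "steal"]

# Two hypothetical values that disagree with each other.
U1 = {"rest": 1.0, "work": 3.0, "volunteer": 4.0, "steal": -5.0}.get
U2 = {"rest": 2.0, "work": 0.5, "volunteer": 0.0, "steal": 3.0}.get


def u_sum(action: str) -> float:
    """The conflicted agent's effective utility: evaluate both values and add."""
    return U1(action) + U2(action)


# A cached replacement: a single lookup that only stores the final ranking.
U3 = {"steal": 0, "rest": 1, "work": 2, "volunteer": 3}.get

print("U1 and U2 conflict:", not same_ordering(U1, U2, actions))
print("U3 matches U1 + U2:", same_ordering(u_sum, U3, actions))
```

In this toy case the saving is trivial (one lookup instead of two), but the same check applies when the original values are expensive to evaluate and the replacement is a cheap heuristic.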

Value Extrapolation

Value extrapolation seems to be straightforward: it’s just the reflection of the fact that the world can be viewed as a series of hierarchical ever-more-abstract models.

  1. Suppose we have a low-level model of reality $M_0$, with $n_0$ variables (atoms, objects, whatever).

  2. Suppose we “abstract up”, deriving a simpler model of the world $M_1$, with $n_1$ variables. Each variable $x^1_k$ in it is an abstraction over some set of lower-level variables $X^0_k = \{x^0_j\}$, such that $x^1_k = g_k(X^0_k)$.

    • Recap: Higher-level variables are, by definition, less expressive, i. e. the number of states they could be in is lower than the number of states the underlying system can be in. By the counting argument, that means their states are defined over (very large in practice) equivalence classes of low-level states.

    • Example: “I’m happy” is a high-level state that corresponds to a combinatorially large number of configurations my body’s atoms can be in. Stipulating “I’m happy” only constrains my low-level state up to that equivalence class.

  3. We iterate, to $M_2$, $M_3$, …, $M_N$. We derive increasingly abstract models of the world.

    • Note: $n_0 > n_1 > \dots > n_N$. Since each subsequent level is simpler, it contains fewer variables. People to social groups to countries to the civilization; atoms to molecules to macro-scale objects to astronomical objects; etc.

  4. Let’s define the function $f(x^{i+1}_k) = P(X^i_k \mid x^{i+1}_k)$. I. e.: it returns a probability distribution over the low-level variables given the state of a high-level variable that abstracts over them.

    • Note: As per (2), that only constrains the low-level system to a (very large) equivalence class of states. (Though the distribution needn’t be uniform.)

    • Example: If the world economy is in this state, how happy is my grandmother likely to be?

  5. If we view our values as a utility function $U_i$, we can “translate” our utility function from any $M_i$ to $M_{i+1}$ roughly as follows: $U_{i+1}(x^{i+1}_k) = \mathbb{E}_{X^i_k \sim f(x^{i+1}_k)}\left[U_i(X^i_k)\right]$.

    • (There’s a ton of complications there, but this expression conveys the core idea.)

… and then value extrapolation just naturally falls out of this.

Suppose we have a bunch of values at the $i$th abstraction level. Once we start frequently reasoning at the $(i+1)$th level, we “translate” our values to it, and cache the resultant functions. Since the $(i+1)$th level likely has fewer variables than the $i$th, the mapping-up is not injective: some values defined over different low-level variables end up translated to the same higher-level variable (“I like pizza and seafood” → “I like tasty food”, “I like Bob and Alice” → “I like people”). This effect only strengthens as we go up higher and higher. At $M_N$, we can plausibly end up with only one variable we value (as previously speculated, “eudaimonia” or something).
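The translation step can be sketched as taking the expected value of a lower-level utility under the distribution over low-level states implied by each high-level state. The toy model below is an assumption made purely for illustration: the variable names, the conditional probabilities, and the utility numbers are all invented. It shows two people whose low-level utilities are opposed ending up with the same utility over the high-level variable.

```python
# Hypothetical conditional distribution P(low-level state | high-level state):
# one high-level variable ("meal quality") abstracting over "dish served".
P_LOW_GIVEN_HIGH = {
    "tasty meal": {"pizza": 0.5, "seafood": 0.5, "gruel": 0.0},
    "bland meal": {"pizza": 0.0, "seafood": 0.0, "gruel": 1.0},
}


def translate_up(low_utility: dict[str, float],
                 p_low_given_high: dict[str, dict[str, float]]) -> dict[str, float]:
    """Lift a utility function one level: U_high(y) = sum_x P(x | y) * U_low(x)."""
    return {
        high: sum(p * low_utility[low] for low, p in dist.items())
        for high, dist in p_low_given_high.items()
    }


# Two hypothetical people with opposed low-level preferences.
pizza_lover = {"pizza": 10.0, "seafood": -2.0, "gruel": 0.0}
seafood_lover = {"pizza": -2.0, "seafood": 10.0, "gruel": 0.0}

print(translate_up(pizza_lover, P_LOW_GIVEN_HIGH))
print(translate_up(seafood_lover, P_LOW_GIVEN_HIGH))
```

The agreement at the higher level depends on the assumed conditional distribution; with a different `P_LOW_GIVEN_HIGH` the two lifted utilities could still differ, which is one of the complications the simple expression glosses over.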

Putting It Together

Suppose we have a human on our hands, and we want to compile all of their values into a highly abstract utility function that the human would endorse. To do so, we:

  • Transform all values into the same format. (Either utility functions or probability distributions; doesn’t really matter.)

  • Translate them around to reveal value conflicts.

  • Resolve those conflicts by finding equivalent-but-simpler utility functions.

  • Extrapolate them upwards, to the highest abstraction level.

  • We end up with[2] a distillation/compilation of that human’s entire selfhood, in the format isomorphic to a utility function. The endpoint of their moral philosophy.

… if only it were this easy.

Major Problem: Meta-Preferences

Humans have preferences not only about object-level stuff, but also about the way they do the whole value-compilation process. The above model assumed an idealized process, in the sense of deriving a utility function that would always recommend the same actions as the initial array of values, but have dramatically lower computational complexity.

However, humans have meta-values that can express arbitrarily custom preferences regarding the process of value reflection itself. We might have preferences over...

  • … basic translations. E. g., a deontologist’s refusal to take money into account when choosing whose life to save. (Refusing to translate and account for that preference.)

  • … how we extrapolate things up the abstraction levels. E. g., “I’m not going to let my petty preferences impact the future of humanity”, such that you ignore your preference for pizza when defining the AGI’s utility function (rather than biasing it towards it).

  • … how we resolve value conflicts. E. g., if we have $U_1$ = “I want to be a good person” and $U_2$ = “I’d get a thrill out of stealing something”, we often wouldn’t just tweak $U_2$ such that it still fires, but only when stealing something wouldn’t be against the society’s interests. No: we just flat-out delete $U_2$.

  • Etc.

These complications currently have me worried that there’s basically no way to elicit and compile a given human’s preferences except directly simulating their mind. No shortcuts whatsoever. (And then that simulation would be path-dependent, such that, depending on what stimuli you show the human in what order, they might end up at vastly-different-yet-equally-legitimate endpoints. But that’s a whole separate topic.)

Regardless, this doesn’t kill the core idea. I’m reasonably sure (something like) the procedures I’ve defined are still what humans use most of the time. But there are more complex cases where meta-preferences are involved, they’re often crucial, and I’m not sure there are elegant ways to handle them.

Egalitarian Agglomeration

Now onto the last step: how do we agglomerate values between different people? That is, suppose we’ve “compiled” the preferences of all individual people into a set of utility functions, and then picked just their most-abstract components, getting this set: $\{U^{(1)}, U^{(2)}, \dots, U^{(K)}\}$. How do we transform that into $U_{\text{humanity}}$?

Well, ideally, it’ll turn out that $U^{(1)} = U^{(2)} = \dots = U^{(K)} = U_{\text{humanity}}$. That’s the “strong” version of the “human value convergence hypothesis”.

What if not, though?

The naive idea would be to just proceed as we had before, and find a simpler function that recommends the same actions as the individual functions’ sum. But that has some undesirable properties, like a sensitivity to “utility monsters”. The Geometric Rationality sequence has made that point rather well.

Thus, a better target would be a function that’s equivalent to the product of individual humans’ utility functions. Maximizing the product is the same as maximizing the geometric mean, i. e. the expected log-utility of a randomly-chosen human; thus, it pushes towards distributing utility evenly across everyone. (I really recommend reading the Geometric Rationality sequence.)
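The difference between the two agglomeration rules can be seen in a small numerical sketch. The outcomes, names, and utility values below are invented for illustration, and the sketch assumes all utilities are strictly positive so that the logarithm is defined. It compares which outcome is picked by the sum and which by the geometric mean (equivalent to the product for ranking purposes).

```python
import math

# Hypothetical outcomes, each assigning a strictly positive utility to each person.
outcomes = {
    "monster_takes_all": {"alice": 1.0, "bob": 1.0, "monster": 1000.0},
    "even_split": {"alice": 50.0, "bob": 50.0, "monster": 50.0},
}


def total(utilities: dict[str, float]) -> float:
    """Naive agglomeration: the sum of individual utilities."""
    return sum(utilities.values())


def geometric_mean(utilities: dict[str, float]) -> float:
    """Product-based agglomeration, computed via the mean of log-utilities."""
    mean_log = sum(math.log(u) for u in utilities.values()) / len(utilities)
    return math.exp(mean_log)


best_by_sum = max(outcomes, key=lambda name: total(outcomes[name]))
best_by_product = max(outcomes, key=lambda name: geometric_mean(outcomes[name]))

print("sum picks:", best_by_sum)
print("product picks:", best_by_product)
```

Under the sum, one person’s very large utility outweighs everyone else’s losses; under the product, any individual being left near zero drags the whole score down, which is why the utility monster loses its leverage.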

And that result is, theoretically,

  • A utility function that humanity-as-a-whole could be said to have been (very roughly) maximizing throughout its history.

  • A utility function that something like CEV might spit out.

  • A utility function whose maximization would rank high by most individual humans’ preferences/utility functions.

  • A utility function we could hook up to a wrapper-mind, and then be happy with the result.

  1. ^

    I’m sure you can come up with less tame examples from, say, politics or social issues. Fill them in as needed.

  2. ^

    Well, that was a simplified description of the process. In practice, you’d need to mix these steps up repeatedly.