Here’s an idea about how to formally specify society-wide optimization, given that we know the utility function of each individual. In particular, it might be useful for multi-user AI alignment.
A standard tool for this kind of problem is Nash bargaining. The main problem with it is that it’s unclear how to choose the BATNA (disagreement point). Here’s why some simple proposals don’t work:
One natural BATNA for any game is assigning each player their maximin payoff. However, for a group of humans this means something horrible: Alice’s maximin is a situation in which everyone except Alice is doing their best to create the worst possible world for Alice. This seems like an unhealthy and unnatural starting point.
Another natural BATNA is the world in which no humans exist at all. The problem with this is: suppose there is one psychopath who for some reason prefers everyone not to exist. Then, there are no Pareto improvements over the BATNA, and therefore this empty world is already the “optimum”. The same problem applies to most choices of BATNA.
Here is my proposal. We define the socially optimal outcome by recursion over the number of people n. For n=1, we obviously just optimize the utility function of the lone person. For a set of people P of cardinality n>1, consider any given i∈P. The BATNA payoff of i is defined to be the minimum over all j∈P of the payoff of i in the socially optimal outcome of P∖j (we consider worlds in which j doesn’t exist). If there are multiple optimal outcomes, we minimize over them. Typically the minimum is achieved for j=i, but we can’t just set j=i in the definition; we need the minimization to make sure that the BATNA is always admissible[1]. We then do Nash bargaining with respect to this BATNA.
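To make the recursion concrete, here is a minimal computational sketch for the finite case, treating a world as simply the set of people who exist and allowing lotteries over worlds. The implementation details (the function names, the use of scipy’s SLSQP solver, and the omission of the tie-breaking minimization over multiple optimal outcomes) are illustrative assumptions, not part of the proposal itself:

```python
from itertools import combinations

import numpy as np
from scipy.optimize import minimize


def powerset(people):
    """All subsets of `people`, as frozensets."""
    s = list(people)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]


def expected_payoff(person, worlds, probs, utilities):
    """Expected utility of `person` under a lottery over `worlds`."""
    return sum(p * utilities[person](w) for w, p in zip(worlds, probs))


def nash_bargain(members, worlds, utilities, batna):
    """Lottery over `worlds` maximizing the product of gains over the BATNA.

    Not a robust solver; SLSQP from a uniform start is enough for tiny examples.
    """
    members = sorted(members)
    payoffs = np.array([[utilities[i](w) for w in worlds] for i in members])
    d = np.array([batna[i] for i in members])
    m = len(worlds)
    res = minimize(
        lambda p: -np.prod(payoffs @ p - d),  # negative Nash product
        x0=np.full(m, 1.0 / m),
        bounds=[(0.0, 1.0)] * m,
        constraints=[
            {"type": "eq", "fun": lambda p: p.sum() - 1.0},      # probabilities sum to 1
            {"type": "ineq", "fun": lambda p: payoffs @ p - d},  # nobody below their BATNA
        ],
        method="SLSQP",
    )
    return list(res.x)


def social_optimum(society, utilities):
    """Return (worlds, probs): an optimal lottery over subsets of `society`.

    `utilities[i]` maps a world (a frozenset of existing people) to i's payoff,
    and must be defined even on worlds that don't contain i.  The extra
    minimization over multiple optimal outcomes is not implemented here.
    """
    society = frozenset(society)
    worlds = powerset(society)  # a world = the set of people who exist
    if len(society) == 1:
        (i,) = society
        best = max(worlds, key=utilities[i])
        return worlds, [1.0 if w == best else 0.0 for w in worlds]

    # BATNA of i: the minimum over j of i's payoff in the socially optimal
    # outcome of society \ {j}, i.e. in worlds where j doesn't exist.
    batna = {}
    for i in society:
        payoffs_without_j = []
        for j in society:
            sub_worlds, sub_probs = social_optimum(society - {j}, utilities)
            payoffs_without_j.append(expected_payoff(i, sub_worlds, sub_probs, utilities))
        batna[i] = min(payoffs_without_j)

    return worlds, nash_bargain(society, worlds, utilities, batna)
```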
Good properties of this proposal:
The outcome is Pareto efficient. It is also “fair” in the sense that the specification is rather natural and symmetric.
The only especially strong assumption needed to make sense of the definition is the ability to consider worlds in which some people don’t exist[2]. For example, we don’t need anything like transferable utility or money. [EDIT: See child comment for a discussion of removing this assumption.]
AFAICT threats don’t affect the outcome, since there’s no reference to minimax or Nash equilibria.
Most importantly, it is resistant to outliers:
For example, consider a world with a set S of selfish people and one psychopath, whom we denote y. The outcome space is 2^(S⊔{y}): each person either exists or not. A selfish person gets payoff 1 for existing and payoff 0 for not existing. The psychopath’s payoff is minus the number of people who exist. Let n be the cardinality of S. Then we can check that the socially optimal outcome gives each selfish person a payoff of n/(n+1) (i.e. they exist with this probability).
In the above example, if we replace the selfish people with altruists (whose utility function is the number of altruists that exist), the outcome is even better. The expected number of existing altruists is (1 − 1/(n+1)!)·n.
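A quick numeric check of the first (selfish people plus psychopath) example, using the illustrative sketch above; it additionally assumes that the psychopath’s payoff counts everyone who exists, themselves included:

```python
# Uses social_optimum / expected_payoff from the illustrative sketch above
# (so this snippet is not standalone).
n = 2
selfish = [f"s{k}" for k in range(n)]
utilities = {i: (lambda w, i=i: 1.0 if i in w else 0.0) for i in selfish}
utilities["y"] = lambda w: -float(len(w))  # the psychopath

worlds, probs = social_optimum(set(selfish) | {"y"}, utilities)
for i in selfish:
    print(i, round(expected_payoff(i, worlds, probs, utilities), 3))
# Each selfish person's payoff should come out near n/(n+1) = 2/3,
# up to solver tolerance.
```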
Using Nash with maximin as the BATNA has some big advantages:
It really motivates bargaining, as there are usually Pareto improvements that are obvious, and near-Pareto improvements beyond even that.
It’s literally impossible to do worse for any given individual. If you’re worried about the experience of the most unlucky/powerless member, this ensures you won’t degrade it with your negotiation.
I’m trying to compare your proposal to https://en.wikipedia.org/wiki/Shapley_value. On the surface, it seems similar: consider sub-coalitions to determine counterfactual contribution (it doesn’t matter what the contribution unit is; any linearly aggregatable quantity, whether utility or dollars, should work).
I do worry a bit that in both the Shapley value and your system, it is acceptable to disappear people; the calculation where they don’t exist seems problematic when applied to actual people. It has the nice property of ignoring “outliers” (really, negative-value lives), but that’s only a nice property in theory; it would be horrific if actually applied.
It really motivates bargaining, as there are usually Pareto improvements that are obvious, and near-Pareto improvements beyond even that.
I couldn’t really parse this. What does it mean to “motivate bargaining” and why is it good?
If you’re worried about the experience of the most unlucky/powerless member, this ensures you won’t degrade it with your negotiation.
In practice, it’s pretty hard for a person to survive on their own, so usually not existing is at least as good as the minimax (or at least it’s not that much worse). It can actually be way, way better than the minimax, since the minimax implies every other person doing their collective best to make things as bad as possible for this person.
There is a huge difference: Shapley value assumes utility is transferable, and I don’t.
I do worry a bit that in both the Shapley value and your system, it is acceptable to disappear people; the calculation where they don’t exist seems problematic when applied to actual people. It has the nice property of ignoring “outliers” (really, negative-value lives), but that’s only a nice property in theory; it would be horrific if actually applied.
By “outliers” I don’t mean negative-value lives, I mean people who want everyone else to die and/or to suffer.
It is not especially acceptable in my system to disappear people: it is an outcome that is considered, but it only happens if enough people have a sufficiently strong preference for it. I do agree it might be better to come up with a system that somehow discounts “nosy” preferences, i.e. doesn’t put much weight on what Alice thinks Bob’s life should look like when it contradicts what Bob wants.
By “motivate bargaining”, I meant that humans aren’t rational utility maximizers, and the outcomes they will seek and accept are different, depending on the framing of the question. If you tell them that the rational baseline is low (and prove it using a very small set of assumptions), they’re more likely to accept a wider range of better (but not as much better as pure manipulation might give them) outcomes.
By negative-value lives, I meant negative to the aggregate you’re maximizing, not negative to themselves. Someone who gains by others’ suffering necessarily reduces the sum. The assumption that not existing is an acceptable outcome to those participants still feels problematic to me, but I do agree that eliminating unpleasant utility curves makes the problem tractable.
When people are basic ontological entities for a decision theory, there is an option of setting up platonic worlds/environments for them and for interactions between their collections. This needs to add up to what happens in the physical world, but the intermediate constructions can run wild with many abstract/platonic/simulated worlds, for purposes of being valued by their preferences.
I didn’t get anything specific/nice this way, but it’s the way I’m thinking about boundaries: that an agent’s viscera should be its own sovereign/private platonic world, rather than something like a region of space that’s shared with other agents, or the agent’s own internal details. And the physical world, or other worlds defined for interaction between agents, serve as boundaries between the agents, by virtue of reasoning about them and their viscera worlds in restricted ways that the boundary worlds as a whole precommit to respect.
It is possible to get rid of the need to consider worlds in which some players don’t exist, by treating P∖j as optimization for a subset of players. This can be meaningful in the context of a single entity (e.g. the AI) optimizing for the preferences of P∖j, or in the context of game theory, where we interpret it as having all players coordinate in a manner that optimizes for the utilities of P∖j (in the latter context, it makes sense to first discard any outcome that assigns a below-minimax payoff to any player[1]). The disadvantage is that this admits BATNAs in which some people get worse-than-death payoffs (because of adversarial preferences of other people). On the other hand, it is still “threat resistant” in the sense that the mechanism itself doesn’t generate any incentive to harm people.
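A minimal sketch of one possible reading of this variant, written as a delta against the illustrative code in the post above (it reuses expected_payoff and nash_bargain from there, so it isn’t standalone). Interpreting “optimizing for P∖j” as the same recursion run over a fixed outcome space, and leaving out the below-minimax filtering, are both assumptions:

```python
# The outcome space is now fixed: nobody is removed from any world.  The
# recursion instead restricts *whose* utilities are being optimized for.
def social_optimum_for(players, outcomes, utilities):
    """Optimal lottery over a fixed `outcomes` list, optimizing only for `players`."""
    players = frozenset(players)
    if len(players) == 1:
        (i,) = players
        best = max(outcomes, key=utilities[i])
        return [1.0 if o == best else 0.0 for o in outcomes]

    # BATNA of i: min over j of i's payoff when everyone optimizes for players \ {j}.
    batna = {}
    for i in players:
        vals = []
        for j in players:
            sub_probs = social_optimum_for(players - {j}, outcomes, utilities)
            vals.append(expected_payoff(i, outcomes, sub_probs, utilities))
        batna[i] = min(vals)

    return nash_bargain(players, outcomes, utilities, batna)
```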
It would be interesting to compare this with Diffractor’s ROSE point.
Regarded as a candidate definition for a fully general abstract game-theoretic superrational optimum, this still seems lacking, because the minimax in a game of more than two players seems too weak. Maybe there is a version based on some notion of “coalition minimax”.
[1] “Admissible” in the sense that there exists a payoff vector which is a Pareto improvement over the BATNA and is actually physically realizable.
[2] We also need to assume that we can actually assign utility functions to people, but I don’t consider it a “strong assumption” in this context.