Incorporating Justice Theory into Decision Theory
When someone wrongs us, how should we respond? We want to discourage this behavior, so that others find it in their interest to treat us well. And yet the goal should never be to “do something unpleasant to them” for its deterrent effect. I’m persuaded by Yudkowsky’s take (the source contains spoilers for Project Lawful, but it’s here):
If at any point you’re calculating how to pessimize a utility function, you’re doing it wrong. If at any point you’re thinking about how much somebody might get hurt by something, for a purpose other than avoiding doing that, you’re doing it wrong.
In other words, when someone is wronged, we want to search over ways to repair the harm done to them and prevent similar harm from happening in the future, rather than searching over ways to harm the perpetrator in return. If we require that a person who harms another pay some or all of the costs involved in repairing that harm, that also helps to align their incentives and discourages people from inefficiently harming each other in the first place.
Restitution and Damages
Our legal systems have all sorts of tools for handling these situations, and I want to point to two of them: restitution and damages. Restitution covers cases where one party is enriched at another’s expense. Damages cover situations where one party causes a loss or injury for another. Ideally, we’d like to make the wronged party at least as well-off as if they hadn’t been wronged in the first place.
Sometimes, a wronged party can be made whole. If SpaceX drops a rocket on my car, there’s an amount of money they could pay me where I feel like my costs have been covered. If SpaceX drops a rocket on an irreplaceable work of art or important landmark, there’s no amount of money that can make the affected parties whole. That’s not to say they shouldn’t pay compensation and do their best to repair the harm anyway. But some losses are irreversible, like the loss of something irreplaceable. And some losses are reversible, like the financial loss of a replaceable car and the costs associated with replacing it.
In a previous post, we looked at a couple of example games, where if Alice treats “Bob Defecting while Alice Cooperates” as creating a debt between them, she can employ a policy which incentivizes Bob to Cooperate and receive a fair share of the socially optimal outcome. And if Bob employs the same policy, this stabilizes that outcome as part of a Nash equilibrium. Importantly, the penalty Bob experiences for not treating Alice according to her notion of fairness is limited rather than unlimited.
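To make the debt framing concrete, here is a minimal Python sketch. The payoff matrix, DEBT_PER_EXPLOIT, and MAX_DEBT are hypothetical placeholders rather than the numbers from the earlier post; the point is just that Alice's policy books a bounded debt when she is exploited and returns to cooperation once the debt is repaid.

```python
# Hypothetical toy model of the "defection creates a limited debt" policy.
# Payoff values, DEBT_PER_EXPLOIT, and MAX_DEBT are illustrative placeholders.
PAYOFFS = {  # (alice_move, bob_move) -> (alice_payoff, bob_payoff)
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
DEBT_PER_EXPLOIT = 2  # debt Alice books when Bob defects against her cooperation
MAX_DEBT = 4          # the penalty is capped: limited, not unlimited

def alice_policy(debt):
    """Cooperate whenever Bob's outstanding debt has been worked off."""
    return "C" if debt <= 0 else "D"

def play(bob_moves):
    debt = 0
    totals = [0, 0]
    for bob_move in bob_moves:
        alice_move = alice_policy(debt)
        a, b = PAYOFFS[(alice_move, bob_move)]
        totals[0] += a
        totals[1] += b
        if alice_move == "C" and bob_move == "D":
            debt = min(MAX_DEBT, debt + DEBT_PER_EXPLOIT)
        elif alice_move == "D":
            debt -= 1  # each round of reduced payoff pays down part of the debt
    return tuple(totals)

print(play(["C"] * 6))                       # steady cooperation: (18, 18)
print(play(["D", "C", "C", "C", "C", "C"]))  # one exploit, then a limited penalty: (19, 14)
```

Under these placeholder numbers, Bob's single defection leaves him worse off than steady cooperation would have, so cooperating is in his interest, and the penalty ends as soon as the debt is repaid.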
We also looked at how Alice might enforce debts owed to Carol when interacting with Bob, and this can lend social legitimacy and help ensure these debts actually get paid. One function of governments is to create common knowledge around what actions lead to how much debt, and how this debt will be enforced. I claim that the economic concept of externalities is a useful lens for determining how much debt if any is created by an action.
Voluntarism and Counterfactual Negotiation
Suppose that you are in a position to gain $100,000 at the expense of $20,000 to someone else. Should you? It might be justified on utilitarian grounds, if you gain more utility than they lose. But it’s clearly not a Pareto improvement over doing nothing.
One major theme of Gaming the Future is that we should generally prefer voluntary interactions over involuntary interactions. A voluntary interaction is one where all parties involved could meaningfully “opt-out” if they wanted to. And so if the interaction takes place at all, it’s because all parties prefer it to happen. In other words, voluntary interactions lead to Pareto improvements.
I think one relevant question for deciding whether to profit at someone else’s expense is “what would it take for them to agree to that?” For example, the other person might reasonably ask that they be compensated for their $20,000 loss. And that the remaining $80,000 be split equally between both parties.
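As a minimal sketch of that split, using the numbers above (the function name is mine, purely for illustration):

```python
def compensate_then_split(gain, loss):
    """Reverse the other party's loss, then split the remaining surplus equally.

    Returns (my net benefit, their net benefit) relative to nothing happening.
    """
    surplus = gain - loss            # net social surplus: $80,000 in the example above
    payment = loss + surplus / 2     # make them whole, plus half the surplus
    return gain - payment, payment - loss

print(compensate_then_split(100_000, 20_000))  # (40000.0, 40000.0)
```

Both parties end up $40,000 better off than if nothing had happened, which is the fair Pareto improvement an actual negotiation would presumably have aimed for.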
Ideally, all parties would be able to actually negotiate this before the decision was made. This internalizes the negative externality by bringing all affected parties into the decision-making process. But suppose that it’s impractical to negotiate ahead of time. I would still be in favor of a social norm that “some reversible losses may be imposed involuntarily on others, so long as those losses are indeed reversed and the remaining surplus is split fairly.” This also internalizes the negative externality, and leads to fair Pareto improvements.
Importantly, we probably don’t want to internalize all negative externalities. If Alice doesn’t like Bob’s haircut, this doesn’t mean Bob should have to pay damages to Alice. If Alice dislikes Bob’s haircut more than Bob likes it, there is an opportunity for Alice to pay Bob to change his hair and split the resulting economic surplus fairly. But the concept of boundaries helps to define who gets to make what decisions unilaterally in the absence of a negotiated agreement without incurring debt to others.
Parfit’s Hitchhiker and Positive Externalities
Suppose that you are in a position to create a huge financial windfall for someone else, but it requires that you pay a noticeable-but-much-smaller cost yourself. Should you? The utilitarians are still saying yes, but this is also not a Pareto improvement over doing nothing. And worse, now your personal interests seem to be misaligned with the socially optimal action.
Arguably, you should just help them because you personally are better off if people generally pay small costs to bring others huge benefits, and their decisions are correlated with your decision. But also arguably, the beneficiaries should pay those costs and then some, to internalize the positive externality and align their benefactor’s local incentives with their own. If there were enough sensible decision theory in the water supply, everyone would find it intuitively obvious that you should pay Paul Ekman to drive you out of the desert. There are times when a person actively prefers not to be compensated for their help. (“That was a lovely meal, grandma, what do I owe you?”) But especially when there are significant costs for doing the socially optimal thing, we should generally arrange for those costs to be paid and then some.
And again, just because Alice likes Bob’s haircut, that doesn’t necessarily mean she owes him compensation. It seems fair for boundaries to work both ways. Alice can offer to pay Bob to choose the way she likes, but it’s his choice in the property-rights sense of ownership.
Insurance as a Substitute for Good Decision-Making
In a saner world, it might be intuitively obvious that “you should repay those that create windfalls for you, conditional on their prediction that you would repay them.” That world might have already done the hard work of creating the common knowledge that nearly everyone would reason this way, repay their benefactors, and split the remaining surplus fairly.
Until we build dath ilan, there are incremental steps we can take to internalize externalities and align individual incentives with our collective interests. In the United States, drivers are required to carry insurance which will cover some or all of the damages caused by their driving. We can expand this to a requirement that everyone carry liability insurance, to internalize negative externalities more generally. Similarly, in the United States everyone is required to carry health insurance. We could expand this requirement to other types of help one could receive, to internalize more positive externalities. I literally carry “being airlifted out of the desert” insurance because I work with medical and fire teams for festivals out in the desert, and my normal health insurance doesn’t cover those sorts of evacuations.
What makes a decision into an externality is that it affects someone who wasn’t involved in making the decision. A first approach to internalizing externalities might be to literally bring affected parties into the decision-making process, or imagine what it would have taken for all parties to agree. Both of these fail when dealing with holdouts, who insist on unfairly high gains from the interaction. But we can still treat others according to our own notions of fairness, or a standardized consensus notion of fairness if this is even more generous. And that seems like a pretty good way to calculate the amount of debt incurred if someone receives worse treatment than fairness requires.
Suppose that my $100,000 comes from finding ways to better serve the customers of my business, and someone else’s $20,000 loss comes from customers forsaking their competing business to patronise mine. I do not think that I owe the other business anything.
It would certainly be interesting if you did, and would probably promote competition a lot more than is currently the case. On the other hand, measuring and attributing those effects would be extremely difficult and almost certainly easy to game.
I do not see how competition would be improved by requiring everyone who builds a better mousetrap to subsidise the manufacturers of inferior goods. One might as well require the winners of races to distribute their prize money to the losers.
I think we agree that in cases where competition is leading to good results, no change to the dynamics is called for.
We probably also agree on a lot of background value judgements like “when businesses become more competitive by spending less on things no one wants, like waste or pollution, that’s great!” And “when businesses become more competitive by spending less on things people want, like fair wages or adequate safety, that’s not great and intervention is called for.”
One case where we might literally want to distribute resources from the makers of a valuable product, to their competitors and society at large, is the development of Artificial General Intelligence (AGI). One of the big causes for concern here is that the natural dynamics might be winner-take-all, leading to an arms race that sacrifices spending on safety in favor of spending on increased capabilities or an earlier launch date.
If instead all AGI developers believed that the gains from AGI development would be spread out much more evenly, this might help to divert spending away from increasing capabilities and deploying as soon as possible, and towards making sure that deployment is done safely.
Many AI firms have already voluntarily signed Windfall Clauses, committing to share significant fractions of the wealth generated by successful AGI development. EDIT: At the time of writing, it looks like Windfall Clauses have been advocated for but not adopted. Thank you Richard_Kennaway for the correction!
I had not heard of the FHI’s Windfall Clause, but looking on the Internet, I don’t see signs of anyone signing up to it yet. Metaculus has a still-open prediction market on whether any major AI company will do so by the end of 2025.
Oops, when I heard about it I’d gotten the impression that this had been adopted by at least one AI firm, even a minor one, but I also can’t find anything suggesting that’s the case. Thank you!
It looks like OpenAI has split into a nonprofit organization and a “capped-profit” company.
OpenAI Nonprofit could act like the Future of Humanity Institute’s proposed Windfall Trust, and a binding commitment to do so would be a Windfall Clause. They could also do something else prosocial with those profits, consistent with their nonprofit status.
Wait—now there’s an authority who can distinguish between good efficiency and bad efficiency? That’s a pretty long jump from the post about how individual agents should approach incentives and retribution for other agents.
This might be a miscommunication, I meant something like “you and I individually might agree that some cost-cutting measures are good and some cost-cutting measures are bad.”
Agents probably also have an instrumental reason to coordinate on defining and enforcing standards for things like fair wages and adequate safety, where some agents might otherwise have an incentive to enrich themselves at the expense of others.
I’m confused. Are you and/or I the people making a given company’s competitiveness decisions? My point is there’s a very tenuous jump from us making decisions to how/whether to enforce our preferences on others.
The framings of “fair”, “justice”, “good” and “bad” aren’t well-defined in terms of rationality or game theory. There is no “standardized consensus notion of fairness”. MOST actions are good for some individuals, bad for some, and neutral to a whole shitload.
Cost-cutting by firing someone and expecting the remaining workers to work a little more efficiently is a very good example. It’s good for the owners of the company, good for customers who focus on price, neutral to bad for customers who focus on service/touch (which requires more workers), good for workers who’d otherwise fare worse if the company went out of business, bad for workers who have to work harder for the same pay, and neutral to 9 billion uninvolved humans.
It’s VERY unclear how or whether this framework applies to any of the stakeholders.
I think the big link I would point to is “politics/economics.” The spherical cows in a vacuum model of a modern democracy might be something like “a bunch of agents with different goals, that use voting as a consensus-building and standardization mechanism to decide what rules they want enforced, and contribute resources towards the costs of that enforcement.”
When it comes to notions of fairness, I think we agree that there is no single standard which applies in all domains in all places. I would frame it as an XKCD 927 situation, where there are multiple standards being applied in different jurisdictions, and within the same jurisdiction when it comes to different domains. (E.g. restitution vs damages.)
When it comes to a fungible resource like money or pie, I believe Yudkowsky’s take is “a fair split is an equal split of the resource itself.” One third each for three people deciding how to split a pie. There are well-defined extensions for different types of non-fungibility, and the type of “fairness” achieved seems to be domain-specific.
There are also results in game theory regarding “what does a good outcome for bargaining games look like?” These are also well-defined, and requiring different axioms leads to different bargaining solutions. My current favorite way of defining “fairness” for a bargaining game is the Kalai-Smorodinsky bargaining solution. At the meta-level I’m more confident about the attractive qualities of Yudkowsky’s probabilistic rejection model, which include working pretty well even when participants disagree about how to define “fairness”, and not giving anyone an incentive to exaggerate what they think is fair for them to receive. (The source might contain spoilers for Project Lawful, but Yudkowsky describes the probabilistic rejection model here, and I discuss it more here.)
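For concreteness, here is a rough numerical sketch of the Kalai-Smorodinsky solution for a dollar-splitting game. The utility functions, the (0, 0) disagreement point, and the “$60 cap” example are all hypothetical; the defining property used is that both players end up with the same fraction of their best feasible payoff.

```python
def kalai_smorodinsky_split(total, u_a, u_b, steps=100_000):
    """Numerically locate the Kalai-Smorodinsky split of `total` dollars.

    u_a, u_b: each player's utility as a function of the dollars they receive.
    Assumes a (0, 0) disagreement point and monotone utilities, so the KS
    solution is the division where both players achieve the same fraction
    of their ideal (maximum feasible) utility.
    """
    ideal_a, ideal_b = u_a(total), u_b(total)
    best = None
    for i in range(steps + 1):
        a_dollars = total * i / steps
        ua, ub = u_a(a_dollars), u_b(total - a_dollars)
        gap = abs(ua / ideal_a - ub / ideal_b)  # how unequal the two fractions are
        if best is None or gap < best[0]:
            best = (gap, a_dollars, ua, ub)
    return best[1:]

# Hypothetical example: splitting $100 where Bob can only benefit from $60 of it.
print(kalai_smorodinsky_split(100, u_a=lambda x: x, u_b=lambda x: min(x, 60)))
# ≈ (62.5, 62.5, 37.5): each player gets 62.5% of the most they could have gotten
```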
Applying Yudkowsky’s Algorithm to the labor scenario you described might look like having more fairness-oriented negotiations about “under what circumstances a worker can be fired”, “what compensation fired workers can expect to receive”, and “how much additional work can other workers be expected to perform without an increase in marginal compensation rate.” That negotiation might happen at the level of individual workers, unions, labor regulations, or a convoluted patchwork of those and more. I think historically we’ve made significant gains in defining and enforcing standards for things like fair wages and adequate safety.
I love the probabilistic rejection idea—it’s clever and fun. But it depends a LOT on communication or repetition-with-identity so the offerer has any clue that’s the algorithm in play. And in that case, the probabilistic element is unnecessary—simple precommitment is enough (and, in strictly-controlled games without repetition, allowing the responder to publicly and enforceably precommit just reverses the positions).
I think our main disagreement is on what to do when one or more participants in one-shot (or fixed-length) games are truly selfish, and the payouts listed are fully correct in utility, after accounting for any empathy or desire for fairness. Taboo “fair”, and substitute “optimizing for self”. Shapley values are a good indicator of bargaining power for some kinds of game, but the assumption of symmetry is hard to justify.
Totally! One of the most impressive results I’ve seen for one-shot games is the Robust Cooperation paper studying the open-source Prisoner’s Dilemma, where each player delegates their decision to a program that will learn the exact source code of the other delegate at runtime. Even utterly selfish agents have an incentive to delegate their decision to a program like FairBot or PrudentBot.
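For flavor, here is a crude, depth-limited simulation sketch of the FairBot idea. The real construction in the Robust Cooperation paper works by bounded proof search over the opponent’s source code in provability logic; this toy just passes strategies around as callables and cuts off the regress with a depth limit, so treat it as an illustration rather than the paper’s method.

```python
from enum import Enum

class Move(Enum):
    C = "cooperate"
    D = "defect"

def fair_bot(opponent, depth=3):
    """Cooperate iff a depth-limited simulation says the opponent cooperates with us.

    Only a loose stand-in for the paper's proof-based FairBot: each program
    receives the other as a callable rather than as literal source code, and
    the infinite regress is cut off by an optimistic base case.
    """
    if depth == 0:
        return Move.C
    return Move.C if opponent(fair_bot, depth - 1) == Move.C else Move.D

def defect_bot(opponent, depth=3):
    return Move.D

def cooperate_bot(opponent, depth=3):
    return Move.C

print(fair_bot(fair_bot))       # Move.C  (mutual cooperation)
print(fair_bot(defect_bot))     # Move.D  (defectors don't get exploited cooperation)
print(fair_bot(cooperate_bot))  # Move.C
```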
I think the probabilistic element helps to preserve expected utility in cases where the demands from each negotiator exceed the total amount of resources being bargained over. If each precommits to demand $60 when splitting $100, deterministic rejection leads to ($0, $0) with 100% probability. Whereas probabilistic rejection calls for the evaluator to accept with probability slightly less than $40/$60 ≈ 66.67%. Accepting leads to a payoff of ($60, $40), so the expected payoffs are slightly less than ($40, $26.67).
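A small sketch of that arithmetic, with the acceptance rule written the way I understand the probabilistic rejection model (variable names are mine):

```python
def acceptance_probability(offer_to_me, my_fair_share, total, epsilon=1e-9):
    """Accept fair-or-better offers outright; otherwise accept with probability
    just low enough that the proposer's expected take falls slightly below
    what I consider fair *for them*, so over-demanding never pays in expectation."""
    if offer_to_me >= my_fair_share:
        return 1.0
    proposer_fair = total - my_fair_share   # what I think is fair for the proposer
    proposer_demand = total - offer_to_me   # what the proposer actually demanded
    return proposer_fair / proposer_demand - epsilon

# Both sides think $60 is their fair share of $100, so the proposer demands $60.
p = acceptance_probability(offer_to_me=40, my_fair_share=60, total=100)
print(round(p, 4))                          # ≈ 0.6667
print(round(60 * p, 2), round(40 * p, 2))   # expected payoffs just under $40 and $26.67
```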
I think there are also totally situations where the asymmetrical power dynamics you’re talking about mean that one agent gets to dictate terms and the other gets what they get. Such as “Alice gets to unilaterally decide how $100 will be split, and Bob gets whatever Alice gives him.” In the one-shot version of this with selfish players, Alice just takes the $100 and Bob gets $0. Any hope for getting a selfish Alice to do anything else is going to come from incentives beyond this one interaction.
That sounds reasonable to me! This could be another negative externality that we judge to be acceptable, and that we don’t want to internalize. Something like “if you break any of these rules, (e.g. worker safety, corporate espionage, etc.) then you owe the affected parties compensation. But as long as you follow the rules, there is no consensus-recognized debt.”
I think this is very often true, but not actually universal. There are LOTS of cases where repairing the harm is impossible, and more importantly, cases where we want to disincentivize the behavior by MORE than the harm this one instance caused.
A difficult case for your theory is when a child skips school or verbally abuses a classmate. Actual punishment is (often) the right response. Likewise thefts that are rarely caught—if you want it to be rare, you need to punish enough for all the statistical harms that the caught criminal is, on average, responsible for. And since they don’t have enough to actually make it right by repaying, you punish them by other means.
The missing piece from this sequence is “power”. You’re picking games that have a symmetry or setup which makes your definition of “fair” enforceable. For games without these mechanisms, the rational outcomes don’t end up that pleasant. Except sometimes, with players who have extra-rational motives.
Add to this the fact that the human population varies WIDELY on these dimensions, and many DO seek retribution as the primary signal (for their behaviors, and to impose their preferences on others).
I think we agree that if a selfish agent needs to be forced to not treat others poorly, in the absence of such enforcement they will treat others poorly.
It also seems like in many cases, selfish agents have an incentive to create exactly those mechanisms ensuring good outcomes for everyone, because it leads to good outcomes for them in particular. A nation-state comprised entirely of very selfish people would look a lot different from any modern country, but they face the same instrumental reasons to pool resources to enforce laws. The more inclined their populace is towards mistreatment in the absence of enforcement, the higher those enforcement costs need to be in order to achieve the same level of good treatment.
I also think “fairness” is a Schelling point that even selfish agents can agree to coordinate around, in a way that they could never be aligned on “maximizing Zaire’s utility in particular.” They don’t need to value fairness directly to agree that “an equal split of resources is the only compromise we’re all going to agree on during this negotiation.”
So I think my optimism comes from at least two places:
Even utterly selfish agents still have an incentive to create mechanisms enforcing good outcomes for everyone.
People have at least some altruism, and are willing to pay costs to prevent mistreatment of others in many cases.
Skipping school is a victimless action, and I’d expect that searching over ways to cause the person to learn would almost never use punishment as the response, because it doesn’t produce an incentive to learn. School incentive design is an active area of political discussion in the world right now; e.g., I’m a fan of the Human Restoration Project for their commentary on and sketches of such things.
On the main topic, I might agree with you; not sure yet. It seems to me the punishment case should be generated by searching for how to prevent recurrence, in any circumstance where there isn’t another way to prevent recurrence, right?
Not quite. How to minimize similar choices in future equilibrium, maybe. In many cases, how to maximize conformance and compliance to a set of norms, rather than just this specific case. In real humans (not made-up rationalist cooperators), it includes how to motivate people to behave compatibly with your worldview, even though they think differently enough from you that you can’t fully model them. Or don’t have the bandwidth to understand them well enough to convince them. Or don’t have the resources to satisfy their needs such that they’d be willing to comply.
but I don’t see how that precludes searching for alternatives to retribution first
I don’t mean to argue against searching for (and in fact using) alternatives. I merely mean to point out that there seem to be a lot of cases in society where we haven’t found effective alternatives to punishment. It’s simply incorrect for the OP to claim that the vision of fiction is fully applicable to the real world.
ah, I see—if it turns out OP was arguing for that, then I misunderstood something. the thing I understood OP to be saying is about the algorithm for how to generate responses—that it should not be retribution-seeking, but rather solution-seeking, and it should likely have a penalty for selecting retribution, but it also likely does need to be able to select retribution to work in reality, as you say. OP’s words, my italics:
“we want to search over ways to repair the harm done to them and prevent similar harm from happening in the future, rather than searching over ways to harm the perpetrator in return.”
implication I read: prevent similar harm is allowed to include paths that harm the perpetrator, but it’s searching over ?worldlines? based on those ?worldlines? preventing recurrence, rather than just because they harm the perpetrator.