Thank you! I think it's exactly the same kind of "conditioning my output on their output" that you were pointing to in your analogy to iterated games. And I expect there's a strong correspondence between "program equilibria where you only condition on predicted outputs" and "iterated game equilibria that can form a stable loop."
Thank you! Ideally, I think we'd all like a model of individual rationality that composes together into a nice model of group rationality. And geometric rationality seems like a promising step in that direction.
This might be a framing thing!
The background details I'd been imagining are that Alice and Bob were in essentially identical situations before their interaction, and it was just luck that Alice and Bob got the capabilities they did.
Alice and Bob have two ways to convert tokens into money, and I'd claim that any rational joint strategy involves only using Bob's way. Alice's ability to convert tokens into pennies is a red herring that any rational group should ignore.
At that point, it's just a bargaining game over how to split the $1,000,000,000. And I claim that game is symmetric, since they're both equally necessary for that surplus to come into existence.
If Bob had instead paid huge costs to create the ability to turn tokens into tens of millions of dollars, I totally think his costs should be repaid before splitting the remaining surplus fairly.
Limiting it to economic/comparable values is convenient, but also very inaccurate for all known agents: utility is private and incomparable.
I think modeling utility functions as private information makes a lot of sense! One of the claims I'm making in this post is that utility valuations can be elicited and therefore compared.
My go-to example of an honest mechanism is a second-price auction, which we know we can implement from within the universe. The bids serve as a credible signal of valuation, and if everyone follows their incentives they'll bid honestly. The person that values the item the most is declared the winner, and economic surplus is maximized.
(Assuming some background facts, which aren't always true in practice, like everyone having enough money to express their preferences through bids. I used tokens in this example so that "willingness to pay" and "ability to pay" can always line up.)
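To make that concrete, here's a minimal sketch of a sealed-bid second-price auction; the bidder names and valuations are purely illustrative.

```python
def second_price_auction(bids):
    """Sealed-bid second-price (Vickrey) auction: the highest bidder wins
    and pays the second-highest bid, which makes bidding one's true
    valuation a dominant strategy."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else 0
    return winner, price

# The highest-valuing bidder wins and pays the second-highest bid,
# so no one can gain by misreporting their valuation.
print(second_price_auction({"Alice": 5, "Bob": 15, "Carol": 8}))  # ('Bob', 8)
```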
We use the same technique when we talk about the gains from trade, which I think the Ultimatum game is intended to model. If a merchant values a shirt at $5, and I value it at $15, then there's $10 of surplus to be split if we can agree on a price in that range.
Bob values the tokens more than Alice does. We can tell because he can buy them from her at a price she's willing to accept. Side payments let us interpersonally compare valuations.
As I understand it, economic surplus isn't a subjective quantity. It's a measure of how much people would be willing to pay to go from the status quo to some better outcome. Which might start out as private information in people's heads, but there is an objective answer and we can elicit the information needed to compute and maximize it.
a purely rational Alice should not expect/demand more than $1.00, which is the maximum she could get from the best possible (for her) split without side payments.
I don't know of any results that suggest this should be true! My understanding of the classic analysis of the Ultimatum game is that if Bob makes a take-it-or-leave-it offer to Alice, where she would receive any tiny amount of money like $0.01, she should take it because $0.01 is better than $0.
My current take is that CDT-style thinking has crippled huge parts of economics and decision theory. The agreement of both parties is needed for this $1,000,000,000 of surplus to exist; if either walks away, they both get nothing. The Ultimatum game is symmetric and the gains should be split symmetrically.
If we actually found ourselves in this situation, would we actually accept $1 out of $1 billion? Is that how we'd program a computer to handle this situation on our behalf? Is that the sort of reputation we'd want to be known for?
The problem remains though: you make the ex ante call about which information to "decision-relevantly update on", and this can be a wrong call, and this creates commitment races, etc.
My understanding is that commitment races only occur in cases where "information about the commitments made by other agents" has negative value for all relevant agents. (All agents are racing to commit before learning more, which might scare them away from making such a commitment.)
It seems like updateless agents should not find themselves in commitment races.
My impression is that we don't have a satisfactory extension of UDT to multi-agent interactions. But I suspect that the updateless response to observing "your counterpart has committed to going Straight" will look less like "Swerve, since that's the best response" and more like "go Straight with enough probability that your counterpart wishes they'd coordinated with you rather than trying to bully you."
Offering to coordinate on socially optimal outcomes, and being willing to pay costs to discourage bullying, seems like a generalizable way for smart agents to achieve good outcomes.
Got it, thank you!
It seems like trapped priors and commitment races are exactly the sort of cognitive dysfunction that updatelessness would solve in generality.
My understanding is that trapped priors are a symptom of a dysfunctional epistemology, which over-weights prior beliefs when updating on new observations. This results in an agent getting stuck, or even getting more and more confident in their initial position, regardless of what observations they actually make.
Similarly, commitment races are the result of dysfunctional reasoning that regards accurate information about other agents as hazardous. It seems like the consensus is that updatelessness is the general solution to infohazards.
My current model of an "updateless decision procedure", approximated on a real computer, is something like "a policy which is continuously optimized, as an agent has more time to think, and the agent always acts according to the best policy it's found so far." And I like the model you use in your report, where an ecosystem of participants collectively optimize a data structure used to make decisions.
Since updateless agents use a fixed optimization criterion for evaluating policies, we can use something like an optimization market to optimize an agent's policy. It seems easy to code up traders that identify "policies produced by (approximations of) Bayesian reasoning", which I suspect won't be subject to trapped priors.
So updateless agents seem like they should be able to do at least as well as updateful agents. Because they can identify updateful policies, and use those if they seem optimal. But they can also use different reasoning to identify policies like "pay Paul Ekman to drive you out of the desert", and automatically adopt those when they lead to higher EV than updateful policies.
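As a rough illustration of that model, here's a minimal anytime-optimization sketch, assuming we're handed a stream of candidate policies (e.g. proposed by traders in an optimization market) and a fixed prior-based scoring function; every name here is a hypothetical placeholder, not an existing API.

```python
import time

def updateless_act(observation, candidate_policies, score_policy, deadline):
    """Anytime sketch: policies are scored by a criterion fixed in advance
    (the criterion itself never updates on the observation), the best
    policy found so far is retained, and the agent acts on it when time
    runs out."""
    best_policy, best_score = None, float("-inf")
    for policy in candidate_policies:        # e.g. proposals from traders
        if time.time() > deadline:
            break
        score = score_policy(policy)         # evaluated from the prior
        if score > best_score:
            best_policy, best_score = policy, score
    if best_policy is None:
        raise ValueError("no candidate policy was evaluated before the deadline")
    return best_policy(observation)          # a policy maps observations to actions
```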
I suspect that the generalization of updatelessness to multi-agent scenarios will involve optimizing over the joint policy space, using a social choice theory to score joint policies. If agents agree at the meta level about "how conflicts of interest should be resolved", then that seems like a plausible route for them to coordinate on socially optimal joint policies.
I think this approach also avoids the sky-rocketing complexity problem, if I understand the problem you're pointing to. (I think the problem you're pointing to involves trying to best-respond to another agent's cognition, which gets more difficult as that agent becomes more complicated.)
The distinction between "solving the problem for our prior" and "solving the problem for all priors" definitely helps! Thank you!
I want to make sure I understand the way you're using the term updateless, in cases where the optimal policy involves correlating actions with observations. Like pushing a red button upon seeing a red light, but pushing a blue button upon seeing a blue light. It seems like (See Red → Push Red, See Blue → Push Blue) is the policy that CDT, EDT, and UDT would all implement.
In the way that I understand the terms, CDT and EDT are updateful procedures, and UDT is updateless. And all three are able to use information available to them. It's just that an updateless decision procedure always handles information in ways that are endorsed a priori. (True information can degrade the performance of updateful decision theories, but updatelessness implies infohazard immunity.)
Is this consistent with the way you're describing decision-making procedures as updateful and updateless?
It also seems like if an agent is regarding some information as hazardous, that agent isn't being properly updateless with respect to that information. In particular, if it finds that it's afraid to learn true information about other agents (such as their inclinations and pre-commitments), it already knows that it will mishandle that information upon learning it. And if it were properly updateless, it would handle that information properly.
It seems like we can use that "flinching away from true information" as a signal that we'd like to change the way our future self will handle learning that information. If our software systems ever notice themselves calculating a negative value of information for an observation (empirical or logical), the details of that calculation will reveal at least one counterfactual branch where they're mishandling that information. It seems like we should always be able to automatically patch that part of our policy, possibly using a commitment that binds our future self.
In the worst case, we should always be able to do what our ignorant self would have done, so information should never hurt us.
Got it, I think I understand better the problem you're trying to solve! It's not just being able to design a particular software system and give it good priors, it's also finding a framework that's robust to our initial choice of priors.
Is it possible for every prior to converge on optimal behavior, even given unlimited observations? I'm thinking of Yudkowsky's example of the anti-Occamian and anti-Laplacian priors: the more observations an anti-Laplacian agent makes, the further its beliefs go from the truth.
I'm also surprised that dynamic stability leads to suboptimal outcomes that are predictable in advance. Intuitively, it seems like this should never happen.
It sounds like we already mostly agree!
I agree with Caspar's point in the article you linked: the choice of metric determines which decision theories score highly on it. The metric that I think points towards "going Straight sometimes, even after observing that your counterpart has pre-committed to always going Straight" is a strategic one. If Alice and Bob are writing programs to play open-source Chicken on their behalf, then there's a program equilibrium (sketched in code after the list below) where:
Both programs first try to perform a logical handshake, coordinating on a socially optimal joint policy.
This only succeeds if they have compatible notions of social optimality.
As a fallback, Alice's program adopts a policy which
Caps Bob's expected payoff at what Bob would have received under Alice's notion of social optimality
Minus an extra penalty, to give Bob an incentive gradient to climb towards what Alice sees as the socially optimal joint policy
Otherwise maximizes Alice's payoff, given that incentive-shaping constraint
Bob's fallback operates symmetrically, with respect to his notion of social optimality.
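Here's a toy Python rendering of that structure. The logical handshake is crudely modeled as comparing declared fair points (a real program equilibrium would verify this against the counterpart's source code or via proof search), the payoff matrix is assumed since the exact Chicken payoffs aren't pinned down here, and the 90% fallback frequency is just one choice that enforces the payoff cap.

```python
import random

# Assumed payoffs for one-shot Chicken, written as (Alice, Bob).
PAYOFFS = {
    ("Straight", "Straight"): (-1, -1),
    ("Straight", "Swerve"):   (2, 1),
    ("Swerve",   "Straight"): (1, 2),
    ("Swerve",   "Swerve"):   (1, 1),
}

# Alice's notion of social optimality: a 50-50 mix of the asymmetric outcomes.
ALICE_FAIR_POINT = {("Straight", "Swerve"): 0.5, ("Swerve", "Straight"): 0.5}

def alice_program(bob_declared_fair_point, shared_coin):
    """Alice's delegate: attempt the handshake, else fall back to an
    incentive-shaping policy."""
    if bob_declared_fair_point == ALICE_FAIR_POINT:
        # Handshake succeeded: use a shared random bit to pick a branch
        # of the socially optimal correlated strategy.
        joint = ("Straight", "Swerve") if shared_coin else ("Swerve", "Straight")
        return joint[0]
    # Fallback: going Straight 90% of the time holds Bob's expected payoff
    # below the 1.5 he'd get under Alice's fair point (his best reply,
    # Swerve, yields only 1.0 in expectation), at a real cost to Alice.
    return "Straight" if random.random() < 0.9 else "Swerve"
```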
The motivating principle is to treat one's choice of decision theory as itself strategic. If Alice chooses a decision theory which never goes Straight, after making the logical observation that Bob's decision theory always goes Straight, then Bob's best response is to pick a decision theory that always goes Straight and make that as obvious as possible to Alice's decision theory.
Whereas if Alice designs her decision theory to grant Bob the highest payoff when his decision theory legibly outputs Bob's part of what Alice sees as a socially optimal joint policy, then Bob's best response is to pick a decision theory that outputs Bob's part of that policy and make that as obvious as possible to Alice's decision theory.
It seems like one general recipe for avoiding commitment races would be something like:
Design your decision theory so that no information is hazardous to it
We should never be willing to pay in order to not know certain implications of our beliefs, or true information about the world
Design your decision theory so that it is not infohazardous to sensible decision theories
Our counterparts should generally expect to benefit from reasoning more about us, because we legibly are trying to coordinate on good outcomes and we grant the highest payoffs to those that coordinate with us
If infohazard resistance is straightforward, then our counterpart should hopefully have that reflected in their prior.
Do all the reasoning you want about your counterpart's decision theory
It's fine to learn that your counterpart has pre-committed to going Straight. What's true is already so. Learning this doesn't force you to Swerve.
Plus, things might not be so bad! You might be a hypothetical inside your counterpart's mind, considering how you would react to learning that they've pre-committed to going Straight.
Your actions in this scenario can determine whether it becomes factual or counterfactual. Being willing to crash into bullies can discourage them from trying to bully you into Swerving in the first place.
You might also discover good news about your counterpart, like that they're also implementing your decision theory.
If this were bad news, like for commitment-racers, we'd want to rethink our decision theory.
So we seem to face a fundamental trade-off between the information benefits of learning (updating) and the strategic benefits of updatelessness. If I learn the digit, I will better navigate some situations which require this information, but I will lose the strategic power of coordinating with my counterfactual self, which is necessary in other situations.
It seems like we should be able to design software systems that are immune to any infohazard, including logical infohazards.
If it's helpful to act on a piece of information you know, act on it.
If it's not helpful to act on a piece of information you know, act as if you didn't know it.
Ideally, we could just prove that "Decision Theory X never calculates a negative value of information". But if needed, we could explicitly design a cognitive architecture with infohazard mitigation in mind. Some options include:
An "ignore this information in this situation" flag
Upon noticing "this information would be detrimental to act on in this situation", we could decide to act as if we didn't know it, in that situation.
(I think this is one of the designs you mentioned in footnote 4.)
Cognitive sandboxes
Spin up some software in a sandbox to do your thinking for you.
The software should only return logical information that is true, and useful in your current situation
If it notices any hazardous information, it simply doesn't return it to you.
Upon noticing that a train of thought doesn't lead to any true and useful information, don't think about why that is and move on.
I agree with your point in footnote 4, that the hard part is knowing when to ignore information. Upon noticing that it would be helpful to ignore something, the actual ignoring seems easy.
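As a toy sketch of that guard (all function names here are hypothetical placeholders, not an existing API): only act on an observation when doing so has non-negative value of information by the agent's own lights, and otherwise fall back to what the ignorant self would have done.

```python
def act_with_infohazard_guard(observation, informed_policy, ignorant_policy,
                              expected_value):
    """Compare acting on the observation against acting as if it were
    never seen, and take whichever the agent's own evaluation prefers."""
    informed_action = informed_policy(observation)
    ignorant_action = ignorant_policy()
    value_of_information = (expected_value(informed_action, observation)
                            - expected_value(ignorant_action, observation))
    if value_of_information >= 0:
        return informed_action
    # Negative value of information detected: flag this branch for later
    # policy repair (e.g. a binding commitment), and act as if unobserved.
    return ignorant_action
```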
To feed back, it sounds like "thinking more about what other agents will do" can be infohazardous to some decision theories. In the sense that they sometimes handle that sort of logical information in a way that produces worse results than if they didn't have that logical information in the first place. They can sometimes regret thinking more.
It seems like it should always be possible to structure our software systems so that this doesn't happen. I think this comes at the cost of not always best-responding to other agents' policies.
In the example of Chicken, I think that looks like first trying to coordinate on a correlated strategy, like a 50-50 mix of (Straight, Swerve) and (Swerve, Straight). (First try to coordinate on a socially optimal joint policy.)
Supposing that failed, our software system could attempt to troubleshoot why, and discover that their counterpart has simply pre-committed to always going Straight. Upon learning that logical fact, I don't think the best response is to best-respond, i.e. Swerve. If we're playing True Chicken, it seems like in this case we should go Straight with enough probability that our counterpart regrets not thinking more and coordinating with us.
It's certainly not looking very likely (> 80%) that … in causal interactions [most superintelligences] can easily and "fresh-out-of-the-box" coordinate on Pareto optimality (like performing logical or value handshakes) without falling into commitment races.
What are some obstacles to superintelligences performing effective logical handshakes? Or equivalently, what are some necessary conditions that seem difficult to bring about, even for very smart software systems?
(My understanding of the term "logical handshake" is as a generalization of the technique from the Robust Cooperation paper. Something like "I have a model of the other relevant decision-makers, and I will enact my part of the joint policy if I'm sufficiently confident that they'll all enact their parts of it." Is that the sort of decision procedure that seems likely to fall into commitment races?)
Thank you! I'm interested in checking out earlier chapters to make sure I understand the notation, but here's my current understanding:
There are 7 axioms that go into Joyce's representation theorem, and none of them seem to put any constraints on the set of actions available to the agent. So we should be able to ask a Joyce-rational agent to choose a policy for a game.
My impression of the representation theorem is that its expected-utility formula can represent a variety of decision theories, including ones like CDT which are dynamically inconsistent: they have a well-defined answer to "what do you think is the best policy?", and it's not necessarily consistent with their answer to "what are you actually going to do?"
So it seems like the axioms are consistent with policy optimization, and they're also consistent with action optimization. We can ask a decision theory to optimize a policy using an analogous expression.
It seems like we should be able to get a lot of leverage by imposing a consistency requirement that these two expressions line up. It shouldn't matter whether we optimize over actions or policies; the actions taken should be the same.
I don't expect that fully specifies how to calculate the relevant counterfactual data structures, even with Joyce's other 7 axioms. But the first 7 didn't rule out dynamic or counterfactual inconsistency, and this should at least narrow our search down to decision theories that are able to coordinate with themselves at other points in the game tree.
Totally! The ecosystem I think you're referring to is all of the programs which, when playing Chicken with each other, manage to play a correlated strategy somewhere on the Pareto frontier between (1,2) and (2,1).
Games like Chicken are actually what motivated me to think in terms of "collaborating to build mechanisms to reshape incentives." If both players choose their mixed strategies separately, there's an equilibrium where each independently mixes between Straight and Swerve. But sometimes this leads to (Straight, Straight) or (Swerve, Swerve), leaving both players worse off in expectation and wishing they could coordinate on Something Else Which Is Not That.
If they could coordinate to build a traffic light, they could correlate their actions and only mix between (Straight, Swerve) and (Swerve, Straight). A 50-50 mix of these two gives each player an expected utility of 1.5, which seems pretty fair in terms of the payoffs achievable in this game.
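Here's a small computation comparing the two. Only the (2,1)/(1,2) corners and the 1.5 correlated value come from the discussion above; the crash and mutual-swerve payoffs are assumed for illustration.

```python
from itertools import product

PAYOFFS = {  # assumed Chicken payoffs, written as (row, column)
    ("Straight", "Straight"): (-1, -1),
    ("Straight", "Swerve"):   (2, 1),
    ("Swerve",   "Straight"): (1, 2),
    ("Swerve",   "Swerve"):   (1, 1),
}

def expected_payoffs(joint_distribution):
    """Expected payoffs under a distribution over joint actions."""
    totals = [0.0, 0.0]
    for joint, prob in joint_distribution.items():
        for i in (0, 1):
            totals[i] += prob * PAYOFFS[joint][i]
    return tuple(totals)

# Independent mixing: with these payoffs, the symmetric mixed equilibrium
# has each player going Straight with probability 1/3.
action_prob = {"Straight": 1 / 3, "Swerve": 2 / 3}
mixed = {(a, b): action_prob[a] * action_prob[b]
         for a, b in product(("Straight", "Swerve"), repeat=2)}

# Traffic-light correlation: a 50-50 mix of the two asymmetric outcomes.
correlated = {("Straight", "Swerve"): 0.5, ("Swerve", "Straight"): 0.5}

print(expected_payoffs(mixed))       # roughly (1.0, 1.0): crashes drag the value down
print(expected_payoffs(correlated))  # (1.5, 1.5)
```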
Anything that's mutually unpredictable and mutually observable can be used to correlate actions by different agents. Agents that can easily communicate can use cryptographic commitments to produce legibly fair correlated random signals.
My impression is that being able to perform logical handshakes creates program equilibria that can be better than any correlated equilibrium. When the traffic light says the joint strategy should be (Straight, Swerve), the player told to Swerve has an incentive to actually Swerve rather than go Straight, assuming the other player is going to be playing their part of the correlated equilibrium. But the same trick doesn't work in the Prisoners' Dilemma: a traffic light announcing (Cooperate, Cooperate) doesn't give either player an incentive to actually play their part of that joint strategy. Whereas a logical handshake actually does reshape the players' incentives: they each know that if they deviate from Cooperation, their counterpart will too, and they both prefer (Cooperate, Cooperate) to (Defect, Defect).
I haven't found any results for the phrase "correlated program equilibrium", but cousin_it talks about the setup here:
AIs that have access to each other's code and common random bits can enforce any correlated play by using the quining trick from Re-formalizing PD. If they all agree beforehand that a certain outcome is "good and fair", the trick allows them to "mutually precommit" to this outcome without at all constraining their ability to aggressively play against those who didn't precommit. This leaves us with the problem of fairness.
This gives us the best of both worlds: the random bits can get us any distribution over joint strategies we want, and the logical handshake allows enforcement of that distribution so long as it's better than each player's BATNA. My impression is that it's not always obvious what each player's BATNA is, and in this sequence I recommend techniques like counterfactual mechanism networks to move the BATNA in directions that all players individually prefer and agree are fair.
But in the context of "delegating your decision to a computer program", one reasonable starting BATNA might be "what would all delegates do if they couldn't read each other's source code?" A reasonable decision theory wouldn't give in to inappropriate threats, and this removes the incentive for other decision theories to make them towards us in the first place. In the case of Chicken, the closed-source answer might be something like the mixed strategy we mentioned earlier, with each player independently mixing between Straight and Swerve.
Any logical negotiation needs to improve on this baseline. This can make it a lot easier for our decision theory to resist threats. Like in the next post, AliceBot can spin up an instance to negotiate with BobBot, and basically ignore the content of this negotiation. Negotiator AliceBot can credibly say to BobBot "look, regardless of what you threaten in this negotiation, take a look at my code. Implementer AliceBot won't implement any policy that's worse than the BATNA defined at that level." And this extends recursively throughout the network, like if they perform multiple rounds of negotiation.
I'd been thinking about "cleanness", but I think you're right that "being oriented to what we're even talking about" is more important. Thank you again for the advice!
Thank you! I started writing the previous post in this sequence and decided to break the example off into its own post.
For anyone else looking for a TLDR: this is an example of how a network of counterfactual mechanisms can be used to make logical commitments for an arbitrary game.
Totally! One of the most impressive results I've seen for one-shot games is the Robust Cooperation paper studying the open-source Prisoners' Dilemma, where each player delegates their decision to a program that will learn the exact source code of the other delegate at runtime. Even utterly selfish agents have an incentive to delegate their decision to a program like FairBot or PrudentBot.
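The paper's FairBot is proof-based (it searches for a bounded proof that its counterpart cooperates with it), which doesn't fit in a few lines. Here's a crude simulation-based toy in a similar spirit, just to show the "cooperate iff my counterpart cooperates with me" shape; it is not the construction from the paper.

```python
COOPERATE, DEFECT = "C", "D"

def fair_bot(opponent, depth=3):
    """Cooperate iff a depth-limited simulation of the opponent cooperates
    back against fair_bot.  The base case assumes cooperation, so two
    copies settle on mutual cooperation instead of recursing forever."""
    if depth == 0:
        return COOPERATE
    return COOPERATE if opponent(fair_bot, depth - 1) == COOPERATE else DEFECT

def defect_bot(opponent, depth=0):
    return DEFECT

def cooperate_bot(opponent, depth=0):
    return COOPERATE

print(fair_bot(defect_bot))     # D: an unconditional defector gets no cooperation
print(fair_bot(cooperate_bot))  # C
print(fair_bot(fair_bot))       # C: mutual cooperation with a copy of itself
```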
I think the probabilistic element helps to preserve expected utility in cases where the demands from each negotiator exceed the total amount of resources being bargained over. If each precommits to demand $60 when splitting $100, deterministic rejection leads to ($0, $0) with 100% probability. Whereas probabilistic rejection calls for the evaluator to accept with probability slightly less than $40/$60 ≈ 66.67%. Accepting leads to a payoff of ($60, $40), for an expected joint utility of slightly less than ($40, $26.67).
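Spelling out that arithmetic (using exactly the acceptance probability quoted above, before the "slightly less than" adjustment):

```python
demand, total = 60, 100
remainder = total - demand                    # $40 left for the evaluator

accept_probability = remainder / demand       # 40/60 ≈ 0.6667
expected_payoffs = (demand * accept_probability,
                    remainder * accept_probability)

print(round(accept_probability, 4))                   # 0.6667
print(tuple(round(x, 2) for x in expected_payoffs))   # (40.0, 26.67)
```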
I think there are also totally situations where the asymmetrical power dynamics you're talking about mean that one agent gets to dictate terms and the other gets what they get. Such as "Alice gets to unilaterally decide how $100 will be split, and Bob gets whatever Alice gives him." In the one-shot version of this with selfish players, Alice just takes the $100 and Bob gets $0. Any hope for getting a selfish Alice to do anything else is going to come from incentives beyond this one interaction.
My point is there's a very tenuous jump from us making decisions to how/whether to enforce our preferences on others.
I think the big link I would point to is "politics/economics." The spherical cows in a vacuum model of a modern democracy might be something like "a bunch of agents with different goals, that use voting as a consensus-building and standardization mechanism to decide what rules they want enforced, and contribute resources towards the costs of that enforcement."
When it comes to notions of fairness, I think we agree that there is no single standard which applies in all domains in all places. I would frame it as an XKCD 927 situation, where there are multiple standards being applied in different jurisdictions, and within the same jurisdiction when it comes to different domains. (E.g. restitution vs damages.)
When it comes to a fungible resource like money or pie, I believe Yudkowsky's take is "a fair split is an equal split of the resource itself." One third each for three people deciding how to split a pie. There are well-defined extensions for different types of non-fungibility, and the type of "fairness" achieved seems to be domain-specific.
There are also results in game theory regarding "what does a good outcome for bargaining games look like?" These are also well-defined, and requiring different axioms leads to different bargaining solutions. My current favorite way of defining "fairness" for a bargaining game is the Kalai-Smorodinsky bargaining solution. At the meta level I'm more confident about the attractive qualities of Yudkowsky's probabilistic rejection model, which include working pretty well even when participants disagree about how to define "fairness", and not giving anyone an incentive to exaggerate what they think is fair for them to receive. (Source might contain spoilers for Project Lawful, but Yudkowsky describes the probabilistic rejection model here, and I discuss it more here.)
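As a minimal numerical sketch of the Kalai-Smorodinsky idea, here's a toy bargaining problem I made up (splitting $100, a (0, 0) disagreement point, and an assumed concave utility for one player): the solution is the Pareto-efficient split where each player realizes the same fraction of their ideal gain.

```python
# Toy Kalai-Smorodinsky calculation with made-up utility functions.
u_a = lambda dollars: dollars             # A's utility is linear in money
u_b = lambda dollars: dollars ** 0.5      # B has diminishing returns

ideal_a = u_a(100)                        # the most A could possibly get: 100
ideal_b = u_b(100)                        # the most B could possibly get: 10

# Scan candidate splits for the one where both sides realize the same
# fraction of their ideal gain (relative to the (0, 0) disagreement point).
dollars_to_a = min(
    (cents / 100 for cents in range(0, 10001)),
    key=lambda a: abs(u_a(a) / ideal_a - u_b(100 - a) / ideal_b),
)
print(round(dollars_to_a, 2))             # ~61.8 dollars to A, the rest to B
```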
Applying Yudkowsky's Algorithm to the labor scenario you described might look like having more fairness-oriented negotiations about "under what circumstances a worker can be fired", "what compensation fired workers can expect to receive", and "how much additional work can other workers be expected to perform without an increase in marginal compensation rate." That negotiation might happen at the level of individual workers, unions, labor regulations, or a convoluted patchwork of those and more. I think historically we've made significant gains in defining and enforcing standards for things like fair wages and adequate safety.
This might be a miscommunication: I meant something like "you and I individually might agree that some cost-cutting measures are good and some cost-cutting measures are bad."
Agents probably also have an instrumental reason to coordinate on defining and enforcing standards for things like fair wages and adequate safety, where some agents might otherwise have an incentive to enrich themselves at the expense of others.
I can answer this now!
Expected Utility, Geometric Utility, and Other Equivalent Representations
It turns out there's a large family of expectations we can use to build utility functions, including the arithmetic expectation E, the geometric expectation G, and the harmonic expectation H, and they're all equivalent models of VNM rationality! And we need something beyond that family, like Scott's G[E[U]], to formalize geometric rationality.
Thank you for linking to these different families of means! The quasi-arithmetic mean turned out to be exactly what I needed for this result.
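For concreteness, here's a tiny numerical illustration of the quasi-arithmetic mean recovering the arithmetic, geometric, and harmonic expectations; the outcomes and probabilities are arbitrary.

```python
import numpy as np

def quasi_arithmetic_expectation(f, f_inverse, outcomes, probs):
    """M_f[U] = f^{-1}(E[f(U)]): different choices of f recover the
    arithmetic (f = identity), geometric (f = log), and harmonic
    (f = 1/u) expectations."""
    return float(f_inverse(np.sum(probs * f(outcomes))))

outcomes = np.array([1.0, 2.0, 4.0])      # arbitrary utilities
probs    = np.array([0.25, 0.5, 0.25])    # arbitrary probabilities

E = quasi_arithmetic_expectation(lambda u: u,       lambda v: v,       outcomes, probs)  # 2.25
G = quasi_arithmetic_expectation(np.log,            np.exp,            outcomes, probs)  # 2.0
H = quasi_arithmetic_expectation(lambda u: 1.0 / u, lambda v: 1.0 / v, outcomes, probs)  # ~1.78
print(E, G, H)
```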