StrivingForLegibility

Karma: 376

StrivingForLegibility Feb 8, 2024, 11:29 PM
1 point
0
on: Updatelessness doesn’t solve most problems
It’s certainly not looking very likely (> 80%) that … in causal interactions [most superintelligences] can easily and “fresh-out-of-the-box” coordinate on Pareto optimality (like performing logical or value handshakes) without falling into commitment races.
What are some obstacles to superintelligences performing effective logical handshakes? Or equivalently, what are some necessary conditions that seem difficult to bring about, even for very smart software systems?
(My understanding of the term “logical handshake” is as a generalization of the technique from the Robust Cooperation paper. Something like “I have a model of the other relevant decision-makers, and I will enact my part of the joint policy $\phi$ if I’m sufficiently confident that they’ll all enact their part of $\phi$.” Is that the sort of decision-procedure that seems likely to fall into commitment races?)

StrivingForLegibility Feb 2, 2024, 4:45 AM
1 point
0
in reply to: gwern’s comment on: Incorporating Mechanism Design Into Decision Theory
Thank you! I’m interested in checking out earlier chapters to make sure I understand the notation, but here’s my current understanding:
There are 7 axioms that go into Joyce’s representation theorem, and none of them seem to put any constraints on the set of actions available to the agent. So we should be able to ask a Joyce-rational agent to choose a policy for a game.
My impression of the representation theorem is that a formula like $EU(a) := \sum_{j=1}^{N} P(a \hookrightarrow o_j; x) \cdot U(o_j)$ can represent a variety of decision theories, including ones like CDT that are dynamically inconsistent: they have a well-defined answer to “what do you think is the best policy”, and it’s not necessarily consistent with their answer to “what are you actually going to do?”
So it seems like the axioms are consistent with policy optimization, and they’re also consistent with action optimization. We can ask a decision theory to optimize a policy using an analogous expression: $EU(\pi) := \sum_{j=1}^{N} P(\pi \hookrightarrow o_j; x) \cdot U(o_j)$.
It seems like we should be able to get a lot of leverage by imposing a consistency requirement that these two expressions line up. It shouldn’t matter whether we optimize over actions or policies: the actions taken should be the same.
I don’t expect that fully specifies how to calculate the counterfactual data structures $P(a \hookrightarrow o_j; x)$ and $P(\pi \hookrightarrow o_j; x)$, even with Joyce’s other 7 axioms. But the first 7 didn’t rule out dynamic or counterfactual inconsistency, and this should at least narrow our search down to decision theories that are able to coordinate with themselves at other points in the game tree.

StrivingForLegibility Feb 2, 2024, 12:36 AM
3 points
0
in reply to: Charlie Steiner’s comment on: Incorporating Mechanism Design Into Decision Theory
Totally! The ecosystem I think you’re referring to is all of the programs which, when playing Chicken with each other, manage to play a correlated strategy somewhere on the Pareto frontier between (1,2) and (2,1).
Games like Chicken are actually what motivated me to think in terms of “collaborating to build mechanisms to reshape incentives.” If both players choose their mixed strategy separately, there’s an equilibrium where they independently mix $(\frac{1}{3}, \frac{2}{3})$ between Straight and Swerve respectively. But sometimes this leads to (Straight, Straight) or (Swerve, Swerve), leaving both players with an expected utility of $\frac{2}{3}$ and wishing they could coordinate on Something Else Which Is Not That.
If they could coordinate to build a traffic light, they could correlate their actions and only mix between (Straight, Swerve) and (Swerve, Straight). A 50/50 mix of these two gives each player an expected utility of 1.5, which seems pretty fair in terms of the payoffs achievable in this game.
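As a quick check of the traffic-light arithmetic, here is a minimal sketch. It assumes (this comment doesn't spell out the payoff matrix) that the player going Straight against a Swerve gets 2 and the swerver gets 1, matching the (2,1) and (1,2) frontier endpoints mentioned above. The names `PAYOFFS`, `TRAFFIC_LIGHT`, and `expected_utilities` are made up for illustration.

```python
from fractions import Fraction

# Assumed payoffs for the two off-diagonal outcomes of Chicken.
# Keys are (player 1 action, player 2 action); values are (u1, u2).
PAYOFFS = {
    ("Straight", "Swerve"): (2, 1),
    ("Swerve", "Straight"): (1, 2),
}

# The traffic light: a 50/50 distribution over the two joint actions.
TRAFFIC_LIGHT = {
    ("Straight", "Swerve"): Fraction(1, 2),
    ("Swerve", "Straight"): Fraction(1, 2),
}


def expected_utilities(distribution, payoffs):
    """Expected utility of each player under a distribution over joint actions."""
    eu_1 = sum(p * payoffs[joint][0] for joint, p in distribution.items())
    eu_2 = sum(p * payoffs[joint][1] for joint, p in distribution.items())
    return eu_1, eu_2


correlated_eu = expected_utilities(TRAFFIC_LIGHT, PAYOFFS)
print(correlated_eu)  # each player gets 3/2
```

The sketch deliberately leaves out the (Straight, Straight) and (Swerve, Swerve) payoffs, since the traffic light never selects those outcomes.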
Anything that’s mutually unpredictable and mutually observable can be used to correlate actions by different agents. Agents that can easily communicate can use cryptographic commitments to produce legibly fair correlated random signals.
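One standard way to get such a signal is a commit-then-reveal coin flip. The sketch below is an illustration under simplifying assumptions, not a vetted protocol: each agent commits to a private bit with a salted SHA-256 hash, both reveal, and the XOR of the bits is the traffic light. The function names (`commit`, `verify`, `joint_action_from_bits`, `run_protocol`) are hypothetical, and both agents are simulated in one process.

```python
import hashlib
import hmac
import secrets


def commit(bit: int) -> tuple[bytes, bytes]:
    """Commit to a bit. Returns (commitment, nonce); the nonce stays secret until reveal."""
    nonce = secrets.token_bytes(32)
    commitment = hashlib.sha256(nonce + bytes([bit])).digest()
    return commitment, nonce


def verify(commitment: bytes, nonce: bytes, bit: int) -> bool:
    """Check that a revealed (nonce, bit) pair matches an earlier commitment."""
    expected = hashlib.sha256(nonce + bytes([bit])).digest()
    return hmac.compare_digest(expected, commitment)


def joint_action_from_bits(bit_a: int, bit_b: int) -> tuple[str, str]:
    """XOR the two bits; neither agent alone controls the result."""
    if bit_a ^ bit_b == 0:
        return ("Straight", "Swerve")
    return ("Swerve", "Straight")


def run_protocol() -> tuple[str, str]:
    # Step 1: each agent picks a private random bit and publishes a commitment.
    bit_a, bit_b = secrets.randbelow(2), secrets.randbelow(2)
    commitment_a, nonce_a = commit(bit_a)
    commitment_b, nonce_b = commit(bit_b)
    # Step 2: after both commitments are public, each agent reveals (nonce, bit).
    if not (verify(commitment_a, nonce_a, bit_a) and verify(commitment_b, nonce_b, bit_b)):
        raise ValueError("a revealed bit does not match its commitment")
    # Step 3: the shared signal selects the joint action.
    return joint_action_from_bits(bit_a, bit_b)
```

As long as at least one agent picks its bit uniformly at random, the XOR is uniform, which is what makes the signal legibly fair to both.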
My impression is that being able to perform logical handshakes creates program equilibria that can be better than any correlated equilibrium. When the traffic light says the joint strategy should be (Straight, Swerve), the player told to Swerve has an incentive to actually Swerve rather than go Straight, assuming the other player is going to be playing their part of the correlated equilibrium. But the same trick doesn’t work in the Prisoners’ Dilemma: a traffic light announcing (Cooperate, Cooperate) doesn’t give either player an incentive to actually play their part of that joint strategy. Whereas a logical handshake actually does reshape the players’ incentives: they each know that if they deviate from Cooperation, their counterpart will too, and they both prefer (Cooperate, Cooperate) to (Defect, Defect).
I haven’t found any results for the phrase “correlated program equilibrium”, but cousin_it talks about the setup here:
AIs that have access to each other’s code and common random bits can enforce any correlated play by using the quining trick from Re-formalizing PD. If they all agree beforehand that a certain outcome is “good and fair”, the trick allows them to “mutually precommit” to this outcome without at all constraining their ability to aggressively play against those who didn’t precommit. This leaves us with the problem of fairness.
This gives us the best of both worlds: the random bits can get us any distribution over joint strategies we want, and the logical handshake allows enforcement of that distribution so long as it’s better than each player’s BATNA. My impression is that it’s not always obvious what each player’s BATNA is, and in this sequence I recommend techniques like counterfactual mechanism networks to move the BATNA in directions that all players individually prefer and agree are fair.
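To make the combination of shared random bits and a handshake concrete, here is a toy sketch for Chicken. It is a heavy simplification: the handshake is a plain source-equality check (in the spirit of the quining trick, with each program simply handed its own source rather than quining), not the proof-based handshake from the Robust Cooperation paper. The program texts, the `role` argument, and the helper names `run` and `play` are all assumptions made for illustration, and the fallback reuses the $(\frac{1}{3}, \frac{2}{3})$ mix mentioned earlier as a stand-in BATNA.

```python
# Each delegate is a piece of source code defining act(...).
HANDSHAKE_BOT = '''
import random

def act(own_source, other_source, role, shared_bit):
    if other_source == own_source:
        # Handshake succeeded: obey the traffic light.
        # The player whose role number equals the shared bit goes Straight.
        return "Straight" if role == shared_bit else "Swerve"
    # Handshake failed: fall back to the assumed closed-source mixed strategy.
    return "Straight" if random.random() < 1 / 3 else "Swerve"
'''

ALWAYS_STRAIGHT = '''
def act(own_source, other_source, role, shared_bit):
    return "Straight"
'''


def run(source, other_source, role, shared_bit):
    """Execute a delegate's source, giving it both sources, its role, and the shared bit."""
    namespace = {}
    exec(source, namespace)
    return namespace["act"](source, other_source, role, shared_bit)


def play(source_a, source_b, shared_bit):
    """Return the joint action when two delegates can read each other's source."""
    return (
        run(source_a, source_b, 0, shared_bit),
        run(source_b, source_a, 1, shared_bit),
    )


print(play(HANDSHAKE_BOT, HANDSHAKE_BOT, 0))
print(play(HANDSHAKE_BOT, HANDSHAKE_BOT, 1))
```

Two copies of `HANDSHAKE_BOT` always land on one of the two coordinated outcomes, while a delegate that doesn't match gets the fallback rather than a concession.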
But in the context of “delegating your decision to a computer program”, one reasonable starting BATNA might be “what would all delegates do if they couldn’t read each other’s source code?” A reasonable decision theory wouldn’t give in to inappropriate threats, and this removes the incentive for other decision theories to make them towards us in the first place. In the case of Chicken, the closed-source answer might be something like the mixed strategy we mentioned earlier: a $(\frac{1}{3}, \frac{2}{3})$ mixture between Straight and Swerve.
Any logical negotiation needs to improve on this baseline. This can make it a lot easier for our decision theory to resist threats. Like in the next post, AliceBot can spin up an instance to negotiate with BobBot, and basically ignore the content of this negotiation. Negotiator AliceBot can credibly say to BobBot “look, regardless of what you threaten in this negotiation, take a look at my code. Implementer AliceBot won’t implement any policy that’s worse than the BATNA defined at that level.” And this extends recursively throughout the network, like if they perform multiple rounds of negotiation.

Counterfactual Mechanism Networks

StrivingForLegibility Jan 30, 2024, 8:30 PM

4 points

0 comments · 5 min read · LW link

StrivingForLegibility Jan 28, 2024, 2:00 AM
1 point
0
in reply to: Carl Feynman’s comment on: To Boldly Code
I’d been thinking about “cleanness”, but I think you’re right that “being oriented to what we’re even talking about” is more important. Thank you again for the advice!

StrivingForLegibility Jan 27, 2024, 12:43 AM
4 points
2
in reply to: Carl Feynman’s comment on: To Boldly Code
Thank you! I started writing the previous post in this sequence and decided to break the example off into its own post.
For anyone else looking for a TLDR: this is an example of how a network of counterfactual mechanisms can be used to make logical commitments for an arbitrary game.

To Boldly Code

StrivingForLegibility Jan 26, 2024, 6:25 PM

25 points

4 comments · 3 min read · LW link

Incorporating Mechanism Design Into Decision Theory

StrivingForLegibility Jan 26, 2024, 6:25 PM

17 points

4 comments · 4 min read · LW link

StrivingForLegibility Jan 23, 2024, 3:41 AM
3 points
2
in reply to: Dagon’s comment on: Incorporating Justice Theory into Decision Theory
Totally! One of the most impressive results I’ve seen for one-shot games is the Robust Cooperation paper studying the open-source Prisoners’ Dilemma, where each player delegates their decision to a program that will learn the exact source code of the other delegate at runtime. Even utterly selfish agents have an incentive to delegate their decision to a program like FairBot or PrudentBot.
I think the probabilistic element helps to preserve expected utility in cases where the negotiators’ combined demands exceed the total amount of resources being bargained over. If each precommits to demand $60 when splitting $100, deterministic rejection leads to ($0, $0) with 100% probability. Probabilistic rejection instead calls for the evaluator to accept with probability slightly less than $40/$60 ≈ 66.67%. Accepting leads to a payoff of ($60, $40), for an expected joint payoff of slightly less than ($40, $26.67).
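The numbers in that example can be reproduced with a short sketch. It computes the upper bound on the acceptance probability (the “slightly less than” margin is left out), and the function name `acceptance_probability` and the variable names are made up for illustration.

```python
from fractions import Fraction


def acceptance_probability(demand, fair_share):
    """Upper bound on the acceptance probability under probabilistic rejection.

    `fair_share` is what the evaluator thinks the proposer should get. Accepting
    with probability fair_share / demand means the proposer expects no more than
    fair_share, so over-demanding doesn't pay.
    """
    if demand <= fair_share:
        return Fraction(1)
    return Fraction(fair_share, demand)


total = 100
proposer_demand = 60
evaluator_demand = 60

# The evaluator thinks the proposer should get what's left after the evaluator's own demand.
fair_share_for_proposer = total - evaluator_demand  # $40

p_accept = acceptance_probability(proposer_demand, fair_share_for_proposer)
expected_payoffs = (
    p_accept * proposer_demand,            # proposer's expected payoff
    p_accept * (total - proposer_demand),  # evaluator's expected payoff
)

print(p_accept)                                  # 2/3
print(tuple(float(x) for x in expected_payoffs)) # roughly (40.0, 26.67)
```

In practice the evaluator would shade the probability slightly below this bound, so that demanding more than the fair share strictly loses rather than merely breaking even.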
I think there are also situations where the asymmetrical power dynamics you’re talking about mean that one agent gets to dictate terms and the other gets what they get, such as “Alice gets to unilaterally decide how $100 will be split, and Bob gets whatever Alice gives him.” In the one-shot version of this with selfish players, Alice just takes the $100 and Bob gets $0. Any hope for getting a selfish Alice to do anything else is going to come from incentives beyond this one interaction.
What links here?
- The Carnot Engine of Economics by StrivingForLegibility (Aug 9, 2024, 3:59 PM; 5 points)

Reframing Acausal Trolling as Acausal Patronage

StrivingForLegibility Jan 23, 2024, 3:04 AM

14 points

0 comments · 2 min read · LW link

StrivingForLegibility Jan 23, 2024, 12:28 AM
1 point
0
in reply to: Dagon’s comment on: Incorporating Justice Theory into Decision Theory
My point is there’s a very tenuous jump from us making decisions to how/whether to enforce our preferences on others.
I think the big link I would point to is “politics/economics.” The spherical-cows-in-a-vacuum model of a modern democracy might be something like “a bunch of agents with different goals, who use voting as a consensus-building and standardization mechanism to decide what rules they want enforced, and contribute resources towards the costs of that enforcement.”
When it comes to notions of fairness, I think we agree that there is no single standard which applies in all domains in all places. I would frame it as an XKCD 927 situation, where there are multiple standards being applied in different jurisdictions, and within the same jurisdiction when it comes to different domains. (E.g. restitution vs damages.)
When it comes to a fungible resource like money or pie, I believe Yudkowsky’s take is “a fair split is an equal split of the resource itself.” One third each for three people deciding how to split a pie. There are well-defined extensions for different types of non-fungibility, and the type of “fairness” achieved seems to be domain-specific.
There are also results in game theory regarding “what does a good outcome for bargaining games look like?” These are also well-defined, and requiring different axioms leads to different bargaining solutions. My current favorite way of defining “fairness” for a bargaining game is the Kalai-Smorodinsky bargaining solution. At the meta-level I’m more confident about the attractive qualities of Yudkowsky’s probabilistic rejection model, which include working pretty well even when participants disagree about how to define “fairness”, and not giving anyone an incentive to exaggerate what they think is fair for them to receive. (Source might contain spoilers for Project Lawful, but Yudkowsky describes the probabilistic rejection model here, and I discuss it more here.)
Applying Yudkowsky’s Algorithm to the labor scenario you described might look like having more fairness-oriented negotiations about “under what circumstances a worker can be fired”, “what compensation fired workers can expect to receive”, and “how much additional work can other workers be expected to perform without an increase in marginal compensation rate.” That negotiation might happen at the level of individual workers, unions, labor regulations, or a convoluted patchwork of those and more. I think historically we’ve made significant gains in defining and enforcing standards for things like fair wages and adequate safety.

StrivingForLegibility Jan 22, 2024, 6:12 PM
1 point
0
in reply to: Dagon’s comment on: Incorporating Justice Theory into Decision Theory
This might be a miscommunication; I meant something like “you and I individually might agree that some cost-cutting measures are good and some cost-cutting measures are bad.”
Agents probably also have an instrumental reason to coordinate on defining and enforcing standards for things like fair wages and adequate safety, where some agents might otherwise have an incentive to enrich themselves at the expense of others.

StrivingForLegibility Jan 22, 2024, 6:00 PM
1 point
0
in reply to: Richard_Kennaway’s comment on: Incorporating Justice Theory into Decision Theory
Oops, when I heard about it I’d gotten the impression that this had been adopted by at least one AI firm, even a minor one, but I also can’t find anything suggesting that’s the case. Thank you!
It looks like OpenAI has split into a nonprofit organization and a “capped-profit” company.
The fundamental idea of OpenAI LP is that investors and employees can get a capped return if we succeed at our mission, which allows us to raise investment capital and attract employees with startup-like equity. But any returns beyond that amount—and if we are successful, we expect to generate orders of magnitude more value than we’d owe to people who invest in or work at OpenAI LP—are owned by the original OpenAI Nonprofit entity.
OpenAI Nonprofit could act like the Future of Life Institute’s proposed Windfall Trust, and a binding commitment to do so would be a Windfall Clause. They could also do something else prosocial with those profits, consistent with their nonprofit status.

StrivingForLegibility Jan 22, 2024, 4:37 PM
5 points
0
in reply to: Richard_Kennaway’s comment on: Incorporating Justice Theory into Decision Theory
I think we agree that in cases where competition is leading to good results, no change to the dynamics is called for.
We probably also agree on a lot of background value judgements like “when businesses become more competitive by spending less on things no one wants, like waste or pollution, that’s great!” And “when businesses become more competitive by spending less on things people want, like fair wages or adequate safety, that’s not great and intervention is called for.”
One case where we might literally want to distribute resources from the makers of a valuable product, to their competitors and society at large, is the development of Artificial General Intelligence (AGI). One of the big causes for concern here is that the natural dynamics might be winner-take-all, leading to an arms race that sacrifices spending on safety in favor of spending on increased capabilities or an earlier launch date.
If instead all AGI developers believed that the gains from AGI development would be spread out much more evenly, this might help to divert spending away from increasing capabilities and deploying as soon as possible, and towards making sure that deployment is done safely. ~~Many AI firms have already voluntarily signed Windfall Clauses, committing to share significant fractions of the wealth generated by successful AGI development.~~
EDIT: At the time of writing, it looks like Windfall Clauses have been advocated for but not adopted. Thank you Richard_Kennaway for the correction!

StrivingForLegibility Jan 22, 2024, 3:51 PM
5 points
0
in reply to: Dagon’s comment on: Incorporating Justice Theory into Decision Theory
For games without these mechanisms, the rational outcomes don’t end up that pleasant. Except sometimes, with players who have extra-rational motives.
I think we agree that if a selfish agent needs to be forced to not treat others poorly, in the absence of such enforcement they will treat others poorly.
It also seems like in many cases, selfish agents have an incentive to create exactly those mechanisms ensuring good outcomes for everyone, because it leads to good outcomes for them in particular. A nation-state composed entirely of very selfish people would look a lot different from any modern country, but its people would face the same instrumental reasons to pool resources to enforce laws. The more inclined their populace is towards mistreatment in the absence of enforcement, the higher those enforcement costs need to be in order to achieve the same level of good treatment.
I also think “fairness” is a Schelling point that even selfish agents can agree to coordinate around, in a way that they could never be aligned on “maximizing Zaire’s utility in particular.” They don’t need to value fairness directly to agree that “an equal split of resources is the only compromise we’re all going to agree on during this negotiation.”
So I think my optimism comes from at least two places:
- Even utterly selfish agents still have an incentive to create mechanisms enforcing good outcomes for everyone.
- People have at least some altruism, and are willing to pay costs to prevent mistreatment of others in many cases.

StrivingForLegibility Jan 22, 2024, 12:03 AM
1 point
1
in reply to: Richard_Kennaway’s comment on: Incorporating Justice Theory into Decision Theory
That sounds reasonable to me! This could be another negative externality that we judge to be acceptable, and that we don’t want to internalize. Something like “if you break any of these rules, (e.g. worker safety, corporate espionage, etc.) then you owe the affected parties compensation. But as long as you follow the rules, there is no consensus-recognized debt.”

Incorporating Justice Theory into Decision Theory

StrivingForLegibility Jan 21, 2024, 7:17 PM

13 points

20 comments · 5 min read · LW link

Legibility Makes Logical Line-Of-Sight Transitive

StrivingForLegibility Jan 19, 2024, 11:39 PM

13 points

0 comments · 5 min read · LW link

Logical Line-Of-Sight Makes Games Sequential or Loopy

StrivingForLegibility Jan 19, 2024, 4:05 AM

40 points

0 comments · 7 min read · LW link

In Strategic Time, Open-Source Games Are Loopy

StrivingForLegibility Jan 18, 2024, 12:08 AM

21 points

2 comments · 6 min read · LW link