Okay, but why isn’t this exactly the same as them just thinking to themselves “conditional on me taking action K, here’s the distribution over their actions” for each of N actions they could take, and then maximizing expected value?
The main trick with PD is that instead of an agent having only two possible actions, C and D, we consider the many programs the agent might self-modify into (commit to becoming), each of which might in the end compute C or D. This effectively changes the action space: there are now many more possible actions. These programs/actions can be given access (like quines, by their own construction) to the initial source code of all the agents, and are allowed to reason about them. But then the programs have logical uncertainty about how they will in the end behave, so the things you’d be enumerating don’t immediately cash out in expected values. And these programs can decide to cause different expected values depending on what you’ll do with their behavior, anticipating how you reason about them by reasoning about you in turn. It’s hard to find clear arguments for why any particular desirable thing could happen as a result of this setup.
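For concreteness, here is a minimal toy sketch (my own illustration, all names hypothetical) of the enlarged action space: each player submits a program that is handed both players’ source code and only then outputs C or D. It uses literal source-equality checking (a “CliqueBot”) rather than programs that prove things about each other, so it shows only the shape of the setup and sidesteps the logical uncertainty that makes the general case hard.

```python
# Toy program-swap Prisoner's Dilemma: the "action" is a whole program that
# receives both players' source code before outputting C or D.
import inspect

def clique_bot(my_source, their_source):
    """Cooperate exactly when the opponent is running the same source code."""
    return "C" if my_source == their_source else "D"

def defect_bot(my_source, their_source):
    """Ignores the source code and always defects."""
    return "D"

def play(prog1, prog2):
    s1, s2 = inspect.getsource(prog1), inspect.getsource(prog2)
    return prog1(s1, s2), prog2(s2, s1)

print(play(clique_bot, clique_bot))  # ('C', 'C'): cooperation by source matching
print(play(clique_bot, defect_bot))  # ('D', 'D'): the cooperator can't be exploited
```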
UDT is notable for being one way of making this work. The “open source game theory” of PD (through Löb’s theorem, modal fixpoints, Payor’s lemma) pinpoints some cases where it’s possible to say that we get cooperation in PD. But in general it’s proven difficult to say anything both meaningful and flexible about this seemingly in-broad-strokes-inevitable setup, in particular for agents with different values that are doing more general things than playing PD.
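For reference, the two modal facts behind the parenthetical, in their standard form (a recap of well-known statements, not anything specific to this discussion):

```latex
% Löb's theorem: if the theory proves "provability of P implies P", then it proves P.
\vdash\; \Box P \to P \quad\Longrightarrow\quad \vdash\; P

% Payor's lemma: the same conclusion from a self-referential hypothesis.
\vdash\; \Box(\Box P \to P) \to P \quad\Longrightarrow\quad \vdash\; P

% Sketch of the usual application to PD: let P stand for "both agents cooperate",
% and let each agent i commit to cooperating whenever \Box(\Box P \to P) holds,
% i.e. \vdash \Box(\Box P \to P) \to x_i. Conjoining the two commitments gives the
% hypothesis of Payor's lemma, so \vdash P: mutual cooperation is provable.
```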
(The following relies a little bit on motivation given in the other comment.)
When both A and B consider listening to a shared subagent C, subagent C is itself considering what it should be doing, depending on what A and B do with C’s behavior. So for example with A, there are two stages of computation to consider: first it was just A, not yet having decided to sign the contract; then it became a composite system P(C), where P is A’s policy for giving influence to C’s behavior (possibly P and A include a larger part of the world where the first agent exists, not just the agent itself). The commitment of A is to the truth of the equality A=P(C), which gives C influence over the computational consequences of A in the particular shape P. The trick with the logical time of this process is that C should be able to know (something about) P updatelessly, without being shown observations of what P is, so that the instance of C within B would also know of P and be able to take it into account in choosing its joint policy that acts through both A and B. (Of course, the same is happening within B.)
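A minimal toy sketch of the A=P(C) shape (my own illustration, all names hypothetical): each agent’s wrapper policy decides what to do with C’s recommendation, and C, shown both wrapper policies up front rather than observations of where it is running, computes one joint recommendation that both of its instances agree on.

```python
# Toy sketch of the commitments A = P(C) and B = Q(C): one shared subagent C is
# instantiated inside both agents and is given both wrapper policies up front,
# so each instance computes the same joint recommendation without observing
# which agent it sits inside.

PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
           ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def subagent_C(wrapper_A, wrapper_B):
    """Pick the joint recommendation with the best total payoff after the
    wrappers have filtered it (a stand-in for C's actual preferences)."""
    def total(rec):
        outcome = (wrapper_A(rec[0]), wrapper_B(rec[1]))
        return sum(PAYOFFS[outcome])
    return max(PAYOFFS, key=total)

def P(recommendation):
    """A's wrapper policy: in this toy case, full influence for C."""
    return recommendation

def Q(recommendation):
    """B's wrapper policy: also defers to C here."""
    return recommendation

# The composite systems A = P(C(...)) and B = Q(C(...)).
rec = subagent_C(P, Q)
print(P(rec[0]), Q(rec[1]))  # C C -- the shared subagent coordinates both agents
```

The interesting cases are wrappers that only partially defer, and a C that has to reason under logical uncertainty about what the wrappers will do with its output; the sketch elides both.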
This sketch frames decision making without directly appealing to consequentialism. Here, A controls B through the incentives P it creates for C (a particular way in which C gets to project influence from A’s place in the world), where C also has influence over B. So A doesn’t seek to manipulate B directly by considering the consequences for B’s behavior of various ways that A might behave.