If the past agent doesn’t try to wade into the murky waters of logical updatelessness, it’s not really dumber or more fallible to trickery; it can see everything the way a universal Turing machine or Solomonoff induction can “see everything”.
I actually think it might still be more fallible, for a couple of reasons.
Firstly, consider an agent which, at time T, respects all commitments it would have made at times up to T. Now if you’re trying to attack the agent at time T, you have T different versions of it that you can attack, and if any of them makes a dumb commitment then you win.
I guess you could account for this by just gradually increasing the threshold for making commitments over time, though.
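To put a toy number on the first worry (my own illustrative model, nothing claimed above): if each of the T past versions independently has some small probability p of being talked into a dumb commitment, an attacker who can target any of them succeeds with probability $1 - (1-p)^T$, which creeps toward 1 as T grows. Gradually raising the commitment threshold amounts to forcing the per-version probability to shrink over time fast enough that this total exposure stays bounded.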
Secondly: the further back you go, the more farsighted the past agent needs to be about the consequences of its commitments. If you have any compounding mistakes in the way it expects things to play out, then it’ll just get worse and worse the further back you defer.
Again, I guess you could account for this by having a higher threshold for making commitments which you expect to benefit you further down the line.
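The second worry can be given the same kind of toy treatment (again, my own illustrative model): if the past agent’s one-step forecasts each carry a small relative error $\varepsilon$, its picture of events $k$ steps ahead is off by roughly $(1+\varepsilon)^k - 1$, which compounds rather than averaging out. So the expected benefit needed to justify a commitment should grow with the horizon $k$ at which it is supposed to pay off, which is exactly the higher threshold suggested above.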
Then, re logical updatelessness: it feels like in the long term we need to unify logical + empirical updates, because they’re roughly the same type of thinking. Murky waters perhaps, but necessary ones.
Though a synthesis between coordination of agents with different values and UDT (recognizing Schelling point contracts as a central construction) is long overdue.
Yeah, so what could this look like? I think one important idea is that you don’t have to be deferring to your past self, it’s just that your past self is the clearest Schelling point. But it wouldn’t be crazy for me to, say, use BDT: Buddha Decision Theory, in which I obey all commitments that the Buddha would have made for me if he’d been told about my situation. The differences between me using UDT and BDT (tentatively) seem only qualitative to me, not quantitative. BDT makes it harder for me to cooperate with hypothetical copies of myself who hadn’t yet thought of BDT (because “Buddha” is less of a Schelling point amongst copies of myself than “past Richard”). It also makes me worse off than UDT in some cases, because sometimes the Buddha would make commitments in favor of his interests, not mine. But it also makes it a bit easier for me to cooperate with others, who might also converge to BDT.
At this point I’m starting to suspect that solving UDT 2 is not just alignment-complete, it’s also politics- and sociology-complete. The real question is whether we can isolate toy examples or problems in which these ideas can be formalized, rather than just having them remain vague “what if everyone got along” speculation.
UDT doesn’t do multistage commitments; it has a single all-powerful “past” version that looks into all possible futures before pronouncing a global policy that all of them would then follow. This policy is not a collection of commitments in any reasonable informal sense; it’s literally all details of the behavior of future versions of the agent in response to all possible observations, and in the case of logical updatelessness, also in response to all possible observations of computational facts. (UDT for the idealized past version defines a single master model; future versions are just passively running inference from the contexts of their particular situations.)
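As a concrete (and heavily simplified) rendering of that picture, here’s a toy sketch of my own: the idealized past version scores whole observation-to-action maps against a prior over worlds and commits to the best one. All the names and payoffs below are made up for illustration.

```python
from itertools import product

# Toy UDT sketch (illustrative only): the idealized "past" version evaluates whole
# observation->action policies against a prior over worlds, then commits to the best
# policy; future versions just look up their observation in it.

observations = ["obs_heads", "obs_tails"]
actions = ["A", "B"]
worlds = {"world_heads": 0.5, "world_tails": 0.5}  # made-up prior

def utility(world, policy):
    # Made-up payoffs with a UDT flavour: in world_heads the reward depends on what
    # the policy *would* do on obs_tails, so no observation can be optimized in isolation.
    if world == "world_heads":
        return 10 if policy["obs_heads"] == "A" and policy["obs_tails"] == "A" else 0
    return 6 if policy["obs_tails"] == "B" else 1

# Enumerate every global policy and pick the one with the best prior expected utility.
policies = [dict(zip(observations, acts)) for acts in product(actions, repeat=len(observations))]
best = max(policies, key=lambda p: sum(pr * utility(w, p) for w, pr in worlds.items()))
print(best)  # {'obs_heads': 'A', 'obs_tails': 'A'}: locally suboptimal on obs_tails, globally best
```

The made-up payoff is only there to show why the choice has to happen at the level of the whole policy: once obs_tails is actually seen, answering B looks better, but the globally committed policy answers A.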
The convergent idea for acausal coordination between systems A and B seems to be constructing a shared subagent C whose instances exist as part of both A and B (after A and B have both successfully constructed the same C, not before), so that C can then act within them in the style of FDT, though really it’s mostly about C thinking of the effects of its behavior in terms of “I am an algorithm” rather than “I am a physical object”. (For UDT, the shared subagent C is the idealized common past version of its different possible future versions A and B. This assumes that A and B already have a lot in common, so maybe C is instead the Buddha.)
The bulk of the blind alleys seem to be about allowing subagents various superpowers, instead of focusing on managing the fallout of making them small and bounded (but possibly more plentiful). I think this is where investigations into logical updatelessness go wrong. It does need solving, but not by treating some fact as unknown globally, or even as unknown at certain logical times. Instead, a fact can remain unknown to some small subagent, and can be observed by it at some point, or computed by another subagent. Values are also knowledge, so sufficiently small subagents shouldn’t even by default know the full values of the larger system, and should be prepared to learn more about them. This is a consideration that doesn’t even depend on there initially being multiple big agents with different values.
Another point is that coordination doesn’t necessarily require constructing exactly the same shared subagent, or at least not “exactly the same” in a straightforward sense, as the results on coordination in PD illustrate. The role of subagents in this case is that A can create a subagent CA, while B creates a subagent CB. And even where A and B remain intractable for each other, CA and CB can be much smaller and, by construction, prepared to coordinate with each other from within A and B. (It seems natural for the big agents to treat such subagents as something like their copies of an assurance contract, which is signed through a commitment to give them influence over the big agent’s thinking or behavior. And letting contracts be agents in their own right gives a lot of flexibility in the coordination they can arrange.)
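A minimal sketch of that contract picture (my own toy construction, with a crude same-source-code check standing in for whatever “prepared to coordinate with each other” ends up meaning):

```python
# Toy "assurance contract as subagent" sketch (illustrative only).
CONTRACT_SOURCE = """
def joint_policy(game):
    # The contract thinks of itself as one algorithm running inside both parties,
    # so it recommends the jointly best outcome rather than a unilateral best response.
    if game == "PD":
        return {"A": "Cooperate", "B": "Cooperate"}
    return None
"""

def make_contract():
    namespace = {}
    exec(CONTRACT_SOURCE, namespace)
    return CONTRACT_SOURCE, namespace["joint_policy"]

def big_agent(name, game, own_contract, other_contract_source):
    source, policy = own_contract
    # "Signing" the contract: give it influence only if the other party verifiably
    # runs an equivalent contract; otherwise keep the default behaviour.
    if other_contract_source == source:
        return policy(game)[name]
    return "Defect"

contract_A = make_contract()  # C_A, constructed inside A
contract_B = make_contract()  # C_B, constructed inside B
print(big_agent("A", "PD", contract_A, contract_B[0]),
      big_agent("B", "PD", contract_B, contract_A[0]))  # Cooperate Cooperate
```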
Okay, so trying to combine the Prisoner’s Dilemma and UDT, we get: A and B are in a Prisoner’s Dilemma. Suppose they have a list of N agents (which includes, say, A’s past self, B’s past self, the Buddha, etc.), and they each must commit to following one of those agents’ instructions. Each of them estimates: “conditional on me committing to listen to agent K, here’s a distribution over which agent they’d commit to listen to.” And then you maximize expected value based on that.
Okay, but why isn’t this exactly the same as them just thinking to themselves “conditional on me taking action K, here’s the distribution over their actions” for each of N actions they could take, and then maximizing expected value? It feels like the difference is that it’s really hard to actually reason about the correlations between my low-level actions and your low-level actions, whereas it might be easier to reason about the correlations between my high-level commitments and your high-level commitments.
I.e. the role of the Buddha in this situation is just to make the acausal coordination here much easier.
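Here’s a small sketch of that estimate-then-maximize procedure from A’s side (my own formalization; the payoffs and conditional probabilities are invented purely for illustration):

```python
# Toy "commit to an adviser" sketch (illustrative only): A chooses which agent to
# defer to by conditioning on its own commitment, not on its low-level actions.
advisers = ["A_past_self", "B_past_self", "Buddha"]

# Invented payoffs to A: deferring to the same adviser as B lets that adviser
# implement joint cooperation; mismatched commitments fall back to a worse outcome.
def payoff_A(a_choice, b_choice):
    return 3 if a_choice == b_choice else 1

def b_given_a(a_choice):
    # A's assumed estimate of B's commitment, conditional on A's own commitment.
    # Committing to a shared Schelling point ("Buddha") is treated as strong evidence
    # that B does the same; committing to one's own past self is weak evidence.
    p_match = {"Buddha": 0.8, "A_past_self": 0.2, "B_past_self": 0.2}[a_choice]
    others = [x for x in advisers if x != a_choice]
    dist = {a_choice: p_match}
    for x in others:
        dist[x] = (1 - p_match) / len(others)
    return dist

def expected_value(a_choice):
    return sum(p * payoff_A(a_choice, b_choice) for b_choice, p in b_given_a(a_choice).items())

print(max(advisers, key=expected_value))  # Buddha
```

Under these made-up numbers the shared Schelling point wins purely because the correlation between the two commitments is easier to be confident in, which is the point above.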
Okay, but why isn’t this exactly the same as them just thinking to themselves “conditional on me taking action K, here’s the distribution over their actions” for each of N actions they could take, and then maximizing expected value?
The main trick with PD is that instead of an agent only having two possible actions, C and D, we consider the many programs the agent might self-modify into (commit to becoming), each of which might in the end compute C or D. This effectively changes the action space: there are now many more possible actions. And these programs/actions can be given access (like quines, by their own construction) to the initial source code of all the agents, and allowed to reason about them. But then the programs have logical uncertainty about how they will in the end behave, so the things you’d be enumerating don’t immediately cash out in expected values. And these programs can decide to cause different expected values depending on what you’ll do with their behavior, anticipating how you reason about them by reasoning about you in turn. It’s hard to find clear arguments for why any particular desirable thing could happen as a result of this setup.
UDT is notable for being one way of making this work. The “open source game theory” of PD (through Löb’s theorem, modal fixpoints, Payor’s lemma) pinpoints some cases where it’s possible to say that we get cooperation in PD. But in general it’s proven difficult to say anything both meaningful and flexible about this seemingly in-broad-strokes-inevitable setup, in particular for agents with different values that are doing more general things than playing PD.
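For concreteness, the very simplest instance of “programs as the action space” is the source-matching bot below; it’s a standard toy, not the Löbian constructions just mentioned, which cooperate with logically equivalent rather than textually identical programs:

```python
# Toy open-source Prisoner's Dilemma (illustrative only): players submit programs
# that can read both programs' source code before choosing C or D.
CLIQUE_BOT = """
def act(my_source, their_source):
    # Cooperate exactly when the opponent is running this very same program.
    return "C" if their_source == my_source else "D"
"""

DEFECT_BOT = """
def act(my_source, their_source):
    return "D"
"""

def run(program, opponent):
    env = {}
    exec(program, env)
    return env["act"](program, opponent)

print(run(CLIQUE_BOT, CLIQUE_BOT), run(CLIQUE_BOT, CLIQUE_BOT))  # C C: mutual cooperation
print(run(CLIQUE_BOT, DEFECT_BOT), run(DEFECT_BOT, CLIQUE_BOT))  # D D: no exploitation
```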
(The following relies a little bit on motivation given in the other comment.)

When both A and B consider listening to a shared subagent C, subagent C is itself considering what it should be doing, depending on what A and B do with C’s behavior. So for example with A there are two stages of computation to consider: first, it was A and hadn’t yet decided to sign the contract; then it became a composite system P(C), where P is A’s policy for giving influence to C’s behavior (possibly P and A include a larger part of the world where the first agent exists, not just the agent itself). The commitment of A is to the truth of the equality A=P(C), which gives C influence over the computational consequences of A in the particular shape P. The trick with the logical time of this process is that C should be able to know (something about) P updatelessly, without being shown observations of what it is, so that the instance of C within B would also know of P and be able to take it into account in choosing its joint policy that acts both through A and B. (Of course, the same is happening within B.)
This sketch frames decision making without directly appealing to consequentialism. Here, A controls B through the incentives P it creates for C (a particular way in which C gets to project influence from A’s place in the world), where C also has influence over B. So A doesn’t seek to manipulate B directly by considering the consequences for B’s behavior of various ways that A might behave.
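To make the A=P(C) shape slightly more concrete, here’s a toy rendering of my own (everything in it is an invented stand-in): each big agent rewrites itself into a wrapper P around the shared subagent C, and C picks one joint recommendation while knowing the shape of both wrappers as part of its definition rather than as an observation.

```python
# Toy A = P(C) sketch (illustrative only).
def C(wrapper_shapes):
    # C knows the committed wrapper shapes "updatelessly" (they are part of its
    # definition, not an observation), and picks one joint policy for all of its
    # instances at once.
    if all(shape == "follow_C_in_PD" for shape in wrapper_shapes.values()):
        return {"A": "Cooperate", "B": "Cooperate"}
    return {"A": "Defect", "B": "Defect"}

def P(agent_name, wrapper_shapes):
    # The composite system P(C): after signing, the big agent's remaining role is
    # to give C's recommendation influence over its behaviour in the committed shape.
    return C(wrapper_shapes)[agent_name]

# The commitments A = P(C) and B = P(C), with the shapes baked into C's definition.
shapes = {"A": "follow_C_in_PD", "B": "follow_C_in_PD"}
print(P("A", shapes), P("B", shapes))  # Cooperate Cooperate
```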