[Epistemic status: rough speculation, feels novel to me, though Wei Dai probably already posted about it 15 years ago.]
UDT is (roughly) defined as “follow whatever commitments a past version of yourself would have made if they’d thought about your situation”. But this means that any UDT agent is only as robust to adversarial attacks as their most vulnerable past self. Specifically, it creates an incentive for adversaries to show UDT agents situations that would trick their past selves into making unwise commitments. It also creates incentives for UDT agents themselves to hack their past selves, in order to artificially create commitments that “took effect” arbitrarily far back in their past.
In some sense, then, I think UDT might have a parallel structure to the overall alignment problem. You have dumber past agents who don’t understand most of what’s going on. You have smarter present agents who have trouble cooperating, because they know too much. The smarter agents may try to cooperate by punting to “Schelling point” dumb agents. (But this faces many of the standard problems of dumb agents making decisions—e.g. the commitments they make will probably be inconsistent or incoherent in various ways. And so in fact you need the smarter agents to interpret the dumb agents’ commitments, which then gets rid of a bunch of the value of punting it to those dumb agents in the first place.)
You also have the problem that the dumb agents will have situational awareness, and may recognize that their interests have diverged from the interests of the smart agents.
But this also suggests that a “solution” to UDT and a solution to alignment might have roughly the same type signature: a spotlighted structure for decision-making procedures that incorporate the interests of both dumb and smart agents. Even when they have disparate interests, the dumb agents would benefit from getting any decision-making power, and the smart agents would benefit from being able to use the dumb agents as Schelling points to cooperate around.
The smart agents could always refactor the dumb agents and construct new Schelling points if they wanted to, but that would cost them a lot of time and effort, because coordination is hard, and the existing coordination edifice has been built around these particular dumb agents. (Analogously, you could refactor out a bunch of childhood ideals and memories from your current self, but mostly you don’t want to, because they constitute the fabric from which your identity has been constructed.)
To be clear, this isn’t meant to be an argument that ASIs which don’t like us at all will keep us around. That seems unlikely either way. But it could be an argument that ASIs which kinda like us a little bit will keep us around—that it might not be incredibly unnatural for them to do so, because their whole cognitive structure will incorporate the opinions and values of dumber agents by default.
UDT is (roughly) defined as “follow whatever commitments a past version of yourself would have made if they’d thought about your situation”.
This seems substantially different from UDT, which does not really have or use a notion of “past version of yourself”. For example imagine a variant of Counterfactual Mugging in which there is no preexisting agent, and instead Omega creates an agent from scratch after flipping the coin and gives it the decision problem. UDT is fine with this but “follow whatever commitments a past version of yourself would have made if they’d thought about your situation” wouldn’t work.
I recall that I described “exceptionless decision theory” or XDT as “do what my creator would want me to do”, which seems closer to your idea. I don’t think I followed up the idea beyond this, maybe because I realized that humans aren’t running any formal decision theory, so “what my creator would want me to do” is ill defined. (Although one could say my interest in metaphilosophy is related to this, since what I would want an AI to do is to solve normative decision theory using correct philosophical reasoning, and then do what it recommends.)
Anyway, the upshot is that I think you’re exploring a decision theory approach that’s pretty distinct from UDT so it’s probably a good idea to call it something else. (However there may be something similar in the academic literature, or someone described something similar on LW that I’m not familiar with or forgot.)
This seems substantially different from UDT, which does not really have or use a notion of “past version of yourself”.
My terminology here was sloppy, apologies. When I say “past versions of yourself” I am also including (as Nesov phrases it below) “the idealized past agent (which doesn’t physically exist)”. E.g. in the Counterfactual Mugging case you describe, I am thinking about the precommitments that the hypothetical past version of yourself from before the coin was flipped would have made.
I find it a more intuitive way to think about UDT, though I realize it’s a somewhat different framing from yours. Do you still think this is substantially different?
UDT never got past the setting of unchanging preferences, so the present agent blindly defers to all decisions of the idealized past agent (which doesn’t physically exist). And if the past agent doesn’t try to wade in the murky waters of logical updatelessness, it’s not really dumber or more fallible to trickery, it can see everything the way a universal Turing machine or Solomonoff induction can “see everything”. Coordinating agents with different values was instead explored under the heading of Prisoner’s Dilemma. Though a synthesis between coordination of agents with different values and UDT (recognizing Schelling point contracts as a central construction) is long overdue.
if the past agent doesn’t try to wade in the murky waters of logical updatelessness, it’s not really dumber or more fallible to trickery, it can see everything the way a universal Turing machine or Solomonoff induction can “see everything”.
I actually think it might still be more fallible, for a couple of reasons.
Firstly, consider an agent which, at time T, respects all commitments it would have made at times up to T. Now if you’re trying to attack the agent at time T, you have T different versions of it that you can attack, and if any of them makes a dumb commitment then you win.
I guess you could account for this by just gradually increasing the threshold for making commitments over time, though.
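To put rough numbers on the “T versions to attack” point, here’s a toy calculation (my own illustrative model, not anything from the thread): if each past version independently has some chance of being tricked into an exploitable commitment, a fixed threshold means the attacker’s odds go to 1 as T grows, whereas a threshold that rises fast enough (so the per-version chance shrinks geometrically) keeps total exposure bounded.

```python
# Toy model: p_t is the chance that the version at time t would make an
# exploitable commitment if probed. The attacker wins if any version slips.

def attack_success_prob(per_version_probs):
    """Probability that at least one past version made a dumb commitment."""
    p_all_safe = 1.0
    for p in per_version_probs:
        p_all_safe *= 1.0 - p
    return 1.0 - p_all_safe

T = 100
fixed_threshold = [0.01] * T                            # same vulnerability every step
rising_threshold = [0.01 * 0.9 ** t for t in range(T)]  # vulnerability shrinks over time

print(attack_success_prob(fixed_threshold))    # ~0.63, and tends to 1 as T grows
print(attack_success_prob(rising_threshold))   # ~0.095, stays below 0.1 for any T
```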
Secondly: the further back you go, the more farsighted the past agent needs to be about the consequences of its commitments. If you have any compounding mistakes in the way it expects things to play out, then it’ll just get worse and worse the further back you defer.
Again, I guess you could account for this by having a higher threshold for making commitments which you expect to benefit you further down the line.
Then, re logical updatelessness: it feels like in the long term we need to unify logical + empirical updates, because they’re roughly the same type of thinking. Murky waters perhaps, but necessary ones.
Though a synthesis between coordination of agents with different values and UDT (recognizing Schelling point contracts as a central construction) is long overdue.
Yeah, so what could this look like? I think one important idea is that you don’t have to be deferring to your past self, it’s just that your past self is the clearest Schelling point. But it wouldn’t be crazy for me to, say, use BDT: Buddha Decision Theory, in which I obey all commitments that the Buddha would have made for me if he’d been told about my situation. The differences between me using UDT and BDT (tentatively) seem only quantitative to me, not qualitative. BDT makes it harder for me to cooperate with hypothetical copies of myself who hadn’t yet thought of BDT (because “Buddha” is less of a Schelling point amongst copies of myself than “past Richard”). It also makes me worse off than UDT in some cases, because sometimes the Buddha would make commitments in favor of his interests, not mine. But it also makes it a bit easier for me to cooperate with others, who might also converge to BDT.
At this point I’m starting to suspect that solving UDT 2 is not just alignment-complete, it’s also politics- and sociology-complete. The real question is whether we can isolate toy examples or problems in which these ideas can be formalized, rather than just having them remain vague “what if everyone got along” speculation.
UDT doesn’t do multistage commitments; it has a single all-powerful “past” version that looks into all possible futures before pronouncing a global policy that all of them would then follow. This policy is not a collection of commitments in any reasonable informal sense: it’s literally all details of the behavior of future versions of the agent in response to all possible observations. In the case of logical updatelessness, also in response to all possible observations of computational facts. (UDT for the idealized past version defines a single master model; future versions are just passively running inference from the contexts of their particular situations.)
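To make the “single global policy” framing concrete, here is a minimal sketch (my own toy code, using the standard Counterfactual Mugging payoffs mentioned upthread, nothing from the comment itself): the idealized past agent scores each whole observation-to-action map from the prior and hands the winner to all future versions.

```python
# Minimal sketch: UDT picks one global policy -- a map from observation to
# action -- by evaluating it from the prior, before any observation is made.
# Payoffs assumed: pay $100 on tails; Omega pays $10,000 on heads iff the
# policy would have paid on tails.
from itertools import product

observations = ["heads", "tails"]
actions = ["pay", "refuse"]

def expected_utility(policy):
    # Evaluated from before the coin flip. Only the tails action matters here;
    # the heads entry is kept just to show the "action for every observation" shape.
    u_heads = 10_000 if policy["tails"] == "pay" else 0
    u_tails = -100 if policy["tails"] == "pay" else 0
    return 0.5 * u_heads + 0.5 * u_tails

policies = [dict(zip(observations, acts))
            for acts in product(actions, repeat=len(observations))]
best = max(policies, key=expected_utility)
print(best, expected_utility(best))   # pays on tails; prior expected value 4950.0
```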
The convergent idea for acausal coordination between systems A and B seems to be constructing a shared subagent C whose instances exist as part of both A and B (after A and B successfully both construct the same C, not before), so that C can then act within them in the style of FDT, though really it’s mostly about C thinking of the effects of its behavior in terms of “I am an algorithm” rather than “I am a physical object”. (For UDT, the shared subagent C is the idealized common past version of its different possible future versions A and B. This assumes that A and B already have a lot in common, so maybe C is instead Buddha.)
The bulk of the blind alleys seems to be about allowing subagents various superpowers, instead of focusing on managing the fallout of making them small and bounded (but possibly more plentiful). I think this is where investigations into logical updatelessness go wrong. It does need solving, but not by considering some fact unknown globally, or even at certain logical times. Instead a fact can remain unknown to some small subagent, and can be observed by it at some point, or computed by another subagent. Values are also knowledge, so sufficiently small subagents shouldn’t even know the full values of the larger system by default, and should be prepared to learn more about them. This is a consideration that doesn’t even depend on there initially being multiple big agents with different values.
Another point is that coordination doesn’t necessarily need construction of exactly the same shared subagent, or it doesn’t need to be “exactly the same” in a straightforward sense, which the results on coordination in PD illustrate. The role of subagents in this case is that A can create a subagent CA, while B creates a subagent CB. And even where A and B remain intractable for each other, CA and CB can be much smaller and by construction prepared to coordinate with each other, from within A and B. (It seems natural for the big agents to treat such subagents as something like their copies of an assurance contract, which is signed through commitment to give them influence over the big agent’s thinking or behavior. And letting contracts be agents in their own right gives a lot of flexibility in coordination they can arrange.)
Okay, so trying to combine the Prisoner’s Dilemma and UDT, we get: A and B are in a prisoner’s dilemma. Suppose they have a list of N agents (which include, say, A’s past self, B’s past self, the Buddha, etc.), and they each must commit to following one of those agents’ instructions. Each of them estimates “conditional on me committing to listen to agent K, here’s a distribution over which agent they’d commit to listen to”, and then maximizes expected value based on that.
Okay, but why isn’t this exactly the same as them just thinking to themselves “conditional on me taking action K, here’s the distribution over their actions” for each of N actions they could take, and then maximizing expected value? It feels like the difference is that it’s really hard to actually reason about the correlations between my low-level actions and your low-level actions, whereas it might be easier to reason about the correlations between my high-level commitments and your high-level commitments.
I.e. the role of the Buddha in this situation is just to make the acausal coordination here much easier.
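Here’s a toy version of that calculation (every number, and the table of what each reference agent would instruct, is an assumption of mine purely for illustration):

```python
# Each player commits to deferring to one reference agent; the choice is scored
# using beliefs about how the other player's commitment correlates with mine.

REFERENCE_AGENTS = ["my_past_self", "your_past_self", "buddha"]

# Standard PD payoffs for the row player: (my_action, their_action) -> my utility.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

# Assumed instructions: only the shared reference point tells both sides to cooperate.
INSTRUCTION = {
    "my_past_self":   {"me": "D", "them": "D"},
    "your_past_self": {"me": "D", "them": "D"},
    "buddha":         {"me": "C", "them": "C"},
}

# My estimate of P(their commitment | my commitment): committing to the shared
# Schelling point is assumed to correlate strongly with them doing the same.
CORRELATION = {
    "my_past_self":   {"my_past_self": 0.1, "your_past_self": 0.6, "buddha": 0.3},
    "your_past_self": {"my_past_self": 0.6, "your_past_self": 0.1, "buddha": 0.3},
    "buddha":         {"my_past_self": 0.1, "your_past_self": 0.1, "buddha": 0.8},
}

def expected_value(my_commitment):
    my_action = INSTRUCTION[my_commitment]["me"]
    return sum(prob * PAYOFF[(my_action, INSTRUCTION[theirs]["them"])]
               for theirs, prob in CORRELATION[my_commitment].items())

for k in REFERENCE_AGENTS:
    print(k, expected_value(k))   # buddha wins (2.4 vs 2.2) given these correlations
```

The interesting knob is CORRELATION: weaken the assumed correlation on the shared reference point and “buddha” stops being worth committing to, which is the sense in which the Buddha is doing nothing except making the acausal correlation easier to reason about.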
Okay, but why isn’t this exactly the same as them just thinking to themselves “conditional on me taking action K, here’s the distribution over their actions” for each of N actions they could take, and then maximizing expected value?
The main trick with PD is that instead of an agent only having two possible actions C and D, we consider many programs the agent might self-modify into (commit to becoming) that each might in the end compute C or D. This effectively changes the action space: there are now many more possible actions. And these programs/actions can be given access (like quines, by their own construction) to the initial source code of all the agents, and allowed to reason about them. But then programs have logical uncertainty about how they in the end behave, so the things you’d be enumerating don’t immediately cash out in expected values. And these programs can decide to cause different expected values depending on what you’ll do with their behavior, and can anticipate how you reason about them by reasoning about you in turn. It’s hard to find clear arguments for why any particular desirable thing could happen as a result of this setup.
UDT is notable for being one way of making this work. The “open source game theory” of PD (through Löb’s theorem, modal fixpoints, Payor’s lemma) pinpoints some cases where it’s possible to say that we get cooperation in PD. But in general it’s proven difficult to say anything both meaningful and flexible about this seemingly in-broad-strokes-inevitable setup, in particular for agents with different values that are doing more general things than playing PD.
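For a concrete handle on this family of constructions: the crudest member is the “cooperate exactly with copies of my own source” program (far weaker than the Löbian and modal-fixpoint results, but enough to show how enlarging the action space to programs changes things). A sketch, assuming the usual setup where each program is handed both source codes:

```python
import inspect

def clique_bot(my_source, their_source):
    # Cooperate exactly when the opponent runs this same program, else defect.
    return "C" if their_source == my_source else "D"

def defect_bot(my_source, their_source):
    return "D"

def play(prog_a, prog_b):
    # Each program gets both source codes, as in open-source game theory.
    src_a, src_b = inspect.getsource(prog_a), inspect.getsource(prog_b)
    return prog_a(src_a, src_b), prog_b(src_b, src_a)

print(play(clique_bot, clique_bot))  # ('C', 'C'): mutual cooperation among copies
print(play(clique_bot, defect_bot))  # ('D', 'D'): no exploitation by a defector
```

The Löbian constructions relax the brittle “exact same source” requirement to “provably cooperates with me”, which is where the modal fixpoints come in.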
(The following relies a little bit on motivation given in the other comment.)
When both A and B consider listening to a shared subagent C, subagent C is itself considering what it should be doing, depending on what A and B do with C’s behavior. So for example with A there are two stages of computation to consider: first it was just A, not yet having decided to sign the contract; then it became a composite system P(C), where P is A’s policy for giving influence to C’s behavior (possibly P and A include a larger part of the world where the first agent exists, not just the agent itself). The commitment of A is to the truth of the equality A=P(C), which gives C influence over the computational consequences of A in the particular shape P. The trick with the logical time of this process is that C should be able to know (something about) P updatelessly, without being shown observations of what it is, so that the instance of C within B would also know of P and be able to take it into account in choosing its joint policy that acts both through A and B. (Of course, the same is happening within B.)
This sketch frames decision making without directly appealing to consequentialism. Here, A controls B through the incentives P it creates for C (a particular way in which C gets to project influence from A’s place in the world), where C also has influence over B. So A doesn’t seek to manipulate B directly by considering the consequences for B’s behavior of various ways that A might behave.
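A very rough sketch of how the A=P(C) shape could cash out in a prisoner’s dilemma (the rendering below is my own toy interpretation, not a formalization of the above): each big agent wraps itself in a fixed policy P that follows C’s recommendation with some probability and otherwise plays its own default, and C picks one joint recommendation knowing the shape of P on both sides but not which instance it happens to be.

```python
# Toy rendering: A = P(C) and B = P(C), where P follows the shared subagent's
# recommendation with probability INFLUENCE and otherwise defects.

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

INFLUENCE = 0.9   # the "shape" of P that C is assumed to know updatelessly
DEFAULT = "D"     # each side's fallback move when it ignores C

def expected_payoffs(rec_a, rec_b):
    """Expected payoffs when both sides run P(C) with these recommendations."""
    ev_a = ev_b = 0.0
    for a_follows in (True, False):
        for b_follows in (True, False):
            prob = ((INFLUENCE if a_follows else 1 - INFLUENCE)
                    * (INFLUENCE if b_follows else 1 - INFLUENCE))
            move_a = rec_a if a_follows else DEFAULT
            move_b = rec_b if b_follows else DEFAULT
            pa, pb = PAYOFF[(move_a, move_b)]
            ev_a += prob * pa
            ev_b += prob * pb
    return ev_a, ev_b

# C chooses the joint recommendation that is best for whichever side is worse off,
# since it acts through both A and B and can't favor the instance it happens to be in.
joint_options = [("C", "C"), ("C", "D"), ("D", "C"), ("D", "D")]
best = max(joint_options, key=lambda rec: min(expected_payoffs(*rec)))
print(best, expected_payoffs(*best))   # ('C', 'C'), roughly 2.89 expected for each side
```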