Great lecture and article. This cleared up a lot of things for me. One thing I don’t understand: you describe how an adversary can “go back in time” by simulating an earlier stage of an agent which started as CDT and self-modified to an improved decision theory, and so force the agent not to self-modify in that way.
You said that if the CDT agent would modify itself to be unblackmailable, the adversary could simulate an earlier version of that agent (the CDT version) and force it not to modify itself to be unblackmailable.
This reminds me of another case: as has been said elsewhere (Yudkowsky, Timeless Decision Theory, pages 18–19), if an adversary acts according to the internal algorithm that an agent uses, then that agent is stuck. This is “cheating” in the sense that it is outside the bounds of MIRI’s current work on reflective decision theory.
I understand that simulation of an adversary, or of variants of the adversary, is a perfectly ordinary action, which we humans do (to a limited extent) in dealing with other humans. Yet I am a bit confused: it seems to me that simulating and going back in time in this way to keep the agent from self-modifying is somehow “cheating” too, i.e., stretching the parameters of MIRI’s investigations of how agents should make decisions. Could you clear up what you mean by this sort of counterfactual, backwards-in-time extortion-by-simulation?
The adversary simulates the AI from its original source up until the point where it is blackmailed. (In practice, the adversary doesn’t need to actually simulate this out; it can just check what decision theory the agent uses, but it’s a better intuition pump if you imagine it simulating the AI.)
The trouble with this scenario is not that the adversary is somehow “forcing” the AI not to self-modify; rather, the trouble is that when CDT considers self-modifying so that the agent succeeds on these sorts of problems, it concludes that it’s already too late (even though it isn’t). In other words, this is a flaw that CDT reports is not a flaw.
There are many flaws that CDT can be expected to fix because CDT recognizes them as flaws (e.g. when CDT self-modifies to stop using CDT inside new mirror token trades). But if a CDT agent finds that it is already in a mirror token trade, then CDT will say that it should not self-modify to give its token away because it cannot guarantee that its perfect copy would do the same thing. This is a flaw that CDT does not report as a flaw, which is why CDT fails.
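To see the asymmetry concretely, here’s a toy payoff model of the mirror token trade (the numbers are made up for this sketch: a token is worth 1 to its owner and 3 to the other copy):

```python
# Toy payoff model of a mirror token trade (illustrative numbers only).
# Each of two identical agents holds a token worth 1 to itself and 3 to the
# other, and must decide whether to keep it or give it away.

def my_payoff(my_action, copy_action):
    keep_value = 1 if my_action == "keep" else 0
    gift_value = 3 if copy_action == "give" else 0
    return keep_value + gift_value

# CDT holds the copy's action fixed: whatever the copy does, keeping scores higher.
for copy_action in ("keep", "give"):
    print(copy_action, my_payoff("keep", copy_action), my_payoff("give", copy_action))
# -> copy keeps: 1 vs 0;  copy gives: 4 vs 3

# An agent that treats the copy as running its own algorithm compares only the
# diagonal outcomes, and gives its token away:
print(my_payoff("keep", "keep"), my_payoff("give", "give"))  # -> 1 vs 3
```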
The blackmail scenario essentially generates a similar problem in self-modifying agents. A CDT agent could self-modify to patch away its blackmailability, but CDT reports that such patches have no upside (it incorrectly thinks that a simulation spawned from its original source is logically independent because it is causally independent, and therefore that its choice to apply the patch does not affect the simulation’s choice) and a potential downside (if it patches but the simulation doesn’t, then the bomb will go off), and so it doesn’t correct this flaw.
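The same pattern shows up if you put rough numbers on the blackmail case (the payoffs and the `outcome` helper below are purely illustrative, not from the talk):

```python
# Toy model of the blackmail problem (illustrative payoffs only).
# The adversary simulates the agent from its original source, and only blackmails
# the real agent if the simulated agent stays blackmailable.

def outcome(real_patches, sim_patches):
    if sim_patches:
        return 0                          # simulation is unblackmailable: no blackmail attempt
    # simulation stays blackmailable, so the real agent gets blackmailed
    return -100 if real_patches else -10  # patched agent refuses and the bomb goes off; unpatched agent pays

# CDT treats the simulation's choice as fixed independently of its own choice:
for sim_patches in (True, False):
    print(sim_patches, outcome(True, sim_patches), outcome(False, sim_patches))
# Patching never helps and can cost an extra 90, so CDT leaves the flaw in place.

# But the simulation runs the same source code, so the two choices move together:
print(outcome(True, True), outcome(False, False))  # patch -> 0, don't patch -> -10
```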
This is “cheating” in the sense that it is outside the bounds of MIRI’s current work on reflective decision theory.
Eh, not really. The question of extensional decision problems is a separate issue, and the examination of non-extensional DPs is not entirely outside our scope. It’s a complex topic, and we’ll hopefully have a writeup about unfair extensional decision problems sometime in the next few months. (This is one of the places where we have a number of small results and not enough time to write them up.)
Thanks. I see that you brought up that “simulate-agent’s-former-self” attack as another example of CDT’s inability to understand certain causal links to its own decision processes.