There’s one scenario described in this paper on which this decision theory gives in to blackmail:
The Retro Blackmail problem. There is a wealthy intelligent system and an honest AI researcher with access to the agent’s original source code. The researcher may deploy a virus that will cause $150 million each in damages to both the AI system and the researcher, and which may only be deactivated if the agent pays the researcher $100 million. The researcher is risk-averse and only deploys the virus upon becoming confident that the agent will pay up. The agent knows the situation and has an opportunity to self-modify after the researcher acquires its original source code but before the researcher decides whether or not to deploy the virus. (The researcher knows this, and has to factor this into their prediction.)
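The payoff structure of this problem can be sketched in a few lines of Python. This is only a toy model under two assumptions that the problem statement does not spell out: the researcher predicts the agent's policy perfectly, and paying the ransom averts the damage entirely. The function names are made up for illustration.

```python
# Toy model of the Retro Blackmail payoffs (figures in millions of dollars).
VIRUS_DAMAGE = 150  # damage to each party if the virus stays active
RANSOM = 100        # payment that deactivates the virus


def researcher_deploys(predicted_to_pay: bool) -> bool:
    """The risk-averse researcher deploys only when confident of being paid."""
    return predicted_to_pay


def agent_outcome(pays_when_blackmailed: bool) -> int:
    """Agent's change in wealth, ASSUMING the researcher predicts its policy perfectly."""
    if not researcher_deploys(predicted_to_pay=pays_when_blackmailed):
        return 0  # no virus is deployed, so nothing is lost
    # ASSUMPTION: paying the ransom prevents the damage altogether.
    return -RANSOM if pays_when_blackmailed else -VIRUS_DAMAGE


for policy in (True, False):
    print(f"pays_when_blackmailed={policy}: {agent_outcome(policy)}")
```

Under these assumptions the paying policy loses 100 and the refusing policy loses nothing, which is why being the kind of agent that refuses is the better policy here.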
The paper you link to shows that a pure CDT agent would not self-modify into an NDT agent, because a CDT agent wouldn’t really have the concept of “logical” connections between agents. The understanding that both logical and causal connections are real things is what would compel an agent to self-modify to NDT.
However, if there were some path by which an agent started out as pure CDT and then became NDT, the NDT agent would still choose correctly on Retro Blackmail even if the researcher had its original CDT source code. The NDT agent’s decision procedure explicitly tells it to behave as if it had precommitted before the researcher got its source code.
So even if the CDT --> NDT transition is impossible, since I don’t think any of us here are pure CDT agents, we can still adopt NDT and profit.
In the retro blackmail, CDT does not precommit to refusing even if it’s given the opportunity to do so before the researcher gets its source code. This is because CDT believes that the researcher is predicting according to a causally disconnected copy of itself, and therefore it does not believe that its actions can affect the copy. (That is, if CDT knows it is going to be retro blackmailed, and considers this before the researcher gets access to its source code, then it still doesn’t precommit.) The failure here is that CDT only reasons according to what it can causally affect, but in the real world decision algorithms also need to worry about what they can logically affect. (For example, two agents created while spacelike separated should be able to cooperate on a Prisoner’s Dilemma.)
Your attempted patch (pretend you made your precommitments earlier in time) only works when the neglected logical relationships stem from a causal event earlier in time. This is often but not always the case. For instance, if CDT thinks that its clone was causally copied from its own source code, then you can get the right answer by acting as CDT would have precommitted to act before the copying occurred. But two agents written in spacelike separation from each other might have decision algorithms that are logically correlated, despite there being no causal connection no matter how far back you go.
In order to get the right precommitments in those sorts of scenarios, you need to formalize some sort of notion of “things the decision algorithm’s choice logically affects,” and formalizing “logical effects” is basically the part of the problem that remains difficult :-)
In the retro blackmail, CDT does not precommit to refusing even if it’s given the opportunity to do so before the researcher gets its source code.
To clarify: you mean that CDT doesn’t precommit at time t=1 even if the researcher hasn’t gotten the code representing CDT’s state at time t=0 yet. The CDT doesn’t think precommitting will help because it knows the code the researcher will get will be from before its precommitment. I agree that this is true, and a CDT won’t want to precommit.
I guess my definition even after my clarification is ambiguous, as it’s not clear that what a CDT wishes it could have precommitted to at an earlier time should take precedence over what it would wish to precommit to at a later time. NDT seems to be best when you always prefer the earliest precommitment. The intuition is something like:
You should always make the decision that a CDT-agent would have wished he had precommitted to, if he had magically had the opportunity to costlessly precommit to a decision at a time before the beginning of the universe.
This would allow you to act as if you had precommitted to things before you existed.
But two agents written in spacelike separation from each other might have decision algorithms that are logically correlated, despite there being no causal connection no matter how far back you go.
Can you give an example of this? Similar to the calculator example in the TDT paper, I’m imagining some scenario where one AI takes instructions for creating you to another galaxy, and another AI keeps a copy of the instructions for creating you on Earth. At some point, both AIs read the instructions and create identical beings, one of which is you. The AI that created you says that you’ll be playing a prisoner’s dilemma game with the other entity created in the same way, and asks for your decision.
In some sense, there is only a logical connection between these two entities, because they’ve only existed for a short time and are too far away to have a causal effect on each other. However, they are very causally related, and I could probably make an argument that they are replicas of the same person.
Do you have an example of a logical connection that has no causal connection at all (or as minimal a causal connection as possible)?
The universe begins, and then almost immediately, two different alien species make AIs while spacelike separated. The AIs start optimizing their light cones and meet in the middle, and must play a Prisoner’s Dilemma.
There is absolutely no causal relationship between them before the PD, so it doesn’t matter what precommitments they would have made at the beginning of time :-)
To be clear, this sort of thought experiment is meant to demonstrate why your NDT is not optimal; it’s not meant to be a feasible example. The reason we’re trying to formalize “logical effect” is not specifically so that our AIs can cooperate with independently developed alien AIs or something (although that would be a fine perk). Rather, this extreme example is intended to demonstrate why idealized counterfactual reasoning needs to take logical effects into account. Other thought experiments can be used to show that reasoning about logical effects matters in more realistic scenarios, but first it’s important to realize that they matter at all :-)
I think defect is the right answer in your AI problem and therefore that NDT gets it right, but I’m aware lots of LWers think otherwise. I haven’t researched this enough to want to argue it, but is there a discussion you’d recommend I read that spells out the reasoning? Otherwise I’ll just look through LW posts on prisoner’s dilemmas.
Secondly, I’d like to try to somehow incorporate logical effects into NDT. I agree they’re important. Any suggestions for where I could find lots of examples of decision problems where logical effects matter, to help me think about the general case?
I think defect is the right answer in your AI problem and therefore that NDT gets it right
That’s surprising to me. Imagine that the situation is “prisoner’s dilemma with shared source code”, and that the AIs inspect each other’s source code and verify that (by some logical but non-causal miracle) they have exactly identical source code. Do you still think they do better to defect? I wouldn’t want to build an agent that defects in that situation :-p
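A minimal sketch of why identical source code changes the picture: if both players run the same deterministic policy, their moves always match, so the only reachable outcomes are mutual cooperation and mutual defection. The payoff numbers below are assumed, standard-looking Prisoner's Dilemma values chosen for illustration, not figures from this thread.

```python
# ASSUMED illustrative payoffs: (my_move, their_move) -> my payoff.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}


def play_against_exact_copy(policy) -> int:
    """Both players run the same deterministic policy, so their moves always match."""
    mine = policy()
    theirs = policy()  # the opponent is an exact copy running the same code
    return PAYOFF[(mine, theirs)]


def always_cooperate() -> str:
    return "C"


def always_defect() -> str:
    return "D"


print("cooperate vs copy:", play_against_exact_copy(always_cooperate))
print("defect vs copy:", play_against_exact_copy(always_defect))
```

The tempting ("D", "C") entry is never reached against an exact copy, so the cooperating policy ends up with the higher payoff under these assumed numbers.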
An AI should certainly cooperate if it discovered that by chance its opposing AI had identical source code.
I read your paper and the two posts in your short sequence. Thanks for the links. I still think it’s very unlikely that one of the AIs in your original hypothetical (when they don’t examine each other’s source code) would do better by defecting.
I accept that if an opposing AI had a model of you that was just decent but not great, then there is some amount of logical connection there. What I haven’t seen is any argument about the shape of the graph of logical connection strength vs. similarity of entities. I hypothesize that for any two humans who exist today, if you put them in a one-shot PD, the logical connection is negligible.
Has anyone written specifically on how exactly to give weights to logical connections between similar but non-identical entities?
Has anyone written specifically on how exactly to give weights to logical connections between similar but non-identical entities?
Nope! That’s the open part of the problem :-) We don’t know how to build a decision network with logical nodes, and we don’t know how to propagate a “logical update” between nodes. (That is, we don’t have a good formalism of how changing one algorithm logically affects a related but non-identical algorithm.)
If we had the latter thing, we wouldn’t even need the “logical decision network”, because we could just ask “if I change the agent, how does that logically affect the universe?” (as both are algorithms); this idea is the basis of proof-based UDT (which tries to answer the problem by searching for proofs under the assumption “Agent()=a” for various actions). Proof-based UDT has lots of problems of its own, though, and thinking about logical updates in logical graphs is a fine angle of approach.
Thanks. I had one question about your Toward Idealized Decision Theory paper.
I can’t say I fully understand UDT, but the ‘updateless’ part does seem very similar to the “act as if you had precommitted to any action that you’d have wanted to precommit to” core idea of NDT. It’s not clear to me that the super powerful UDT would make the wrong decision in the game where two players pick numbers between 0-10 and get payouts based on their pick and the total sum.
Wouldn’t the UDT reason as follows? “If my algorithm were such that I wouldn’t just pick 1 when the human player forced me into it by picking 9 (for instance maybe I always pick 5 in this game), then I may still have a reputation as a powerful predictor but it’s much more likely that I’d also have a reputation as an entity that can’t be bullied like this, so the human would be less likely to pick 9. That state of the world is better for me, so I shouldn’t be the type of agent that makes the greedy choice to pick 1 when I predict the human will pick 9.”
The argument in your paper seems to rely on the human assuming the UDT will reason like a CDT once it knows the human will pick 9.
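The intuition in the comment above can be illustrated with a toy version of the number game. The exact payout rule is not stated in this thread, so the rule below is an assumption: each player receives their own pick if the two picks sum to at most 10, and nothing otherwise. The policy names are hypothetical.

```python
def payoff(mine: int, theirs: int) -> int:
    """ASSUMED rule: you receive your own pick if the picks sum to at most 10."""
    return mine if mine + theirs <= 10 else 0


def greedy_reply(human_pick: int) -> int:
    """Pick whatever maximizes the agent's payoff against a known human pick."""
    return max(range(11), key=lambda mine: payoff(mine, human_pick))


def always_five(human_pick: int) -> int:
    """A policy that cannot be bullied: it ignores the human's pick."""
    return 5


for name, policy in (("greedy", greedy_reply), ("always_five", always_five)):
    for human_pick in (9, 5):
        agent_pick = policy(human_pick)
        print(
            f"{name}: human picks {human_pick}, agent picks {agent_pick}, "
            f"human gets {payoff(human_pick, agent_pick)}, "
            f"agent gets {payoff(agent_pick, human_pick)}"
        )
```

Under the assumed rule, a human who picks 9 collects 9 from the greedy policy but 0 from the always-five policy, so a human who expects the latter has no reason to pick 9. Whether a formal decision theory actually arrives at the second policy is the separate question the thread goes on to discuss.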
the ‘updateless’ part does seem very similar to the “act as if you had precommitted to any action that you’d have wanted to precommit to” core idea of NDT
Yep, that’s a common intuition pump people use in order to understand the “updateless” part of UDT.
It’s not clear to me that the super powerful UDT would make the wrong decision in the game where two players pick numbers between 0-10
A proof-based UDT agent would—this follows from the definition of proof-based UDT. Intuitively, we surely want a decision theory that reasons as you said, but the question is, can you write down a decision algorithm that actually reasons like that?
Most people agree with you on the philosophy of how an idealized decision theory should act, but the hard part is formalizing a decision theory that actually does the right things. The difficult part isn’t in the philosophy, the difficult part is turning the philosophy into math :-)
There’s one scenario described in this paper on which this decision theory gives in to blackmail:
I believe that NDT gets this problem right.
The paper that jessicat linked in the parent post is a decent introduction to the notion of logical counterfactuals. See also the “Idealized Decision Theory” section of this annotated bibliography, and perhaps also this short sequence I wrote a while back.