The mistake that a causal decision theorist makes isn’t in two-boxing. It’s in being a causal decision theorist in the first place. In Newcomb-like games, the assumption that there is a highly accurate predictor of you makes it clear that you are, well, predictable and not really making free choices. You’re just executing whatever source-code you’re running. If this predictor thinks that you will two-box, your fate is sealed, and the best you can then do is two-box. The key is just to be running the right source-code.
So my concern with FDT (read: not a criticism, just something I haven’t been convinced of yet) is the assumption that there is some a priori “right” source-code that we can choose in advance before we go into a situation. This is because, while we may sometimes benefit from having source-code A that leads predictor Alpha to a particular conclusion in one situation, we may also sometimes prefer source-code B that leads predictor Beta to a different conclusion. If we don’t know how likely it is that we’ll run into Alpha relative to Beta, we have no idea what source-code we should adopt (note that I say adopt because, obviously, we can’t control the source-code we were created with). My guess is that, somewhere out there, there’s some kind of No-Free-Lunch theorem that shows this for embedded agents.
Moreover (and this relates to your Mind-crime example), I think that in situations with sometimes-adversarial predictors who make predictions based on non-mere correlations with your source-code, there is pressure for agents to make these correlations as weak as possible.
For instance, consider the following example of how decision theory can get messy. It’s basically a construction where Parfit’s Hitchhiker situations are sometimes adversarially engineered, amounting to a Counterfactual Mugging on the agent’s decision theory.
Predictor Alpha
Some number of Predictor Alphas exist. Predictor Alpha’s goal is to put you in a Parfit’s Hitchhiker situation and request that, once you’re safe, you pay a yearly 10% tithe to Predictor Alpha for saving you.
If you’re an agent running CDT, it’s impossible for Alpha to do this because you cannot credibly commit to paying the tithe once you’re safe. This is true regardless of whether you know that Alpha is adversarial. As a result, you never get put in these situations.
If you’re an agent running FDT and you don’t know that Alpha is adversarial, Alpha can do this. If you do know, though, you do fine, because FDT will just pre-commit to performing CDT if it thinks a Parfit’s Hitchhiker situation has been caused adversarially. But this just incentivizes Alpha to make it very hard to tell whether it’s adversarial (and, compared to making accurate predictions about behavior in Hitchhiker problems, this seems relatively easy). The easiest FDT solution in this case is to just make Predictor Alpha think you’re running CDT, or to actually run CDT.
Note, though, that if FDT agents are incentivized to eliminate non-mere correlations between “implementing FDT” and “what Alpha predicts,” this applies to any predictor working off the same information as Alpha. This has consequences.
Predictor Beta
Some number of Predictor Betas exist. Predictor Beta makes money by finding people who get stuck in Parfit’s Hitchhiker situations and helping them out in exchange for a 10% yearly tithe. Given that Predictor Alpha also exists...
If you’re an agent running CDT, who knows what happens? Predictor Beta is, of course, making predictions based on non-mere correlations with your source-code, but, because of the Predictor Alphas, FDT and CDT agents look really similar! Maybe Predictor Beta figures you’re just a secret FDT agent, and you get away with making an unguaranteed precommitment.
Ditto with agents running FDT. Maybe you fool Predictor Beta just as much as you fool Predictor Alpha and Beta assumes you’re running CDT. Maybe not.
Predictor Gamma
We might also introduce Predictor Gamma, who is altruistic and helps all CDT agents out of the goodness of its heart but tries to get some payment from FDT agents, since they might be extortable. However, because Gamma knows that FDT agents will pretend to be CDT or refuse to precommit in order to get the altruistic benefit (and Gamma believes beggars can’t be choosers), Gamma also resolves to just let FDT agents who are unwilling to precommit die.
Now, the FDT agent has two options:
1. Prevent Gamma from being accurate by eliminating Gamma’s ability to identify the genuine capacity to make guaranteed precommitments. This comes at the cost of eliminating non-mere correlations between FDT source-code and other predictions that others might make about guaranteed precommitment. This throws away a lot of the benefit of FDT, though.
2. Update to CDT.
What if we assume the predictors are really smart and we can’t eliminate non-mere correlations?
In this case, FDT agents have to decide whether to continue being FDT (and risk meeting Predictor Alpha) or actually, truly, update to CDT (and risk being abandoned by Predictor Beta). Personally, I would lean towards the latter because, if accurate predictors exist, it seems more likely that I would be adversarially placed in a Parfit’s Hitchhiker situation than that I’d accidentally find myself in one.
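To make the prior-dependence concrete, here is a toy expected-value sketch. All payoffs and the probability sweep are made-up numbers for illustration only, not anything argued for above; the point is just that which source-code you would want to be running flips entirely on how likely the adversarial predictor is.

```python
# Toy payoffs (hypothetical) for meeting each predictor type.
ALPHA_FDT = -10    # extorted: rescued from an engineered situation, pays tithe
ALPHA_CDT = 0      # Alpha can't extort a CDT agent, so leaves it alone
BETA_FDT = 90      # genuinely stranded, rescued by Beta, pays tithe
BETA_CDT = -100    # genuinely stranded, can't credibly commit, left behind


def expected_utility(p_alpha: float, dt: str) -> float:
    """Expected payoff of running decision theory `dt`, given
    p_alpha = P(the predictor you meet is adversarial Alpha)."""
    if dt == "FDT":
        return p_alpha * ALPHA_FDT + (1 - p_alpha) * BETA_FDT
    return p_alpha * ALPHA_CDT + (1 - p_alpha) * BETA_CDT


# Sweep the prior: the "right" source-code depends entirely on p_alpha.
for p in [0.0, 0.5, 0.9, 0.95, 1.0]:
    fdt, cdt = expected_utility(p, "FDT"), expected_utility(p, "CDT")
    winner = "FDT" if fdt > cdt else ("CDT" if cdt > fdt else "tie")
    print(f"P(Alpha)={p:.2f}  FDT={fdt:7.1f}  CDT={cdt:7.1f}  better: {winner}")
```

With these particular numbers FDT wins unless Alpha is very likely (the crossover sits at P(Alpha) = 0.95), but shifting any payoff moves that threshold, which is exactly the “hard question of priors.”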
But maybe FDT still is better?!
So the above thought experiments ignore the possibility that Predictors Alpha or Beta may offer CDT agents the ability to make binding precommitments à la FDT. Then Alpha could adversarially put a CDT agent in a Parfit’s Hitchhiker situation and get the CDT agent to update to making a binding precommitment. In contrast, an FDT agent could avoid this completely by committing their source-code to a single adjustment: never permit situations that force binding pre-commitments. But this obviously has a bunch of drawbacks in a bunch of situations too.
I really like this analysis. Luckily, with the right framework, I think that these questions, though highly difficult, are technical but no longer philosophical. This seems like a hard question of priors but not a hard question of framework. I speculate that in practice, an agent could be designed to adaptively and non-permanently modify its actions and source code to slick past many situations, fooling predictors exploiting non-mere correlations when helpful.
But on the other hand, maybe introducing a set of situations like this to certain agents would be a way to induce a great amount of uncertainty and mess them up.
For a few of my thoughts on these situations, I also wrote this post on Medium: https://medium.com/@thestephencasper/decision-theory-ii-going-meta-5bc9970bd2b9. However, I’m sure you’ve thought more in depth about these issues than this post does. I think that the example of mind-policing predictors is sufficient to show that there is no free lunch in decision theory. For every decision theory, there is a mind-police predictor that will destroy it.
> I really like this analysis. Luckily, with the right framework, I think that these questions, though highly difficult, are technical but no longer philosophical. This seems like a hard question of priors but not a hard question of framework.
Yes, I agree with this. It turns the question “What is the best source-code for making decisions when the situations you are placed in depend on that source-code?” into a question more like “Okay, since there are a bunch of decisions that are contingent on source-code, which ones do we expect to actually happen, and with what frequency?” And this is something we can, in principle, reason about (i.e., we can speculate on what incentives we would expect predictors to have and try to estimate the probabilities of different situations happening).
> I speculate that in practice, an agent could be designed to adaptively and non-permanently modify its actions and source code to slick past many situations, fooling predictors exploiting non-mere correlations when helpful.
I’m skeptical of this. Non-mere correlations are consequences of an agent’s source-code producing particular behaviors that a predictor can use to gain insight into the source-code itself. If an agent adaptively and non-permanently modifies its source-code, this (from the perspective of a predictor who suspects as much) de-correlates its current source-code from the non-mere correlations of its past behavior, essentially destroying the meaning of non-mere correlations to the extent that the predictor is suspicious.
Maybe there’s a clever way to get around this. But, to illustrate the problem, consider this claim from your blog:
> For example, a dynamic user of CDT could avoid being destroyed by a mind reader with zero tolerance for CDT by modifying its hardware to implement EDT instead.
This is true for a mind reader that is directly looking at source code but is untrue for predictions relying on non-mere correlations. To such a predictor, a dynamic user of CDT who has just updated to EDT would have a history of CDT behavior and non-mere correlations associated mostly with CDT. Now two things might happen:
1. The predictor classifies the agent as CDT and kills it.
2. The predictor classifies the agent as a dynamic user of CDT, predicts that it has updated to EDT, and does not kill it.
Option 1 isn’t great because the agent gets killed. Option 2 also isn’t great because it implies predictors have access to non-mere correlations strong enough to indicate that a given agent can dynamically update. This is risky because now any predictor that leverages these non-mere correlations to conclude that another agent is dynamic can potentially benefit from adversarially pushing that agent to modify to a more exploitable source-code. For example, a predictor might want to make the agent believe that it kills all agents it predicts aren’t EDT but, in actuality, doesn’t care about that and just subjects all the new EDT agents to XOR blackmail.
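As a toy illustration of why option 2 is demanding (the setup and all numbers here are hypothetical): if a dynamic CDT-to-EDT agent’s behavioral history is literally indistinguishable from a static CDT agent’s, Bayes’ rule leaves the predictor’s prior untouched, so classifying the agent as “dynamic” has to come from side information rather than from the behavioral record.

```python
def posterior_dynamic(prior_dynamic: float, n_obs: int) -> float:
    """Posterior P(agent is dynamic) after n_obs CDT-consistent observations.

    Both hypotheses (static CDT, and dynamic CDT that quietly updated to EDT)
    assign the same likelihood to CDT-looking behavior, so the evidence
    cancels and the prior survives unchanged.
    """
    p_obs_given_dynamic = 1.0 ** n_obs  # dynamic agent's history looks like CDT
    p_obs_given_static = 1.0 ** n_obs   # so does the static agent's
    numerator = prior_dynamic * p_obs_given_dynamic
    denominator = numerator + (1.0 - prior_dynamic) * p_obs_given_static
    return numerator / denominator


# No amount of CDT-consistent behavior moves the predictor off its prior.
print(posterior_dynamic(0.1, 1000))  # still 0.1
```

So any predictor that nonetheless picks option 2 must be relying on correlations beyond the behavioral history itself, which is exactly what makes it exploitable in the way described above.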
There are also other practical concerns. For instance, an agent capable of self-modifying its source-code is in principle capable of guaranteeing a precommitment by modifying part of its code to catastrophically destroy the agent if it either doesn’t follow through on the precommitment or appears to be attacking that piece of code. This is similar to the issue I mention in my “But maybe FDT still is better?!” section. It might be advantageous to just make yourself incapable of being put in adversarial Parfit’s Hitchhiker situations in advance.
> I think that the example of mind-policing predictors is sufficient to show that there is no free lunch in decision theory. For every decision theory, there is a mind-police predictor that will destroy it.
On one hand, this is true. On the other, I personally shy away from mind-police-type situations because they can trivially be applied to any decision theory. I think, when I mentioned No-Free-Lunch for decision theory, it was in reference specifically to non-mere-correlation-management strategies in our universe as it currently exists.
For instance, given certain assumptions, we can make claims about which decision theories are good. CDT works amazingly well in the class of universes where agents know the consequences of all their actions. FDT (I think) works amazingly well in the class of universes where agents know how non-merely correlated their decisions are with events in the universe but don’t know why those correlations exist.
But I’m not sure whether, in our actual universe, FDT is a practically better thing to do than CDT. Non-mere correlations only really pertain to predictors (i.e., other agents), and I’d expect the perception of non-mere correlations to be heavily adversarially manipulated: “identifying non-mere correlations and decorrelating them” is a really good way to exploit predictors, and “creating the impression that correlations are non-mere” is a really good way for predictors to exploit FDT.
Because of this, FDT strikes me as performing better than CDT in a handful of rare scenarios but as possibly subject, overall, to some no-free-lunch theorem that applies specifically to the kind of universe that we are in. I guess that’s what I’m thinking about here.
> I’m skeptical of this. Non-mere correlations are consequences of an agent’s source-code producing particular behaviors that a predictor can use to gain insight into the source-code itself. If an agent adaptively and non-permanently modifies its source-code, this (from the perspective of a predictor who suspects as much) de-correlates its current source-code from the non-mere correlations of its past behavior, essentially destroying the meaning of non-mere correlations to the extent that the predictor is suspicious.
Oh yes, I agree with what you mean. When I brought up the idea of an agent strategically acting certain ways or overwriting itself to confound the predictions that adversarial predictors make, I had in mind that the correlations such predictors use could be non-mere w.r.t. the reference class of agents these predictors usually deal with, but still confoundable by our design of the agent and thereby not non-mere to us.
> For instance, given certain assumptions, we can make claims about which decision theories are good. CDT works amazingly well in the class of universes where agents know the consequences of all their actions. FDT (I think) works amazingly well in the class of universes where agents know how non-merely correlated their decisions are with events in the universe but don’t know why those correlations exist.
+1 to this. I agree that this is the right question to be asking, that it depends on a lot of assumptions about how adversarial an environment is, and that FDT does indeed seem to have some key advantages.
Also as a note, sorry for some differences in terminology between this post and the one I linked to on my Medium blog.