I really like this analysis. Luckily, with the right framework, I think these questions, though highly difficult, are technical rather than philosophical. This seems like a hard question of priors, not a hard question of framework.
Yes—I agree with this. It turns the question “What is the best source code for making decisions when the situations you are placed in depend on that source code?” into a question more like “Okay, since there are a bunch of decisions that are contingent on source code, which ones do we expect to actually happen, and with what frequency?” And this is something we can, in principle, reason about (i.e., we can speculate on what incentives we’d expect predictors to have and try to estimate the probabilities of different situations arising).
I speculate that, in practice, an agent could be designed to adaptively and non-permanently modify its actions and source code to slip past many such situations, fooling predictors who exploit non-mere correlations whenever doing so is helpful.
I’m skeptical of this. Non-mere correlations are consequences of an agent’s source code producing particular behaviors that a predictor can use to gain insight into the source code itself. If an agent adaptively and non-permanently modifies its source code, then (from the perspective of a predictor who suspects as much) this de-correlates its current source code from the non-mere correlations of its past behavior—essentially destroying the meaning of those non-mere correlations to the extent that the predictor is suspicious.
Maybe there’s a clever way to get around this. But, to illustrate the problem with a claim from your blog:
For example, a dynamic user of CDT could avoid being destroyed by a mind reader with zero tolerance for CDT by modifying its hardware to implement EDT instead.
This is true for a mind reader that is directly looking at source code, but it is untrue for predictions relying on non-mere correlations. To such a predictor, a dynamic user of CDT who has just updated to EDT would have a history of CDT behavior and non-mere correlations associated mostly with CDT. Now, one of two things might happen:
1. The predictor classifies the agent as CDT and kills it.
2. The predictor classifies the agent as a dynamic user of CDT, predicts that it has updated to EDT, and does not kill it.
Option 1 isn’t great because the agent gets killed. Option 2 also isn’t great because it implies predictors have access to non-mere correlations strong enough to indicate that a given agent can dynamically update. This is risky because now any predictor that leverages these non-mere correlations to conclude that another agent is dynamic can potentially benefit from adversarially pushing that agent to modify into a more exploitable source code. For example, a predictor might lead the agent to believe that it kills all agents it predicts aren’t EDT when, in actuality, it doesn’t care about that at all and simply subjects all the newly minted EDT agents to XOR blackmail.
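To make the two outcomes concrete, here is a toy sketch (my own, not from the post) of a predictor that classifies an agent purely from its behavioral history. The history, the labels, and both classification rules are invented for illustration.

```python
# Behavioral history of an agent that ran CDT for 9 rounds,
# then silently self-modified to EDT on round 10.
history = ["CDT"] * 9 + ["EDT"]

def classify(history, suspects_self_modification=False):
    """Guess the agent's *current* source code from its past behavior."""
    if not suspects_self_modification:
        # Option 1: a naive predictor takes a majority vote over the
        # history, so the non-mere correlations of past CDT behavior
        # dominate and the agent is classified (and killed) as CDT.
        return max(set(history), key=history.count)
    # Option 2: a suspicious predictor discounts the history entirely,
    # because a dynamic agent's past behavior no longer correlates with
    # its current code, and goes by the most recent behavior instead.
    return history[-1]

print(classify(history))        # -> CDT  (option 1)
print(classify(history, True))  # -> EDT  (option 2)
```

The sketch also shows why the history loses its meaning: once the predictor entertains self-modification at all, the nine rounds of CDT behavior stop constraining its guess.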
There are also other practical concerns. For instance, an agent capable of self-modifying its source code can, in principle, guarantee a precommitment by modifying part of its code to catastrophically self-destruct if the agent either fails to follow through on the precommitment or appears to be tampering with that piece of code. This is similar to the issue I mention in my “But maybe FDT is still better?!” section: it might be advantageous to simply make yourself incapable of being put in adversarial Parfit’s Hitchhiker situations in advance.
I think that the example of mind-policing predictors is sufficient to show that there is no free lunch in decision theory. For every decision theory, there is a mind-police predictor that will destroy it.
On one hand, this is true. On the other, I personally shy away from mind-police-type situations because they can trivially be applied to any decision theory. When I mentioned No Free Lunch for decision theory, it was in reference specifically to Non-Mere Correlation Management strategies in our universe as it currently exists.
For instance, given certain assumptions, we can make claims about which decision theories are good. CDT, for example, works amazingly well in the class of universes where agents know the consequences of all their actions. FDT (I think) works amazingly well in the class of universes where agents know how non-merely correlated their decisions are with events in the universe but don’t know why those correlations exist.
But I’m not sure whether, in our actual universe, FDT is practically better than CDT. Non-mere correlations really only pertain to predictors (i.e., other agents), and I’d expect the perception of non-mere correlations to be heavily adversarially manipulated: “identifying non-mere correlations and de-correlating from them” is a really good way to exploit predictors, and “creating the impression that correlations are non-mere” is a really good way for predictors to exploit FDT.
Because of this, FDT strikes me as performing better than CDT in a handful of rare scenarios but as possibly subject, overall, to some no-free-lunch theorem that applies specifically to the kind of universe we are in. I guess that’s what I’m thinking about here.
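The stakes here can be made concrete with a standard Newcomb-style expected-value check (my own sketch; the payoffs are the conventional ones, and treating predictor accuracy p as the “genuineness” of the correlation is my simplification). One-boxing, the FDT-style choice, only beats two-boxing, the CDT choice, when the correlation between decision and prediction is genuinely non-mere, i.e. when p is meaningfully above chance. If a predictor merely fakes the impression of a non-mere correlation, the effective p is 0.5 and the FDT-style choice loses.

```python
BIG, SMALL = 1_000_000, 1_000

def ev_one_box(p):
    # Predictor fills the big box with probability p.
    return p * BIG

def ev_two_box(p):
    # Big box is empty with probability p; small box is always taken.
    return (1 - p) * BIG + SMALL

for p in (0.5, 0.501, 0.9):
    better = "one-box" if ev_one_box(p) > ev_two_box(p) else "two-box"
    print(f"p = {p}: {better} is better")
```

With these payoffs the crossover sits just above p = 0.5, so almost any genuinely non-mere correlation favors the FDT-style choice—but a merely-faked correlation (effective p = 0.5) favors CDT, which is the manipulation worry above.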
I’m skeptical of this. Non-mere correlations are consequences of an agent’s source code producing particular behaviors that a predictor can use to gain insight into the source code itself. If an agent adaptively and non-permanently modifies its source code, then (from the perspective of a predictor who suspects as much) this de-correlates its current source code from the non-mere correlations of its past behavior—essentially destroying the meaning of those non-mere correlations to the extent that the predictor is suspicious.
Oh yes, I agree with what you mean. When I brought up the idea of an agent strategically acting in certain ways, or overwriting itself, to confound the predictions of adversarial predictors, I had in mind that the correlations such predictors use could be non-mere with respect to the reference class of agents these predictors usually deal with, but still confoundable by our design of the agent—and thereby mere from our perspective.
For instance, given certain assumptions, we can make claims about which decision theories are good. CDT, for example, works amazingly well in the class of universes where agents know the consequences of all their actions. FDT (I think) works amazingly well in the class of universes where agents know how non-merely correlated their decisions are with events in the universe but don’t know why those correlations exist.
+1 to this. I agree that this is the right question to be asking, that it depends on a lot of assumptions about how adversarial an environment is, and that FDT does indeed seem to have some key advantages.
Also as a note, sorry for some differences in terminology between this post and the one I linked to on my Medium blog.