Ok, I still need to actually find a spare hour to sit down and watch that talk of yours, but the more I think about even your words here, the more I agree with you.
I think CDT might well be the correct decision theory. The correlation between Omega’s prediction of us (as represented in TDT or CDT+E) and our actual choice is not a matter of decision-making, it’s a matter of our beliefs about the world. EDT thus wins at Newcomb’s Problem because it uses a full joint probability distribution, handling both correlation and causation, to represent its beliefs, whereas CDT is “losing” because it has no way to represent beliefs about correlation as separate from “pure” causation. Since I’m way behind on learning the math and haven’t studied Judea Pearl’s textbook yet, is there a form of causal graph that either natively includes or can be augmented with bidirectional correlation edges?
In real life, the correlations wouldn’t even have to be “identity functions” (causing two correlated nodes in the graph to take on the exact same value), they could be any form of invertible function learned by any kind of regression analysis.
We could then apply a simple form of causal decision theory in which part of tracing the causal effects of our potential action is to transmit information about our decision across correlation arrows, up and down the causal graph.
Such a theory would then behave like TDT or CDT+E while being much more mathematically powerful in terms of the correlative beliefs it could discover and represent.
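To make the EDT-vs-CDT contrast concrete, here is a toy payoff calculation for Newcomb’s Problem. The numbers (a 99%-accurate predictor, $1M in the opaque box, $1k in the clear box) are illustrative assumptions, not from the discussion above: EDT conditions its expectation on the action taken, while CDT treats the prediction as causally fixed before the choice.

```python
# Toy Newcomb's Problem comparison (hypothetical numbers: predictor is
# 99% accurate, opaque box holds $1M, transparent box holds $1k).
ACC = 0.99               # P(prediction matches the actual choice)
BIG, SMALL = 1_000_000, 1_000

# EDT conditions on the action: choosing one-box is evidence the box is full.
edt_one_box = ACC * BIG                  # roughly $990k
edt_two_box = (1 - ACC) * BIG + SMALL    # roughly $11k

# CDT intervenes: the prediction p is fixed before the choice,
# so two-boxing dominates for every prior p over "box is full".
p = 0.5  # arbitrary prior; the comparison comes out the same for any p
cdt_one_box = p * BIG
cdt_two_box = p * BIG + SMALL            # always $1k more than one-boxing

print(edt_one_box > edt_two_box)   # EDT one-boxes
print(cdt_two_box > cdt_one_box)   # CDT two-boxes
```

The point of the sketch is only that the two theories disagree because they compute the expectation differently, not because they disagree about the physical facts.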
Since I’m way behind on learning the math and haven’t studied Judea Pearl’s textbook yet, is there a form of causal graph that either natively includes or can be augmented with bidirectional correlation edges?
Sure is, but you have to be careful. You can draw whatever type of edge you want; the trick is to carefully define what the particular type of edge means (or, to be more precise, what the absence of that type of edge means).
Generally Pearl et al. use a bidirected edge A <-> B to mean “there exists some hidden common cause(s) of A and B that I don’t want to bother to draw,” e.g. the real graph is A ← H → B, where H is hidden. Or possibly there are multiple H nodes… Or, again more precisely, the absence of such an edge means there are no such hidden common causes. I use these sorts of graphs in my talk, my papers, my thesis, etc. They are called latent projections in Verma and Pearl 1990, and some people call this type of graph an ADMG (an acyclic directed mixed graph).
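A minimal simulation of the latent-projection idea (a hypothetical linear-Gaussian model, not from the text above): marginalizing out the hidden H from A ← H → B leaves a dependence between A and B with no directed edge between them, which is exactly what the bidirected edge A <-> B records.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Full DAG: A <- H -> B, where H is hidden.
H = rng.normal(size=n)
A = H + rng.normal(size=n)
B = H + rng.normal(size=n)

# After marginalizing out H, the latent projection is just A <-> B:
# no directed edge either way, yet A and B remain dependent.
print(round(np.corrcoef(A, B)[0, 1], 2))  # ≈ 0.5 for these variances
```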
I am not entirely clear on what edge you want; maybe you want an edge to denote a deterministic constraint between nodes. That is also possible: I think the D-separation (capital D) in Dan Geiger’s thesis handles these. Most of this was worked out in the late ’80s and early ’90s.
Even in a simple 4-node graph you can have different types of correlation structure. For example:
A → B <-> C ← D
denotes an independence model where
A is independent of D
A is independent of C given D
B is independent of D given A
This generally corresponds to a hidden common cause between B and C. (*)
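The three independences above can be checked by simulation. Here is a hypothetical linear-Gaussian realization of A → B <-> C ← D, with the bidirected edge implemented as a hidden H feeding both B and C; note also that conditioning on the collider B creates dependence between A and C, as d-separation predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
A = rng.normal(size=n)
D = rng.normal(size=n)
H = rng.normal(size=n)            # hidden common cause behind the <-> edge
B = A + H + rng.normal(size=n)
C = D + H + rng.normal(size=n)

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

def resid(y, x):                  # residual of y after linear regression on x
    return y - x * (x @ y) / (x @ x)

print(round(corr(A, D), 2))                       # ≈ 0: A independent of D
print(round(corr(A, C), 2))                       # ≈ 0: A independent of C
print(round(corr(resid(A, B), resid(C, B)), 2))   # nonzero: conditioning on B links A and C
```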
We can also have:
A → B – C ← D
This corresponds to an independence model:
A is independent of D
A is independent of C given B and D
B is independent of D given A and C
This does not correspond to a hidden common cause of B and C, but to the equilibrium distribution of a feedback process between B and C under fixed values of A and D. These types of graphs are known as “chain graphs” and were developed by a fellow at Oxford named Steffen Lauritzen.
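The feedback reading of the undirected edge can be sketched with a toy Gibbs-style process (the linear coefficients and noise scales here are arbitrary assumptions): B and C repeatedly update in response to each other under fixed A and D, and we sample from the resulting equilibrium.

```python
import numpy as np

rng = np.random.default_rng(0)

def equilibrium_sample(a, d, steps=40):
    """One draw from the equilibrium of a B – C feedback loop,
    holding the context variables A=a and D=d fixed (toy Gaussian dynamics)."""
    b = c = 0.0
    for _ in range(steps):                 # alternating Gibbs-style updates
        b = 0.5 * a + 0.4 * c + rng.normal()
        c = 0.5 * d + 0.4 * b + rng.normal()
    return b, c

# Sample the equilibrium many times under fixed A and D.
draws = np.array([equilibrium_sample(1.0, -1.0) for _ in range(5_000)])
B, C = draws[:, 0], draws[:, 1]
print(round(np.corrcoef(B, C)[0, 1], 2))   # B and C are correlated at equilibrium
```

The correlation between B and C here comes from the feedback itself, not from any hidden common cause, which is the distinction the chain-graph edge is making.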
You may also have something like this:
A → B → S ← C ← D
where S is a common effect of B and C that attains some specific value but isn’t recorded. This corresponds to an independence model
A is independent of C and D given B
D is independent of A and B given C
This case corresponds to outcome-dependent sampling (e.g. when people do case-control studies for rare diseases and select one arm of the trial among those who are already sick, so the sample isn’t random). This independence model actually corresponds to an undirected graphical model (Markov random field), because of the way conditioning on a node affects the node’s ancestors in the graph.
(*) But not always. We can set up a quantum mechanical experiment that mirrors the above graph, and then note that in any hidden variable DAG with an H instead of a <-> edge, there is an inequality constraint that must hold on p(A,B,C,D). In fact, this inequality is violated experimentally, which means there is no hidden variable H in quantum mechanics… or some other seemingly innocuous assumption is not right.
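The inequality constraint in question is of Bell/CHSH type, and the standard numbers are easy to reproduce. Under any hidden-variable DAG the CHSH combination satisfies |S| ≤ 2, while quantum mechanics predicts singlet correlations E(a, b) = −cos(a − b); the angles below are the textbook choice that maximizes the violation.

```python
import math

# CHSH check: a hidden common cause H for the two outcomes implies |S| <= 2,
# but the quantum singlet-state prediction is E(a, b) = -cos(a - b).
E = lambda a, b: -math.cos(a - b)

a1, a2 = 0.0, math.pi / 2               # Alice's two measurement angles
b1, b2 = math.pi / 4, 3 * math.pi / 4   # Bob's two measurement angles

S = E(a1, b1) - E(a1, b2) + E(a2, b1) + E(a2, b2)
print(round(abs(S), 3))  # 2.828 ≈ 2*sqrt(2) > 2: no hidden-variable H fits
```

The experimentally observed violation of the bound is what rules out the A ← H → B reading of the bidirected edge in this setting.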
So sometimes we can draw <-> simply to denote a conditional independence model that resembles those you get from a DAG with unobserved variables …. except Nature is annoying and doesn’t actually have any underlying DAG.
If you are confused by this, you are in good company! I am still thinking very hard about what this means.
edit: Mysterious comment just for fun: it is sufficient to have a graph class with → edges, <-> edges in the Pearl sense, and – edges in the Lauritzen sense that is “closed” with respect to “interesting” operations. “Closed” means we can apply an operation and stay in the graph class: DAGs aren’t closed under marginalization; if we marginalize a DAG, we sometimes get something that isn’t a DAG. An “interesting” operation would be something like conditioning: we can get independence after conditioning, which reduces the dimension of a model (fewer parameters are needed when there is independence).
So sometimes we can draw <-> simply to denote a conditional independence model that resembles those you get from a DAG with unobserved variables …. except Nature is annoying and doesn’t actually have any underlying DAG.
If you are confused by this, you are in good company! I am still thinking very hard about what this means.
Strangely enough, I’m not confused by it: until someone reduces quantum mechanics to some lower-level non-quantum physics (which, apparently, a few people are actually working on), I’ve just gone and accepted that the real causative agent in Nature is a joint probability distribution that is allowed to set a whole tuple of nonlocal outcome variables as it evolves.
But anyway, yes, this means that’s roughly the kind of “correlation arrow” I think should be drawn in a CDT causal graph to handle Newcomblike problems, with CDT being just very slightly modified to actually make use of those correlative arrows in setting its decision.
That would get us at least as far as CDT+E does, while also reducing the problem of discovering the “entanglements” to actually just learning correct beliefs about correlative arrows, hidden variables or no hidden variables.
I would again like to hear what’s going on in the Counterfactual Mugging, as that looks like the first situation we cannot actually beat by learning correct causative and correlative beliefs, and then applying a proper “Causal and Correlative” Decision Theory.
Anyway, sometime this evening or something I’m going to watch your lecture, and email you for the slides.