(a) You are actually using causal graphs. Show me a single accepted definition of evidential decision theory that allows you to do that (or more precisely that defines, as a part of its decision rule definition, what a causal graph is).
(b) You have to somehow be able to make decisions in the real world. What sort of data do you need to be able to apply your decision rule, and what is the algorithm that gives you your rule given this data?
(a) Well, you don’t really need a causal graph; a probability distribution for the agent’s situation will do, although it might be convenient to represent it as a causal graph. Where I have described the use of causal graphs above, they are merely a component of the reasoning used to infer your probability distribution within probability theory.
(b) That is, a set of hypotheses you might consider would include G = “the phenomenon I am looking at behaves in a manner described by graph G”. Then you calculate the posterior probability P(G | data) × the joint distribution over the variables of the agent’s situation given G, and integrate over G to get the posterior distribution for the agent’s situation.
Given that, you decide what to do based on expected utility with P(outcome | action, data). Obviously, the above calculation is highly nontrivial. In principle you could just use some universal prior (i.e., Solomonoff induction) to calculate the posterior distribution for the agent instead, but that’s even less practical.
In practice you can often approximate this whole process fairly well by assuming that the only difference between our situation and the data is that our decision is uncorrelated with whatever decision procedure was used in the data, and treating it as an “intervention” (which I think might correspond to just using the most likely G and ignoring all other hypotheses).
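A minimal sketch of the model-averaging procedure described above; the function names and data structures here are illustrative assumptions, not any standard API:

```python
# Sketch of: average the predictive distribution over graph hypotheses G,
# weighted by P(G | data), then pick the action with the highest expected utility.
# "likelihood" and "predict" stand in for whatever inference machinery one
# actually has for each graph hypothesis.

def posterior_predictive(action, graphs, prior, likelihood, predict, data):
    """Approximate P(outcome | action, data) by mixing over graph hypotheses."""
    # Unnormalized posterior weight for each hypothesis: P(G) * P(data | G)
    weights = {g: prior[g] * likelihood(data, g) for g in graphs}
    total = sum(weights.values())
    mixed = {}
    for g, w in weights.items():
        for outcome, p in predict(action, g, data).items():
            mixed[outcome] = mixed.get(outcome, 0.0) + (w / total) * p
    return mixed

def choose_action(actions, utility, **model):
    """Pick the action maximizing expected utility under the averaged predictive."""
    def expected_utility(a):
        return sum(p * utility(o) for o, p in posterior_predictive(a, **model).items())
    return max(actions, key=expected_utility)
```

The hard part, of course, is supplying the likelihood and predictive distribution for each graph hypothesis; the sketch only shows the averaging and decision step.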
(a) Well, you don’t really need a causal graph; a probability distribution for the agent’s situation will do, although it might be convenient to represent it as a causal graph. Where I have described the use of causal graphs above, they are merely a component of the reasoning used to infer your probability distribution within probability theory.
Well, you have two problems here. The first (and bigger) problem is that you are committing an ontological error (probability distributions are not about causality but about uncertainty; it doesn’t matter whether you are Bayesian or frequentist about it). The second (smaller, but still significant) problem is that probability distributions by themselves do not contain the information that you want. In other words, you don’t get identifiability of the causal effect in general if all you are given is a probability distribution. To borrow a metaphor Judea likes to use: if you have a complete surface description of how light reflects off an item (say a cup), you can construct a computer graphics engine that renders the cup from any angle. But there is no information there about how the cup is to be rendered under deformation (that is, if I smash the cup on the table, what will it look like?).
Observed joint probability distributions correspond to the surface information; interventional distributions correspond to information about what happens after deformations. It might be informative to consider how your (Bayesian?) procedure would work in the cup example. The analogy is almost exact: the set of interventional densities is a much bigger set than the set of observed joint distributions.
I would be very interested in what you think the right decision rule is for my 5 node HAART example.
In my example you don’t have to average over possible graphs, because my hypothetical is that we know what the correct graph is (and what the correct corresponding distribution is).
Presumably your answer will take the form of either [decision rule given some joint probability distribution that does not mention any causal language] or “not enough information for an answer.”
If your answer is the latter, your decision theory is not very good. If the former (and by some miracle the decision rule gives the right answer), I would be very interested in a (top level?) post that works out how you recover the correct properties of causal graphs from just probability distributions. If correct, you could easily publish this in any top statistics journal and revolutionize the field. My intuition is that 100 years of statistics is not in fact wrong, and as you start dealing with more and more complex problems (I can generate an inexhaustible list of these), a lot of “gotchas” will come up that causal folks have already dealt with. In order to deal with these “gotchas” you will have to keep modifying your proposal until, effectively, you have just reinvented intervention calculus.
I would be very interested in what you think the right decision rule is for my 5 node HAART example. In my example you don’t have to average over possible graphs, because my hypothetical is that we know what the correct graph is (and what the correct corresponding distribution is).
Your graph describes the stochastic process that generated the data. The agent needs a different one to model the situation it is facing. If it uses the right graph (or, more generally, the right joint probability distribution, which doesn’t have to be factorizable), then it will get the right answer.
How to go from a set of data and a model of the data-generating process to a model of the agent’s situation is, of course, a nontrivial problem, but it is not part of the agent’s decision problem.
Ok, these problems I am posing are not abstract; they are concrete problems in medical decision making. In light of http://lesswrong.com/lw/jco/examples_in_mathematics/, I am going to pose 4 of them right here, then tell you what the right answer is and what assumptions I used to get it. Whatever decision theory you are using needs to be able to correctly represent and solve these problems, using at most the information that my solutions use, or it is not a very good decision theory (in the sense that there exist known alternatives that do solve these problems correctly). In all problems, our utility function penalizes patient deaths or patients getting a disease.
In particular, if you are a user of EDT, you need to give me a concrete algorithm that correctly solves all these problems without using any causal vocabulary. You can’t just beg off on how it’s a “nontrivial problem.” These are decision problems, and solutions to them exist right now, using CDT! Can EDT solve them or not? I have yet to see anyone seriously engage with these (with the notable exception of Paul, who to his credit did try to give a Bayesian/non-causal account of problem 3, but ran out of time).
Note: I am assuming the correct graph and lots of samples, so the effect of a prior is not significant (and thus I talk about empirical frequencies, e.g. p(c)). If we wanted to, we could do a Bayesian problem over possible causal graphs, with a prior, and/or a Bayesian problem for estimation, where we could talk about, for example, the posterior distribution of case histories C. I skipped all that to simplify the examples.
Problem 1:
We perform a randomized controlled trial for a drug (half the patients get the drug, half the patients do not). Some of the patients die (in both groups). Let A be a random variable representing whether the patient in our RCT dataset got the drug, and Y be a random variable representing whether the patient in our RCT dataset died. A new patient comes in who is from the same cohort as those in our RCT. Should we give them the drug?
Solution: Give the drug if and only if E[Y = yes | A = yes] < E[Y = yes | A = no].
Intuition for why this is correct: since we randomized the drug, there are no possible confounders between drug use and death. Any dependence between the drug and death is therefore causal, so we can just look at conditional correlations.
Assumptions used: we need the empirical p(A,Y) from our RCT, and the assumption that the correct causal graph is A → Y. No other assumptions needed.
Ideas here: you should be able to transfer information you learn from observed members of a group to other members of the same group. Otherwise, what is stats even doing?
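A minimal computational sketch of the Problem 1 rule, assuming the RCT data come as a list of (a, y) pairs recording drug assignment and death; the representation is an illustrative assumption:

```python
# Problem 1: compare empirical death rates in the two arms of the RCT.
def death_rate(records, a):
    """Empirical P(Y = yes | A = a) from (a, y) records."""
    outcomes = [y for (ai, y) in records if ai == a]
    return sum(y == "yes" for y in outcomes) / len(outcomes)

def give_drug(records):
    # Give the drug iff the death rate under treatment is lower.
    return death_rate(records, "yes") < death_rate(records, "no")
```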
Problem 2:
We perform an observational study, where doctors assign a drug (or not) based on observed patient vitals recorded in their case history file. Some of the patients die. Let A be a random variable representing whether the patient in our study got the drug, Y be a random variable representing whether the patient in our study died, and C be the random variable representing the patient vitals used by the doctors to decide whether to give the drug or not. A new patient comes in who is from the same cohort as those in our study. If we do not get any additional information on this patient, should we give them the drug?
Solution: Give the drug if and only if \sum_{c} E[Y = yes | A = yes, c] p(c) < \sum_{c} E[Y = yes | A = no, c] p(c).
Intuition for why this is correct: we have not randomized the drug, but we recorded all the information doctors used to decide whether to give the drug. Since the case history C represents all possible confounding between A and Y, conditional on knowing C, any dependence between A and Y is causal. In other words, E[Y | A, C] gives a causal dependence of Y on A, conditional on C. But since we are not allowed to measure anything about the incoming patient, we have to average over the possible case histories the patient might have. Since the patient is postulated to come from the same cohort as those in our study, it is reasonable to average over the observed case histories in our study. This recovers the above formula.
Assumptions used: we need the empirical p(A,C,Y) from our study, and the assumption that the correct causal graph is C → A → Y, C → Y. No other assumptions needed.
Ideas here: this is isomorphic to the smoking lesion problem. You can’t use observed correlations when there are confounders; you have to adjust for confounding properly using the g-formula (the formula in the answer).
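A minimal sketch of the Problem 2 rule, assuming the study data come as (c, a, y) tuples for case history, drug, and death; as before, the representation is an illustrative assumption:

```python
from collections import Counter

# Problem 2: adjust for the measured confounder C with the g-formula,
# i.e. sum_c P(Y = yes | A = a, C = c) * P(C = c), all estimated empirically.
def adjusted_death_rate(records, a):
    n = len(records)
    c_counts = Counter(c for (c, _, _) in records)
    total = 0.0
    for c, n_c in c_counts.items():
        stratum = [y for (ci, ai, y) in records if ci == c and ai == a]
        if not stratum:
            continue  # with "lots of samples" every (c, a) stratum should be populated
        total += (sum(y == "yes" for y in stratum) / len(stratum)) * (n_c / n)
    return total

def give_drug(records):
    return adjusted_death_rate(records, "yes") < adjusted_death_rate(records, "no")
```

The same skeleton, with the weighting changed, carries through to the sketches for Problems 3 and 4 below.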
Problem 3:
We perform a partially randomized and partially observational longitudinal study, where patients are randomly assigned (or not) a drug at time 0, then their vitals at time 1 are recorded in a file, and based on those vitals and the treatment assignment history at time 0, doctors may (or may not) decide to give them more of the drug. Afterwards, at time 2, some patients die (or not). Let A0 be a random variable representing whether the patient in our study got the drug at time 0, A1 be a random variable representing whether the patient in our study got the drug at time 1, Y be a random variable representing whether the patient in our study died, and C be the random variable representing the case history used by the doctors to decide whether to give the drug or not at time 1. A new patient comes in who is from the same cohort as those in our study. If we do not get any additional information on this patient, should we give them the drug, and if so at what time points?
Solution: Use the drug assignment policy (a0, a1) that minimizes \sum_{c} E[Y = yes | A1 = a1, c, A0 = a0] p(c | A0 = a0).
Intuition for why this is correct: we have randomized A0 but have not randomized A1, and we are interested in the joint effect of both A0 and A1 on Y. We know C is a confounder for A1, so we have to adjust for it somehow, as in Problem 2; otherwise an observed dependence between A1 and Y will contain a non-causal component through C. However, C is not a confounder for the relationship between A0 and Y. Conditional on A0 and C, the relationship between A1 and Y is entirely causal, so E[Y | A1, C, A0] is a causal quantity. For the incoming patient we are not allowed to measure C, so we have to average over C, as in Problem 2. But in our case C is an effect of A0, which means we can’t just average over the base rates for case histories; we have to take into account what happened at time 0, in other words the causal effect of A0 on C. Because in our graph there are no confounders between A0 and C, that causal relationship can be represented by p(C | A0) (no confounders means correlation equals causation). Since A0 also has no confounders for Y, E[Y | A1, C, A0], weighted by p(C | A0), gives us the right causal relationship between {A0, A1} and Y.
Assumptions used: we need the empirical p(A0,C,A1,Y) from our study, and the assumption that the correct causal graph is A0 → C → A1 → Y, A0 → A1, A0 → Y, and we possibly allow that there is an unrestricted hidden variable U that is a parent of both C and Y. No other assumptions needed.
Ideas here: simply knowing that you have confounders is not enough; you have to pay attention to the precise causal relationships to figure out what the right thing to do is. In this case, C is a ‘time-varying confounder,’ and it requires a more complicated adjustment that takes into account that the confounder is also an effect of an earlier treatment.
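A minimal sketch of the Problem 3 rule, assuming records as (a0, c, a1, y) tuples; as before, the data layout is an illustrative assumption:

```python
from collections import Counter
from itertools import product

# Problem 3: sum_c P(Y = yes | A1 = a1, C = c, A0 = a0) * P(C = c | A0 = a0),
# estimated empirically, then search over the four static policies (a0, a1).
def policy_death_rate(records, a0, a1):
    given_a0 = [(c, a1i, y) for (a0i, c, a1i, y) in records if a0i == a0]
    c_counts = Counter(c for (c, _, _) in given_a0)
    total = 0.0
    for c, n_c in c_counts.items():
        stratum = [y for (ci, a1i, y) in given_a0 if ci == c and a1i == a1]
        if not stratum:
            continue  # assume large samples populate every stratum
        total += (sum(y == "yes" for y in stratum) / len(stratum)) * (n_c / len(given_a0))
    return total

def best_policy(records):
    # Pick the (a0, a1) assignment with the lowest adjusted death rate.
    return min(product(["yes", "no"], repeat=2),
               key=lambda pol: policy_death_rate(records, *pol))
```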
Problem 4:
We consider a (hypothetical) observational study of coprophagic treatment of stomach cancer. It is known (for the purposes of this hypothetical example) that coprophagia’s protective effect against cancer is due to the presence of certain types of intestinal flora in feces. At the same time, people who naturally engage in coprophagic behavior are not a random sample of the population, and therefore may be more likely than average to end up with stomach cancer. Let A be a random variable representing whether those in our study engaged in coprophagic behavior, let W be the random variable representing the presence of beneficial intestinal flora, let Y be the random variable representing the presence of stomach cancer, and let U be some unrestricted hidden variable which may influence both coprophagia and stomach cancer. A new patient at risk for stomach cancer comes in who is from the same cohort as those in our study. If we do not get any additional information on this patient, should we give them the coprophagic treatment as a preventative measure?
Solution: Yes, if and only if \sum_{w} p(W = w | A = yes) \sum_{a} E[Y = yes | W = w, A = a] p(A = a) < \sum_{w} p(W = w | A = no) \sum_{a} E[Y = yes | W = w, A = a] p(A = a).
Intuition for why this is correct: since W is independent of confounders for A and Y, and A only affects Y through W, the effect of A on Y decomposes/factorizes into an effect of A on W and an effect of W on Y, averaged over the possible values W could take. The effect of A on W is not confounded by anything, and so is equal to p(W | A). The effect of W on Y is confounded by A, but given our assumptions, conditioning on A is sufficient to remove all confounding for that effect, which gives us \sum_{a} p(Y | W, a) p(a). This gives the above formula.
Assumptions used: we need the empirical p(A,W,Y) from our study, and the assumption that the correct causal graph is A → W → Y, and there is an unrestricted hidden variable U that is a parent of both A and Y. No other assumptions needed.
Ideas here: sometimes your independence assumptions let you factorize effects into other effects, similarly to how Bayesian networks factorize joint distributions. This lets you solve problems that might seem unsolvable due to the presence of unobserved confounding.
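A minimal sketch of the Problem 4 rule, assuming records as (a, w, y) tuples for coprophagic behavior, flora, and cancer; the representation is an illustrative assumption:

```python
from collections import Counter

# Problem 4: sum_w P(W = w | A = treat) * sum_a P(Y = yes | W = w, A = a) * P(A = a),
# estimated empirically, comparing treat = "yes" against treat = "no".
def adjusted_cancer_rate(records, treat):
    n = len(records)
    a_counts = Counter(a for (a, _, _) in records)
    w_given_treat = Counter(w for (a, w, _) in records if a == treat)
    n_treat = sum(w_given_treat.values())
    total = 0.0
    for w, n_w in w_given_treat.items():
        inner = 0.0
        for a, n_a in a_counts.items():
            stratum = [y for (ai, wi, y) in records if ai == a and wi == w]
            if stratum:
                inner += (sum(y == "yes" for y in stratum) / len(stratum)) * (n_a / n)
        total += (n_w / n_treat) * inner
    return total

def recommend_treatment(records):
    return adjusted_cancer_rate(records, "yes") < adjusted_cancer_rate(records, "no")
```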
The first (and bigger) problem is that you are committing an ontological error (probability distributions are not about causality but about uncertainty; it doesn’t matter whether you are Bayesian or frequentist about it).
I don’t know what you mean by this. Probability distributions can be about whatever you want — it makes perfect sense to speak of “the probability that the cause of X is Y, given some evidence”.