Let’s look at a slightly more complicated example:
A patient comes into the hospital because he’s sick, his vitals are taken (C), and based on C, a doctor prescribes medication A. Sometime later, the patient dies (Y). Say this happens for lots of patients, and we form an empirical distribution p(C,A,Y) from these cases. Things we may want to represent:
marginal probability of death: p(Y)
the policy the doctor followed when prescribing the medicine: p(A|C)
the probability that someone would die given that they were given the medicine: p(Y|A)
the “causal effect” of medicine on death: ???
The issue with the “causal effect” is that the doctor’s policy, if it is sensible, will be more likely to prescribe A to people who are already very sick, and not prescribe A to people who are mostly healthy. Thus it may very well turn out that p(Y|A) is higher than the probability p(Y) (this happens for example with HIV drugs). But surely this doesn’t mean that medicine doesn’t help! What we need is the probability of death given that the doctor arbitrarily decided to give medicine, regardless of background status C. This arbitrary decision “decouples” the influence of the patient’s health status and the influence of the medicine in the sense that if we average over health status for patients after such a decision was made, we would get just the effect of the medicine itself.
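Here is a minimal simulation sketch of this situation (all numbers are hypothetical and chosen only for illustration): health status C drives both the doctor’s prescribing and the death risk, while the drug itself halves the risk within each stratum. Conditioning on A alone still makes the drug look harmful.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# C = 1 means "very sick"; hypothetical prevalence of 30%
C = rng.binomial(1, 0.3, n)

# Doctor's policy p(A|C): mostly treats the very sick
A = rng.binomial(1, np.where(C == 1, 0.9, 0.1))

# Assumed true effect: the drug halves death risk within each stratum
base_risk = np.where(C == 1, 0.40, 0.05)
Y = rng.binomial(1, np.where(A == 1, base_risk / 2, base_risk))

print("p(Y)           =", Y.mean())          # ~0.10
print("p(Y | A=1)     =", Y[A == 1].mean())  # ~0.16, looks worse than p(Y)

# Averaging over C with population weights p(C) (the "arbitrary decision") instead:
p_do = sum(Y[(A == 1) & (C == c)].mean() * (C == c).mean() for c in (0, 1))
print("p(Y | do(A=1)) =", p_do)              # ~0.08, the drug actually helps
```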
For this we need to distinguish an “arbitrary” decision to give medicine, divorced from the policy. Call this arbitrary decision A′. You may then claim that we can just use rules of conditional probabilities on events Y, C, A, and A′, and we do not need to involve anything else. Of course, if you try to write down sensible axioms that relate A′ to Y, C, A, you will notice that essentially you are writing down the standard axioms of potential outcomes (one way of thinking about potential outcomes is that they are random variables after a hypothetical “arbitrary decision” like A′). For example, if we were to arbitrarily decide to prescribe medicine precisely for the patients the doctor prescribed medicine to, we would get that p(Y|A) = p(Y|A′) (this is known as the consistency axiom). You will find that your A′ functions as an intervention do(a).
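In potential-outcome notation (writing Y(a) for the outcome under the arbitrary decision A′ = a; the notation here is mine, the content is the standard axiom mentioned above), consistency reads roughly:

```latex
Y(a) = Y \ \text{ whenever } A = a,
\qquad\text{hence}\qquad
p\bigl(Y(a)=y \mid A=a\bigr) \;=\; p\bigl(Y=y \mid A=a\bigr).
```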
There is a big literature on how intervention events differ from observation events (they have different axiomatizations, for example). You may choose to use a notation for interventions that does not look different from observations, but the underlying math will be different if you wish to be sensible. That is the point. You need new math beyond math about evidence (which is what probability theory is).
It seems to me that we often treat EDT decisions with some sort of hindsight bias. For instance, given that we know that the action A (turning on sprinklers) doesn’t increase the probability of the outcome O (rain), it looks very foolish to do A. Likewise, a DT that suggests doing A may look foolish. But isn’t the point here that the deciding agent doesn’t know that? All he knows is that P(E|A)>P(E) and P(O|E)>P(O). Of course A still might have no or even a negative causal effect on O, but we have more reason to believe otherwise. To illustrate that, consider the following scenario:
Imagine you find yourself in a white room with one red button. You have no idea why you’re there or what this is all about. During the first half hour you are undecided whether you should press the button. Finally your curiosity dominates other considerations and you press the button. Immediately you feel a blissful release of happiness hormones. If you use induction, it seems plausible to infer that, considering certain time intervals (for instance of 1 minute), P(bliss|button) > P(bliss). Now the effect has ceased and you wish to be shot up again. Is it now rational to press the button a second time? I would say yes, and I don’t think that this is controversial. And since we can repeat the pattern with further variables, it should also work with the example above.
From that point of view it doesn’t seem foolish at all—talking about the sprinkler again—to have a non-zero credence in A (turning the sprinkler on) increasing the probability of O (rain). In situations with so little knowledge and no further counter-evidence (which, for instance, might suggest that A has no or a negative influence on O), this should lead an agent to do A.
Considering the doctor again, I think we have to be clear about what the doctor actually knows. Let’s imagine a doctor who has lost all his knowledge about medicine. Now he reads one study which shows that P(Y|A) > P(Y). It seems to me that given that piece of information (and only that!) a rational doctor shouldn’t do A. However, once he reads the next study he can figure out that C (the trait “cancer”) is confounding the previous assessment, because most patients who are treated with A show C as well, whereas most ~A patients don’t. This update (depending on the respective probability values) will then lead to a shift favoring the action A again.
To summarize: I think many objections against EDT fail once we really clarify what the agent knows in each case. In scenarios with little knowledge, EDT seems to give the right answers. Once we add further knowledge, an EDT agent updates its beliefs and won’t turn on the sprinkler in order to increase the probability of rain. As we know from hindsight bias, it can be difficult to imagine what would actually be different if we didn’t know what we know now.
Maybe that’s all riddled with flaws, so if you find some, please hand me the lottery tickets ; )
I don’t get the relevance of this p(Y|A).
In EDT, you condition on both actions and observations.
In this case, the EDT doctor prescribes argmin_A p(Y | A & C)
What’s the problem with that?
There is no problem with what the doctor is doing. The doctor is trying to minimize the number of deaths given that (s)he measures C, as you said.
The question is, how do we quantify the effect of medicine A on death? In other words, how do you answer the question “does medicine A help or hurt?” given that you know p(Y,A,C)? This is where you don’t want to use p(Y | A), because sicker people both die more and get the medicine more often, so you might be misled into thinking that giving people A increases the risk of death.
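One way to see where p(Y | A) goes wrong is to write both quantities as averages over C. The adjustment formula below assumes health status C is the only confounder, which is a modelling assumption rather than something p(Y,A,C) itself can tell you:

```latex
p(Y \mid A=a) = \sum_{c} p(Y \mid A=a, C=c)\, p(C=c \mid A=a)
\qquad\text{vs.}\qquad
p(Y \mid \mathrm{do}(A=a)) = \sum_{c} p(Y \mid A=a, C=c)\, p(C=c).
```

The first averages the strata with the weights induced by the doctor’s policy, p(C|A), which over-represents the sick among the treated; the second uses the population weights p(C).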
Only if you ignore the symptoms.
In medicine you want answer questions of the type “given symptoms C, does medicine A help or hurt?”
Often, but not always (one common issue is that C can be very large). Even if you measure all the symptoms and are interested in the effect of the medicine conditional on these symptoms (what epidemiologists call “effect modification”), there is the question of confounders you did not measure, which would prevent p(Y | A, C) from being equal to the effect you want, namely p(Y | do(A), C).
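As a sketch of why an unmeasured confounder breaks this, suppose U is a pre-treatment common cause of A and Y that you did not record (U is my notation, introduced just for this sketch). Then

```latex
p(Y \mid \mathrm{do}(a), c) = \sum_{u} p(Y \mid a, c, u)\, p(u \mid c)
\;\ne\;
\sum_{u} p(Y \mid a, c, u)\, p(u \mid a, c) = p(Y \mid a, c)
```

in general; the two coincide only if A is independent of U given C, which is precisely what randomization (or measuring enough covariates) is supposed to buy you.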
I suppose that’s what randomized trials are for.
Or you can read my dissertation if you want to answer these types of questions but can’t randomize :).
That’s Simpson’s paradox. CronoDAS argues that EDT fails at it.
No, this is not Simpson’s paradox. Or rather, the reason Simpson’s paradox is counterintuitive is precisely the same reason that you should not use conditional probabilities to represent causal effects. That reason is “confounders,” that is, common causes of both the variable you condition on and your outcome.
Simpson’s paradox is a situation where:
P(E|C) > P(E|not C), but
P(E|C,F) < P(E|not C,F) and P(E|C,not F) < P(E|not C, not F).
If instead of conditioning on C, we use “arbitrary decisions” or do(.), we get that if
P(E|do(C),F) < P(E|do(not C),F), and P(E|do(C),not F) < P(E|do(not C), not F), then
P(E|do(C)) < P(E|do(not C))
which is the intuitive conclusion people like. The issue is that the first set of inequalities is a perfectly consistent set of constraints on a joint density P(E,C,F). However, people want to interpret that set of constraints causally, e.g. as a guide for making a decision on C. But decisions are not evidence; decisions are do(.). Hence you need the second set of inequalities, which do have the property of “conserving inequalities by averaging” that we want.
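The averaging step, spelled out (assuming the decision on C does not change the distribution of F, i.e. F is a pre-treatment variable):

```latex
P(E \mid \mathrm{do}(C)) = \sum_{f} P(E \mid \mathrm{do}(C), F=f)\, P(F=f),
\qquad
P(E \mid \mathrm{do}(\lnot C)) = \sum_{f} P(E \mid \mathrm{do}(\lnot C), F=f)\, P(F=f).
```

Both arms are averaged with the same weights P(F=f), so the stratum-wise inequalities carry over to the marginal one. The observational P(E|C) and P(E|not C) instead average with different weights, P(F|C) versus P(F|not C), and that mismatch is exactly what allows the reversal.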
See also: http://bayes.cs.ucla.edu/R264.pdf
In my example, the issue was that p(Y|do(A)) was different from p(Y|A) due to the confounding effect of “health status” C. The point is that interventions remove the influence of confounding common causes by “being arbitrary” and not depending on them.
EDT, and more generally standard probability theory, simply fails on causality due to a lack of explicit axiomatization of causal notions like “confounder” or “effect.”
Your linked pdf does not exist.
Fixed—there was an unintended period at the end. Sorry about that.
Thanks! That’s a really nice summary.