See, it's things like this that make people have the negative opinion of LW as a quasi-religion that they do. I am willing to wager that your understanding of “the parametric g-formula” is actually based on a google search or two. Yet despite this, you are willing to make (dogmatic, dismissive, and wrong) Bayesian-sounding pronouncements about it. In fact, the g-formula is just how you link do(.) and observational data, nothing more, nothing less. do(.) is defined in terms of the g-formula in Pearl’s chapter 1.
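For reference, in the two-stage setting discussed below (treatment A0, an intermediate measurement L0, treatment A1, outcome Y), one standard way of writing the g-formula is, in the thread’s own notation (my paraphrase of the usual form, not a quote from Pearl):

    P(Y | do(A0=a0, A1=a1)) = \sum_{L0} P(Y | A0=a0, L0, A1=a1) p(L0 | A0=a0)

That is, an interventional quantity is expressed entirely in terms of the observed joint density.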
You’re probably right. Not that this matters much. The reason I said that is because the few papers I could find on the g-formula were all in the context of using it to find out “whether HAART kills people”, and none of them gave any kind of justification or motivation for it, or even mentioned how it related to probabilities involving do().
No. EDT is not allowed to talk about “confounders” or “causes” or “do(.)”. There is nothing in any definition of EDT in any textbook that allows you to refer to anything that isn’t a function of the observed joint density.
Did you read what I wrote? Since action and outcome do not have any common causes (conditional on observations), P(outcome | action, observations) = P(outcome | do(action), observations). I am well aware that EDT does not mention do. This does not change the fact that this equality holds in this particular situation, which is what allows me to say that EDT and CDT have the same answer here.
Re: principle of charity, it’s very easy to get causal questions wrong.
Postulating “just count up how many samples have the particular action and outcome, and ignore everything else” as a decision theory is not a complicated causal mistake. This was the whole point of the hamster example. This method breaks horribly on the simplest dataset with a bit of irrelevant data.
ETA: [responding to your edit]
My understanding of EDT is that the condition would be:
Give HAART at A0,A1 iff E[death | A0=yes, A1=yes] < E[death | A0=no, A1=no]
No, this is completely wrong, because it ignores the fact that the action the EDT agent considers is “I (the EDT agent) give this person HAART”, not “be a person who decides whether to give HAART based on the metrics L0, and also give this person HAART”, which isn’t something it’s possible to “decide” at all.
Since action and outcome do not have any common causes (conditional on observations), P(outcome | action, observations) = P(outcome | do(action), observations).
In my example, A0 has no causes (it is randomized) but A1 has a common cause with the outcome Y (this common cause is the unobserved health status, which is a parent of both Y and L0, and L0 is a parent of A1). L0 is observed but you cannot adjust for it either because that screws up the effect of A0.
To get the right answer here, you need a causal theory that connects observations to causal effects. The point is, EDT isn’t allowed to just steal causal theory to get its answer without becoming a causal decision theory itself.
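For concreteness, the graph being described, as I read it (with U standing for the unobserved health status, and possibly a direct A0 -> Y edge as well), has the edges:

    A0 -> L0,  U -> L0,  U -> Y,  L0 -> A1,  A1 -> Y

with A0 randomized (no parents) and U unobserved.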
In my example, A0 has no causes (it is randomized) but A1 has a common cause with the outcome Y (this common cause is the unobserved health status, which is a parent of both Y and L0, and L0 is a parent of A1). L0 is observed but you cannot adjust for it either because that screws up the effect of A0.
Health status is screened off by the fact that L0 is an observation. At the point where you (the EDT agent) decide whether to give HAART at A1, the relevant probability for purposes of calculating expected utility is P(outcome=Y | action=give-haart, observations=[L0, this dataset]). The effect of action on the unobserved health-status, and through it on Y, is screened off by conditioning on L0.
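In graph terms (my gloss of the claim): L0 is A1’s only parent, so once L0 (and the earlier treatment A0) are conditioned on, every back-door path from A1 to Y, in particular A1 <- L0 <- U -> Y with U the unobserved health status, is blocked, which is what gives

    P(Y | A1, L0, A0, dataset) = P(Y | do(A1), L0, A0, dataset)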
That’s right, but as I said, you cannot just condition on L0 because that blocks the causal path from A0 to Y, and opens a non-causal path A0 -> L0 <-> Y. This is what makes L0 a “time dependent confounder” and this is why
\sum_{L0} E[Y | L0, A0, A1] p(L0) and E[Y | L0, A0, A1] are both wrong here.
(Remember, HAART is given in two stages, A0 and A1, separated by L0).
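A minimal simulation may make this concrete. The data-generating process below is just one instantiation of the graph sketched above; the functional forms, coefficients, and the 0.6 “true effect” are my own illustrative choices, not taken from the discussion or the papers. It compares the naive E[Y | A0, A1], the L0-adjusted \sum_{L0} E[Y | L0, A0, A1] p(L0), and the g-formula estimate of the effect of giving HAART at both stages versus neither:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500_000

    # Unobserved health status: common cause of L0 and Y.
    u = rng.normal(size=n)

    # A0 is randomized, so it has no parents.
    a0 = rng.integers(0, 2, size=n)

    # L0 (observed) depends on the earlier treatment A0 and on health status U.
    l0 = (u + 0.5 * a0 + rng.normal(size=n) > 0).astype(int)

    # A1 is assigned by looking only at L0 (its only parent).
    a1 = (rng.random(size=n) < np.where(l0 == 1, 0.8, 0.2)).astype(int)

    # Y depends on the treatments and on U; true effect of setting A0=A1=1 vs A0=A1=0 is 0.6.
    y = 0.3 * a0 + 0.3 * a1 + u + rng.normal(size=n)

    def e_y(a0v, l0v, a1v):
        # E[Y | A0, L0, A1] estimated from the "observational" data.
        m = (a0 == a0v) & (l0 == l0v) & (a1 == a1v)
        return y[m].mean()

    def naive(a0v, a1v):
        # E[Y | A0, A1]: ignores L0 (and hence U) entirely.
        m = (a0 == a0v) & (a1 == a1v)
        return y[m].mean()

    def adjusted(a0v, a1v):
        # \sum_{L0} E[Y | L0, A0, A1] p(L0): ordinary covariate adjustment on L0.
        return sum(e_y(a0v, l0v, a1v) * (l0 == l0v).mean() for l0v in (0, 1))

    def g_formula(a0v, a1v):
        # \sum_{L0} E[Y | L0, A0, A1] p(L0 | A0): L0 weighted by its distribution under A0 only.
        return sum(e_y(a0v, l0v, a1v) * (l0[a0 == a0v] == l0v).mean() for l0v in (0, 1))

    for name, est in [("naive", naive), ("adjusted", adjusted), ("g-formula", g_formula)]:
        print(name, round(est(1, 1) - est(0, 0), 3))

With a large n, the g-formula line should land near the true 0.6, while the other two are pulled away from it by the unobserved health status.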
That’s right, but as I said, you cannot just condition on L0 because that blocks the causal path from A0 to Y, and opens a non-causal path A0 -> L0 <-> Y.
Okay, this isn’t actually a problem. At A1 (deciding whether to give HAART at time t=1) you condition on L0 because you’ve observed it. This means using P(outcome=Y | action=give-haart-at-A1, observations=[L0, the dataset]) which happens to be identical to P(outcome=Y | do(action=give-haart-at-A1), observations=[L0, the dataset]), since A1 has no parents apart from L0. So the decision is the same as CDT at A1.
At A0 (deciding whether to give HAART at time t=0), you haven’t measured L0, so you don’t condition on it. You use P(outcome=Y | action=give-haart-at-A0, observations=[the dataset]) which happens to be the same as P(outcome=Y | do(action=give-haart-at-A0), observations=[the dataset]) since A0 has no parents at all. The decision is the same as CDT at A0, as well.
To make this perfectly clear, what I am doing here is replacing the agents at A0 and A1 (that decide whether to administer HAART) with EDT agents with access to the aforementioned dataset and calculating what they would do. That is, “You are at A0. Decide whether to administer HAART using EDT.” and “You are at A1. You have observed L0=[...]. Decide whether to administer HAART using EDT.”. The decisions about what to do at A0 and A1 are calculated separately (though the agent at A0 will generally need to know, and therefore to first calculate, what A1 will do, so that it can calculate quantities like P(outcome=Y | action=give-haart-at-A0, observations=[the dataset])).
You may actually be thinking of “solve this problem using EDT” as “using EDT, derive the best (conditional) policy for agents at A0 and A1”, which means an EDT agent standing “outside the problem” and deciding ahead of time what A0 and A1 should do. That works somewhat differently, but happily it’s practically trivial to show that this EDT agent’s decision would be the same as CDT’s: because an agent deciding on a policy for A0 and A1 ahead of time is affected by nothing except the original dataset, which is of course its input (an observation), we have P(outcome | do(policy), observations=dataset) = P(outcome | policy, observations=dataset). In case it’s not obvious, the graph for this case is dataset -> (agent chooses policy) -> (some number of people die after assigning A0, A1 based on policy) -> outcome.
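Spelled out with the g-formula from earlier (my own gloss): for a deterministic policy that sets A0 = a0 and then sets A1 = g(L0) after L0 is observed, the quantity the policy-choosing agent ranks policies by is

    E[Y | do(policy)] = \sum_{L0} E[Y | A0=a0, L0, A1=g(L0)] p(L0 | A0=a0)

and because the agent’s choice of policy has no causes other than the dataset it conditions on, ranking policies by this interventional quantity and ranking them by the corresponding conditional expectation coincide.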