Look, HIV patients who get HAART die more often (because people who get HAART are already very sick). We don’t get to see the health status confounder because we don’t get to observe everything we want. Given this, is HAART in fact killing people, or not?
It is not that clear to me what we know about HAART in this game. For instance, in case we know nothing about it and we only observe logical equivalences (in fact rather probabilistic tendencies) in the form “HAART” <--> “Patient dies (within a specified time interval)” and “no HAART” <--> “Patient survives” it wouldn’t be irrational to reject the treatment.
Once we know more about HAART, for instance, that the probabilistic tendencies were due to unknowingly comparing sick people to healthy people, we then can figure out that P( patient survives | sick, HAART) > P (patient survives | sick, no HAART) and that P( patient survives | healthy, HAART)< P(patient survives | healthy, no HAART). Knowing that much, choosing not to give the drug would be a foolish thing to do. If we come to know that a particular reasoning R that leads to not prescribing the drug (even after the update above) is very strongly correlated with having patients that are completely healthy but show false-positive clinical test results, then not prescribing the drug would be the better thing to do. This, of course, would require that this new piece of information brings about true predictions about future cases (which makes the scenario quite unlikely, though considering the theoretical debate it might be relevant).
Generally, I think that drawing causal diagrams is a very useful heuristic in “everyday science”, since replacing the term causality with all the conditionals involved might be confusing. Maybe this is a reason why some people tend to think that evidential reasoning is defined to only consider plain conditionals (in this example P(survival| HAART)) but not more background data. Because otherwise, in effortful ways you could receive the same answer as causal reasoners do but what would be the point of imitating CDT?
I think it is exactly the other way round. It’s all about conditionals. It seems to me that a Bayesian writes down “causal connection” in his/her map after updating on sophisticated sets of correlations. It seems impossible to completely rule out confounding at any place. Since evidential reasoning would suggest not prescribing the drug in the false-positive scenario above, its output is not similar to the one conventional CDT produces. Differences between CDT and the non-naive evidential approach are described here as well: http://lesswrong.com/lw/j5j/chocolate_ice_cream_after_all/a6lh
It seems that CDT-supporters only do A if there is a causal mechanism connecting it with the desirable outcome B. An evidential reasoner would also do A if he knew that there would be no causal mechanism connecting it to B, but a true (but purely correlative) prediction stating the logical equivalences A<-->B and ~A <--> ~B.
“A set of 100 HIV patients are randomized to receive HAART at time 0. Some time passes, and their vitals are measured at time 1. Based on this measurement some patients receive HAART at time 1 (some of these received HAART at time 0, and some did not). Some more time passes, and some patients die at time 2. Some of those that die at time 2 had HAART at both times, or at one time, or at no time. You have a set of records that show you, for each patient of 100, whether they got HAART at time 0 (call this variable A0), whether they got HAART at time 1 (call this variable A1), what their vitals were at time 1 (call this variable W), and whether they died or not at time 2 (call this variable Y). A new patient comes in, from the same population as the original 100. You want to determine how much HAART to give him. That is, should {A0,A1} be set to yes,yes; yes,no; no,yes; or no,no. Your utility function rewards you for keeping patients alive. What is your decision rule for prescribing HAART for this patient?”
From the point of view of EDT, the set of records containing values of A0,W,A1,Y for 100 patients is all you get to see. (Someone using CDT would get a bit more information than this, but this isn’t relevant for EDT). I can tell you that based on the records you see, p(Y=death | A0=yes,A1=yes) is higher than p(Y=death | A0=no,A1=no). I am also happy to answer any additional questions you may have about p(A0,W,A1,Y). This is a concrete problem with a correct answer. What is it?
I don’t understand why you persist in blindly converting historical records into subjective probabilities, as though there was no inference to be done. You can’t just set p(Y=death | A0=yes,A1=yes) to the proportion of deaths in the data, because that throws away all the highly pertinent information you have about biology and the selection rule for “when was the treatment applied”. (EDIT: ignoring the covariate W would cause Simpson’s Paradox in this instance)
EDIT EDIT: Yes, P(Y = death in a randomly-selected line of the data | A0=yes,A1=yes in the same line of data) is equal to the proportion of deaths in the data, but that’s not remotely the same thing as P(this patient dies | I set A0=yes,A1=yes for this patient).
I was just pointing out that in the conditional distribution p(Y|A0,A1) derived from the empirical distribution some facts happen to hold that might be relevant. I never said what I am ignoring, I was merely posing a decision problem for EDT to solve.
The only information about biology you have is the 100 records for A0,W,A1,Y that I specified. You can’t ask for more info, because there is no more info. You have to decide with what you have.
The information about biology I was thinking of is things like “vital signs tend to be correlated with internal health” and “people with bad internal health tend to die”. Information it would be irresponsible to not use.
But anyway, the solution is to calculate P(this patient dies | I set A0=a0,A1=a1 for this patient, data) (I should have included the conditioning on data above but I forgot) by whatever statistical methods are relevant, then to do whichever option of a0,a1 gives the higher number. Straightforward.
You can approximate P(this patient dies | I set A0=a0,A1=a1 for this patient, data) with P_empirical(Y=death | do(A0=a0,A1=a1)) from the data, on the assumption that our decision process is independent of W (which is reasonable, since we don’t measure W). There are other ways to calculate P(this patient dies | I set A0=a0,A1=a1 for this patient, data), like Solomonoff induction, presumably, but who would bother with that?
I agree with you broadly, but this is not the EDT solution, is it? Show me a definition of EDT in any textbook (or Wikipedia, or anywhere) that talks about do(.).
Yes, P(Y = death in a randomly-selected line of the data | A0=yes,A1=yes in the same line of data) is
equal to the proportion of deaths in the data, but that’s not remotely the same thing as P(this patient
dies | I set A0=yes,A1=yes for this patient).
Yes, of course not. That is the point of this example! I was pointing out that facts about p(Y | A0,A1) aren’t what we want here. Figuring out the distribution that is relevant is not so easy, and cannot be done merely from knowing p(A0,W,A1,Y).
EDT uses P(this patient dies | I set A0=a0,A1=a1 for this patient, data) while CDT uses P(this patient dies | do(I set A0=a0,A1=a1 for this patient), data).
EDT doesn’t “talk about do” because P(this patient dies | I set A0=a0,A1=a1 for this patient, data) doesn’t involve do. It just happens that you can usually approximate P(this patient dies | I set A0=a0,A1=a1 for this patient, data) by using do (because the conditions for your personal actions are independent of whatever the conditions for the treatment in the data were).
Let me be clear: the use of do I describe here is not part of the definition of EDT. It is simply an epistemic “trick” for calculating P(this patient dies | I set A0=a0,A1=a1 for this patient, data), and would be correct even if you just wanted to know the probability, without intending to apply any particular decision theory or take any action at all.
Also, CDT can seem a bit magical, because when you use P(this patient dies | do(I set A0=a0,A1=a1 for this patient), data), you can blindly set the causal graph for your personal decision to the empirical causal graph for your data set, because the do operator gets rid of all the (factually incorrect) correlations between your action and variables like W.
Criticisms section in the Wikipedia article on EDT:
David Lewis has characterized evidential decision theory as promoting “an irrational policy of managing the news”.[2] James M. Joyce asserted, “Rational agents choose acts on the basis of their causal efficacy, not their auspiciousness; they act to bring about good results even when doing so might betoken bad news.”[3]
Where in the Wikipedia EDT article is the reference to “I set”? Or in any textbook? Where are you getting your EDT procedure from? Can you show me a reference? EDT is about conditional expectations, not about “I set.”
One last question: what is P(this patient dies | I set A0=a0,A1=a1 for this patient, data) as a function of P(Y,A0,W,A1)? If you say “whatever p_empirical(Y | do(A0,A1)) is”, then you are a causal decision theorist, by definition.
I don’t strongly recall when I last read a textbook on decision theory, but I remember that it described agents using probabilities about the choices available in their own personal situation, not distributions describing historical data.
Pragmatically, when you build a robot to carry out actions according to some decision theory, the process is centered around the robot knowing where it is in the world, and making decisions with the awareness that it is making the decisions, not someone else. The only actions you have to choose are “I do this” or “I do that”.
I would submit that a CDT robot makes decisions on the basis of P(outcome | do(I do this or that), sensor data) while a hypothetical EDT robot would make decisions based on P(outcome | I do this or that, sensor data). How P(outcome | I do this or that, sensor data) is computed is a matter of personal epistemic taste, and nothing for a decision theory to have any say about.
(It might be argued that I am steel-manning the normal description of EDT, since most people talking about it seem to make the error of blindly using distributions describing historical data as P(outcome | I do this or that, sensor data), to the point where that got incorporated into the definition. In which case maybe I should be writing about my “new” alternative to CDT in philosophy journals.)
I think you steel-manned EDT so well, that you transformed it into CDT, which is a fairly reasonable decision theory in a world without counterfactually linked decisions.
I mean Pearl invented/popularized do(.) in the 1990s sometime. What do you suppose EDT did before do(.) was invented? Saying “ah, p(y | do(x)) is what we meant all along” after someone does the hard work to invent the theory for p(y | do(x)) doesn’t get you any points!
I disagree. The calculation of P(outcome | I do this or that, sensor data) does not require any use of do when there are no confounding covariates, and in the case of problems such as Newcomb’s, you get a different answer to CDT’s P(outcome | do(I do this or that), sensor data) — the CDT solution throws away the information about Omega’s prediction.
CDT isn’t a catch-all term for “any calculation that might sometimes involve use of do”, it’s a specific decision theory that requires you to use P(outcome | do(action), data) for each of the available actions, whether or not that throws away useful information about correlations between yourself and stuff in the past.
EDIT: Obviously, before do() was invented, if you were using EDT you would do what everyone else would do: throw up your hands and say “I can’t calculate P(outcome | I do this or that, sensor data); I don’t know how to deal with these covariates!”. Unless there weren’t any, in which case you just go ahead and estimate your P from the data. I’ve already explained that the use of do() is only an inference tool.
I think you still don’t get it. The word “confounder” is causal. In order to define what a “confounding covariate” means, vs a “non-confounding covariate”, you need to already have a causal model. I have a paper in Annals on this topic with someone, actually, because it is not so simple.
So the very statement of “EDT is fine without confounders” doesn’t even make sense within the EDT framework. EDT uses the framework of “probability theory.” Only statements expressible within probability theory are allowed. Personally, I think it is in very poor taste to silently adopt all the nice machinery causal folks have developed, but not acknowledge that the ontological character of the resulting decision theory is completely different from the terrible state it was before.
Incidentally, the reason CDT fails on Newcomb, etc. is the same: it lacks a language powerful enough to talk about counterfactually linked decisions, similarly to how EDT lacks the language to talk about confounding. Note: this is an ontological issue, not an algorithmic issue. That is, it’s not that EDT doesn’t handle confounders properly, it’s that it doesn’t even have confounders in its universe of discourse. Similarly, CDT only has standard non-linked interventions, and so has no way to even talk about Newcomb’s problem.
The right answer here is to extend the language of CDT (which is what TDT et al. essentially do).
I’m aware that the “confounding covariates” is a causal notion. CDT does not have a monopoly on certain kinds of mathematics. That would be like saying “you’re not allowed to use the Pythagorean theorem when you calculate your probabilities, this is EDT, not Pythagorean Decision Theory”.
Do you disagree with my statement that EDT uses P(outcome | I do X, data) while CDT uses P(outcome | do(I do X), data)? If so, where?
So the very statement of “EDT is fine without confounders” doesn’t even make sense within the EDT framework. EDT uses the framework of “probability theory.”
Are you saying it’s impossible to write a paper that uses causal analysis to answer the purely epistemic question of whether a certain drug has an effect on cancer, without invoking causal decision theory, even if you have no intention of making an “intervention”, and don’t write down a utility function at any point?
I am simultaneously having a conversation with someone who doesn’t see why interventions cannot be modeled using conditional probabilities, and someone who doesn’t see why evidential decision theory can’t just use interventions for calculating what the right thing to do is.
Let it never be said that LW has a groupthink problem!
CDT does not have a monopoly on certain kinds of mathematics.
Yes, actually it does. If you use causal calculus, you are either using CDT or an extension of CDT. That’s what CDT means.
P(outcome | I do X, data)
I don’t know what the event “I do X” is for you. If it satisfies the standard axioms of do(x) (consistency, effectiveness, etc.) then you are just using a different syntax for causal decision theory. If it doesn’t satisfy the standard axioms of do(x) it will give the wrong answers.
Are you saying it’s impossible to write a paper that uses causal analysis to answer the purely epistemic
question of whether a certain drug has an effect on cancer
Papers on effects of treatments in medicine are either almost universally written using Neyman’s potential outcome framework (which is just another syntax for do(.)), or they don’t bother with special causal syntax because they did an RCT directly (in which case a standard statistical model has a causal interpretation).
“I do X” literally means the event where the agent (the one deciding upon a decision) takes the action X. I say “I do X” to distinguish this from “some agent in a data set did X”, because even without talking about causality, these are obviously different things.
The way you are talking about axioms and treating X as a fundamental entity suggests our disagreement is about the domain on which probability is being applied here. You seem to be conceiving of everything as referring to the empirical causal graph inferred from the data, in which case “X” can be considered to be synonymous to “an agent in the dataset did X”.
“Reflective” decision theories like TDT, and my favoured interpretation of EDT require you to be able to talk about the agent itself, and infer a causal graph (although EDT, being “evidential”, doesn’t really need a causal graph, only a probability distribution) describing the causes and consequences of the agent taking their action. The inferred causal graph need not have any straightforward connection to the empirical distribution of the dataset. Hence my talk of P as opposed to P_empirical.
So, to summarize, “I do X” is not an operator, causal or otherwise, applied to the event X in the empirical causal graph. It is an event in an entirely separate causal graph describing the agent. Does that make sense?
(a) You are actually using causal graphs. Show me a single accepted definition of evidential decision theory that allows you to do that (or more precisely that defines, as a part of its decision rule definition, what a causal graph is).
(b) You have to somehow be able to make decisions in the real world. What sort of data do you need to be able to apply your decision rule, and what is the algorithm that gives you your rule given this data?
(a) Well, you don’t really need a causal graph; a probability distribution for the agent’s situation will do. Although it might be convenient to represent it as a causal graph. Where I have described the use of causal graphs above, they are merely a component of the reasoning used to infer your probability distribution within probability theory.
(b) That is, a set of hypotheses you might consider would include G = “the phenomenon I am looking at behaves in a manner described by graph G”. Then you calculate the posterior probability P(G | data) × the joint distribution over the variables of the agent’s situation given G, and integrate over G to get the posterior distribution for the agent’s situation.
Given that, you decide what to do based on expected utility with P(outcome | action, data). Obviously, the above calculation is highly nontrivial. In principle you could just use some universal prior (i.e. Solomonoff induction) to calculate the posterior distribution for the agent instead, but that’s even less practical.
In practice you can often approximate this whole process fairly well by assuming the only difference between our situation and the data to be that our decision is uncorrelated with whatever decision procedure was used in the data, and treating it as an “intervention” (which I think might correspond to just using the most likely G, and ignoring all other hypotheses).
(a) Well, you don’t really need a causal graph; a probability distribution for the agent’s situation will do.
Although it might be convenient to represent it as a causal graph. Where I have described the use of causal
graphs above, they are merely a component of the reasoning used to infer your probability distribution
within probability theory.
Well, you have two problems here. The first (and bigger) problem is you are committing an ontological error (probability distributions are not about causality but about uncertainty. It doesn’t matter if you are B or F about it). The second (smaller, but still significant) problem is that probability distributions by themselves do not contain the information that you want. In other words, you don’t get identifiability of the causal effect in general if all you are given is a probability distribution. To use a metaphor Judea likes to use, if you have a complete surface description of how light reflection works on an item (say a cup), you can construct a computer graphics engine that can render the cup from any angle. But there is no information on how the cup is to be rendered under deformation (that is, if I smash the cup on the table, what will it look like?)
Observed joint probability distributions—surface information, interventional distributions—information after deformations. It might be informative to consider how your (Bayesian?) procedure would work in the cup example. The analogy is almost exact, the set of interventional densities is a much bigger set than the set of observed joint distributions.
I would be very interested in what you think the right decision rule is for my 5 node HAART example.
In my example you don’t have to average over possible graphs, because my hypothetical is that we know what the correct graph is (and what the correct corresponding distribution is).
Presumably your answer will take the form of either [decision rule given some joint probability distribution that does not mention any causal language] or “not enough information for an answer.”
If your answer is the latter, your decision theory is not very good. If the former (and by some miracle the decision rule gives the right answer), I would be very interested in a (top level?) post that works out how you recover the correct properties of causal graphs from just probability distributions. If correct, you could easily publish this in any top statistics journal and revolutionize the field. My intuition is that 100 years of statistics is not in fact wrong, and as you start dealing with more and more complex problems (I can generate an inexhaustible list of these), there will come up a lot of “gotchas” that causal folks already dealt with. In order to deal with these “gotchas” you will have to modify and modify your proposal until effectively you just reinvent intervention calculus.
I would be very interested in what you think the right decision rule is for my 5 node HAART example. In my example you don’t have to average over possible graphs, because my hypothetical is that we know what the correct graph is (and what the correct corresponding distribution is).
Your graph describes the data generation stochastic process. The agent needs a different one to model the situation it is facing. If it uses the right graph (or more generally, the right joint probability distribution, which doesn’t have to be factorizable), then it will get the right answer.
How to go from a set of data and a model of the data generation process to a model of the agent situation process is, of course, a non trivial problem, but it is not part of the agent decision problem.
Ok, these problems I am posing are not abstract, they are concrete problems in medical decision making. In light of http://lesswrong.com/lw/jco/examples_in_mathematics/, I am going to pose 4 of them, right here, then tell you what the right answer is, and what assumptions I used to get this answer. Whatever decision theory you are using needs to be able to correctly represent and solve these problems, using at most the information that my solutions use, or it is not a very good decision theory (in the sense that there exist known alternatives that do solve these problems correctly). In all problems our utility penalizes patient deaths, or patients getting a disease.
In particular, if you are a user of EDT, you need to give me a concrete algorithm that correctly solves all these problems, without using any causal vocabulary. You can’t just beg off on how it’s a “non trivial problem.” These are decision problems, and there exist solutions for these problems now, using CDT! Can EDT solve them or not? I have yet to see anyone try to seriously engage these (with the notable exception of Paul, who to his credit did try to give a Bayesian/non-causal account of problem 3, but ran out of time).
Note: I am assuming the correct graph, and lots of samples, so the effect of a prior is not significant (and thus talk about empirical frequencies, e.g. p(c)). If we wanted to, we could do a Bayesian problem over possible causal graphs, with a prior, and/or a Bayesian problem for estimation, where we could talk about, for example, the posterior distribution of case histories C. I skipped all that to simplify the examples.
Problem 1:
We perform a randomized control trial for a drug (half the patients get the drug, half the patients do not). Some of the patients die (in both groups). Let A be a random variable representing whether the patient in our RCT dataset got the drug, and Y be a random variable representing whether the patient in our RCT dataset died. A new patient comes in which is from the same cohort as those in our RCT. Should we give them the drug?
Solution: Give the drug if and only if E[Y = yes | A = yes] < E[Y = yes | A = no].
Intuition for why this is correct: since we randomized the drug, there are no possible confounders between drug use and death. So any dependence between the drug and death is causal. So we can just look at conditional correlations.
Assumptions used: we need the empirical p(A,Y) from our RCT, and the assumption that the correct causal graph is A → Y. No other assumptions needed.
Ideas here: you should be able to transfer information you learn from observed members of a group to other members of the same group. Otherwise, what is stats even doing?
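For concreteness, here is a minimal Python sketch of the Problem 1 rule, estimating the two conditional death rates from RCT records. The function name `rct_decision`, the record layout (pairs of 0/1 values) and the toy counts are all made up for illustration; they are not from any real trial.

```python
def rct_decision(records):
    """records: iterable of (a, y) pairs, a = 1 if treated, y = 1 if died.

    Returns (give_drug, risk_treated, risk_untreated), where the risks are the
    empirical estimates of E[Y = yes | A = yes] and E[Y = yes | A = no].
    Assumes both arms are non-empty.
    """
    counts = {0: 0, 1: 0}
    deaths = {0: 0, 1: 0}
    for a, y in records:
        counts[a] += 1
        deaths[a] += y
    risk_treated = deaths[1] / counts[1]
    risk_untreated = deaths[0] / counts[0]
    # Randomization is what licenses reading this comparison causally.
    return risk_treated < risk_untreated, risk_treated, risk_untreated


# Hypothetical RCT: 50 treated (10 deaths), 50 untreated (20 deaths).
toy_rct = [(1, 1)] * 10 + [(1, 0)] * 40 + [(0, 1)] * 20 + [(0, 0)] * 30
give_drug, r1, r0 = rct_decision(toy_rct)
print(give_drug, r1, r0)
```

On these invented counts the treated arm has the lower death rate, so the rule says to give the drug.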
Problem 2:
We perform an observational study, where doctors assign (or not) a drug based on observed patient vitals recorded in their case history file. Some of the patients die. Let A be a random variable representing whether the patient in our dataset from our study got the drug, Y be a random variable representing whether the patient in our study died, and C be the random variable representing the patient vitals used by the doctors to decide whether to give the drug or not. A new patient comes in which is from the same cohort as those in our study. If we do not get any additional information on this patient, should we give them the drug?
Solution: Give the drug if and only if \sum_{c} E[Y = yes | A = yes, c] p(c) < \sum_{c} E[Y = yes | A = no, c] p(c)
Intuition for why this is correct: we have not randomized the drug, but we recorded all the info doctors used to decide on whether to give the drug. Since case history C represents all possible confounding between A and Y, conditional on knowing C, any dependence between A and Y is causal. In other words, E[Y | A, C] gives a causal dependence of A and Y, conditional on C. But since we are not allowed to measure anything about the incoming patient, we have to average over the possible case histories the patient might have. Since the patient is postulated to have come from the same dataset as those in our study, it is reasonable to average over the observed case histories in our study. This recovers the above formula.
Assumptions used: we need the empirical p(A,C,Y) from our study, and the assumption that the correct causal graph is C → A → Y, C → Y. No other assumptions needed.
Ideas here: this is isomorphic to the smoking lesion problem. The idea here is you can’t use observed correlations if there are confounders, you have to adjust for confounders properly using the g-formula (the formula in the answer).
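A small Python sketch of the Problem 2 adjustment, next to the naive conditional it replaces. The names `adjusted_risk`, `naive_risk` and `make`, the (c, a, y) record layout and the counts are assumptions for illustration only; the counts were picked so that sicker patients (c = 1) get the drug more often.

```python
from collections import Counter


def adjusted_risk(records, a):
    """g-formula estimate: sum_c E[Y=1 | A=a, C=c] * p(c).

    records: list of (c, a, y) tuples. Assumes every (c, a) cell is
    non-empty (positivity); otherwise the division below fails.
    """
    n = len(records)
    n_c = Counter(c for c, _, _ in records)
    n_ca = Counter((c, ai) for c, ai, _ in records)
    d_ca = Counter((c, ai) for c, ai, y in records if y == 1)
    return sum(d_ca[(c, a)] / n_ca[(c, a)] * n_c[c] / n for c in n_c)


def naive_risk(records, a):
    """Unadjusted estimate of E[Y=1 | A=a], ignoring C."""
    ys = [y for _, ai, y in records if ai == a]
    return sum(ys) / len(ys)


def make(c, a, n, deaths):
    """Build n hypothetical records for cell (c, a) with the given deaths."""
    return [(c, a, 1)] * deaths + [(c, a, 0)] * (n - deaths)


# Made-up study in which doctors mostly treat the sick (c = 1).
toy_obs = (make(1, 1, 80, 40) + make(1, 0, 20, 14)
           + make(0, 1, 20, 2) + make(0, 0, 80, 16))

print("naive:   ", naive_risk(toy_obs, 1), naive_risk(toy_obs, 0))
print("adjusted:", adjusted_risk(toy_obs, 1), adjusted_risk(toy_obs, 0))
```

With these counts the naive comparison makes the drug look harmful (0.42 vs 0.30), while the adjusted comparison favours giving it (about 0.30 vs 0.45). That reversal is the point of adjusting for C.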
Problem 3:
We perform a partially randomized and partially observational longitudinal study, where patients are randomly assigned (or not) a drug at time 0, then their vitals at time 1 are recorded in a file, and based on those vitals, and the treatment assignment history at time 0, doctors may (or not), decide to give them more of the drug. Afterwards, at time 2, some patients die (or not). Let A0 be a random variable representing whether the patient in our dataset from our study got the drug at time 0, A1 be a random variable representing whether the patient in our dataset from our study got the drug at time 1, Y be a random variable representing whether the patient in our study died, and C be the random variable representing the case history used by the doctors to decide whether to give the drug or not at time 1. A new patient comes in which is from the same cohort as those in our study. If we do not get any additional information on this patient, should we give them the drug, and if so at what time points?
Solution: Use the drug assignment policy (a0,a1) that minimizes \sum_{c} E[Y = yes | A1 = a1, c, A0 = a0] p(c | A0 = a0).
Intuition for why this is correct: we have randomized A0, but have not randomized A1, and we are interested in the joint effect of both A0 and A1 on Y. We know C is a confounder for A1, so we have to adjust for it somehow as in Problem 2, otherwise an observed dependence of A1 and Y will contain a non-causal component through C. However, C is not a confounder for the relationship of A0 and Y. Conditional on A0 and C, the relationship between A1 and Y is entirely causal, so E[Y | A1, C, A0] is a causal quantity. However, for the incoming patient, we are not allowed to measure C, so we have to average over C as before in problem 2. However, in our case C is an effect of A0, which means we can’t just average the base rates for case histories, we have to take into account what happened at time 0, in other words the causal effect of A0 on C. Because in our graph, there are no confounders between A0 and C, the causal relationship can be represented by p(C | A0) (no confounders means correlation equals causation). Since A0 also has no confounders for Y, E[Y | A1, C, A0], weighted by p(C | A0) gives us the right causal relationship between {A0,A1} and Y.
Assumptions used: we need the empirical p(A0,C,A1,Y) from our study, and the assumption that the correct causal graph is A0 → C → A1 → Y, A0 → A1, A0 → Y, and we possibly allow that there is an unrestricted hidden variable U that is a parent of both C and Y. No other assumptions needed.
Ideas here: simply knowing that you have confounders is not enough, you have to pay attention to the precise causal relationships to figure out what the right thing to do is. In this case, C is a ‘time-varying confounder,’ and requires a more complicated adjustment that takes into account that the confounder is also an effect of an earlier treatment.
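The Problem 3 rule can be sketched the same way, assuming records of the form (a0, c, a1, y) with binary values. The names `g_formula_risk`, `best_policy` and `make4`, and all the counts, are hypothetical. The only change from Problem 2 is that C is weighted by p(c | A0 = a0) rather than by its marginal.

```python
from collections import Counter
from itertools import product


def g_formula_risk(records, a0, a1):
    """Estimate sum_c E[Y=1 | A0=a0, C=c, A1=a1] * p(c | A0=a0).

    records: list of (a0, c, a1, y) tuples. Assumes every (a0, c, a1)
    cell is non-empty (positivity).
    """
    n_a0 = sum(1 for r in records if r[0] == a0)
    n_c = Counter(r[1] for r in records if r[0] == a0)
    n_cell = Counter(r[1] for r in records if r[0] == a0 and r[2] == a1)
    d_cell = Counter(r[1] for r in records
                     if r[0] == a0 and r[2] == a1 and r[3] == 1)
    return sum(d_cell[c] / n_cell[c] * n_c[c] / n_a0 for c in n_c)


def best_policy(records):
    """Return the (a0, a1) pair with the lowest estimated death risk."""
    return min(product([0, 1], repeat=2),
               key=lambda p: g_formula_risk(records, *p))


def make4(a0, c, a1, n, deaths):
    """Build n hypothetical records for cell (a0, c, a1)."""
    return [(a0, c, a1, 1)] * deaths + [(a0, c, a1, 0)] * (n - deaths)


# Made-up longitudinal study; c = 1 means bad vitals at time 1.
toy_long = (make4(1, 1, 1, 20, 6) + make4(1, 1, 0, 10, 5)
            + make4(1, 0, 1, 20, 2) + make4(1, 0, 0, 50, 5)
            + make4(0, 1, 1, 40, 16) + make4(0, 1, 0, 20, 12)
            + make4(0, 0, 1, 10, 2) + make4(0, 0, 0, 30, 6))

for policy in product([0, 1], repeat=2):
    print(policy, g_formula_risk(toy_long, *policy))
print("best:", best_policy(toy_long))
```

On this toy data the estimated risks come out to roughly 0.16 for (yes, yes), 0.22 for (yes, no), 0.32 for (no, yes) and 0.44 for (no, no), so the selected policy is to treat at both times.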
Problem 4:
We consider a (hypothetical) observational study of coprophagic treatment of stomach cancer. It is known (for the purposes of this hypothetical example) that coprophagia’s protective effect vs cancer is due to the presence of certain types of intestinal flora in feces. At the same time, people who engage in coprophagic behavior naturally are not a random sampling of the population, and therefore may be more likely than average to end up with stomach cancer. Let A be a random variable representing whether those in our study engaged in coprophagic behavior, let W be the random variable representing the presence of beneficial intestinal flora, let Y be the random variable representing the presence of stomach cancer, and let U be some unrestricted hidden variable which may influence both coprophagia and stomach cancer. A new patient at risk for stomach cancer comes in which is from the same cohort as those in our study. If we do not get any additional information on this patient, should we give them the coprophagic treatment as a preventative measure?
Solution: Yes, if and only if \sum_{w} p(W = w | A = yes) \sum_{a} E[Y = yes | W = w, A = a] p(A = a) < \sum_{w} p(W = w | A = no) \sum_{a} E[Y = yes | W = w, A = a] p(A = a)
Intuition for why this is correct: since W is independent of confounders for A and Y, and A only affects Y through W, the effect of A on Y decomposes/factorizes into an effect of A on W, and an effect of W on Y, averaged over possible values W could take. The effect of A on W is not confounded by anything, and so is equal to p(W | A). The effect of W on Y is confounded by A, but given our assumptions, conditioning on A is sufficient to remove all confounding for the effect, which gives us \sum_{a} p(Y | W, A = a) p(A = a). This gives the above formula.
Assumptions used: we need the empirical p(A,W,Y) from our study, and the assumption that the correct causal graph is A → W → Y, and there is an unrestricted hidden variable U that is a parent of both A and Y. No other assumptions needed.
Ideas here: sometimes your independences let you factorize effects into other effects, similarly to how Bayesian networks factorize. This lets you solve problems that might seem unsolvable due to the presence of unobserved confounding.
The first (and bigger) problem is you are committing an ontological error (probability distributions are not about causality but about uncertainty. It doesn’t matter if you are B or F about it).
I don’t know what you mean by this. Probability distributions can be about whatever you want — it makes perfect sense to speak of “the probability that the cause of X is Y, given some evidence”.
It is not that clear to me what we know about HAART in this game. For instance, in case we know nothing about it and we only observe logical equivalences (in fact rather probabilistic tendencies) in the form “HAART” <--> “Patient dies (within a specified time interval)” and “no HAART” <--> “Patient survives” it wouldn’t be irrational to reject the treatment.
Once we know more about HAART, for instance, that the probabilistic tendencies were due to unknowingly comparing sick people to healthy people, we then can figure out that P( patient survives | sick, HAART) > P (patient survives | sick, no HAART) and that P( patient survives | healthy, HAART)< P(patient survives | healthy, no HAART). Knowing that much, choosing not to give the drug would be a foolish thing to do.
If we come to know that a particular reasoning R leads to not prescribing the drug (even after the update above) is very strongly correlated with having patients that are completely healthy but show false-positive clinical test results, then not prescribing the drug would be the better thing to do. This, of course, would require that this new piece of information brings about true predictions about future cases (which makes the scenario quite unlikely, though considering the theoretical debate it might be relevant).
Generally, I think that drawing causal diagrams is a very useful heuristic in “everyday science”, since replacing the term causality with all the conditionals involved might be confusing. Maybe this is a reason why some people tend to think that evidential reasoning is defined to only consider plain conditionals (in this example P(survival| HAART)) but not more background data. Because otherwise, in effortful ways you could receive the same answer as causal reasoners do but what would be the point of imitating CDT?
I think it is exactly the other way round. It’s all about conditionals. It seems to me that a Bayesian writes down “causal connection” in his/her map after updating on sophisticated sets of correlations. It seems impossible to completely rule out confounding at any place. Since evidential reasoning would suggest not to prescribe the drug in the false-positive scenario above, its output is not similar to the one conventional CDT produces. Differences between CDT and the non-naive evidential approach are described here as well: http://lesswrong.com/lw/j5j/chocolate_ice_cream_after_all/a6lh
It seems that CDT-supporters only do A if there is a causal mechanism connecting it with the desirable outcome B. An evidential reasoner would also do A if he knew that there would be no causal mechanism connecting it to B, but a true (but purely correlative) prediction stating the logical equivalences A<-->B and ~A <--> ~B.
Ok. So what is your answer to this problem:
“A set of 100 HIV patients are randomized to receive HAART at time 0. Some time passes, and their vitals are measured at time 1. Based on this measurement some patients receive HAART at time 1 (some of these received HAART at time 0, and some did not). Some more time passes, and some patients die at time 2. Some of those that die at time 2 had HAART at both times, or at one time, or at no time. You have a set of records that show you, for each patient of 100, whether they got HAART at time 0 (call this variable A0), whether they got HAART at time 1 (call this variable A1), what their vitals were at time 1 (call this variable W), and whether they died or not at time 2 (call this variable Y). A new patient comes in, from the same population as the original 100. You want to determine how much HAART to give him. That is, should {A0,A1} be set to yes,yes; yes,no; no,yes; or no,no. Your utility function rewards you for keeping patients alive. What is your decision rule for prescribing HAART for this patient?”
From the point of view of EDT, the set of records containing values of A0,W,A1,Y for 100 patients is all you get to see. (Someone using CDT would get a bit more information than this, but this isn’t relevant for EDT). I can tell you that based on the records you see, p(Y=death | A0=yes,A1=yes) is higher than p(Y=death | A0=no,A1=no). I am also happy to answer any additional questions you may have about p(A0,W,A1,Y). This is a concrete problem with a correct answer. What is it?
I don’t understand why you persist in blindly converting historical records into subjective probabilities, as though there was no inference to be done. You can’t just set p(Y=death | A0=yes,A1=yes) to the proportion of deaths in the data, because that throws away all the highly pertinent information you have about biology and the selection rule for “when was the treatment applied”. (EDIT: ignoring the covariate W would cause Simpson’s Paradox in this instance)
EDIT EDIT: Yes, P(Y = death in a randomly-selected line of the data | A0=yes,A1=yes in the same line of data) is equal to the proportion of deaths in the data, but that’s not remotely the same thing as P(this patient dies | I set A0=yes,A1=yes for this patient).
I was just pointing out that in the conditional distribution p(Y|A0,A1) derived from the empirical distribution some facts happen to hold that might be relevant. I never said what I am ignoring, I was merely posing a decision problem for EDT to solve.
The only information about biology you have is the 100 records for A0,W,A1,Y that I specified. You can’t ask for more info, because there is no more info. You have to decide with what you have.
The information about biology I was thinking of is things like “vital signs tend to be correlated with internal health” and “people with bad internal health tend to die”. Information it would be irresponsible to not use.
But anyway, the solution is to calculate P(this patient dies | I set A0=a0,A1=a1 for this patient, data) (I should have included the conditioning on data above but I forgot) by whatever statistical methods are relevant, then to do whichever option of a0,a1 gives the lower number. Straightforward.
You can approximate P(this patient dies | I set A0=a0,A1=a1 for this patient, data) with P_empirical(Y=death | do(A0=a0,A1=a1)) from the data, on the assumption that our decision process is independent of W (which is reasonable, since we don’t measure W). There are other ways to calculate P(this patient dies | I set A0=a0,A1=a1 for this patient, data), like Solomonoff induction, presumably, but who would bother with that?
I agree with you broadly, but this is not the EDT solution, is it? Show me a definition of EDT in any textbook (or Wikipedia, or anywhere) that talks about do(.).
Yes, of course not. That is the point of this example! I was pointing out that facts about p(Y | A0,A1) aren’t what we want here. Figuring out the distribution that is relevant is not so easy, and cannot be done merely from knowing p(A0,W,A1,Y).
No, this is the EDT solution.
EDT uses P(this patient dies | I set A0=a0,A1=a1 for this patient, data) while CDT uses P(this patient dies | do(I set A0=a0,A1=a1 for this patient), data).
EDT doesn’t “talk about do” because P(this patient dies | I set A0=a0,A1=a1 for this patient, data) doesn’t involve do. It just happens that you can usually approximate P(this patient dies | I set A0=a0,A1=a1 for this patient, data) by using do (because the conditions for your personal actions are independent of whatever the conditions for the treatment in the data were).
Let me be clear: the use of do I describe here is not part of the definition of EDT. It is simply an epistemic “trick” for calculating P(this patient dies | I set A0=a0,A1=a1 for this patient, data), and would be correct even if you just wanted to know the probability, without intending to apply any particular decision theory or take any action at all.
Also, CDT can seem a bit magical, because when you use P(this patient dies | do(I set A0=a0,A1=a1 for this patient), data), you can blindly set the causal graph for your personal decision to the empirical causal graph for your data set, because the do operator gets rid of all the (factually incorrect) correlations between your action and variables like W.
[ I did not downvote, btw. ]
Criticisms section in the Wikipedia article on EDT:
David Lewis has characterized evidential decision theory as promoting “an irrational policy of managing the news”.[2] James M. Joyce asserted, “Rational agents choose acts on the basis of their causal efficacy, not their auspiciousness; they act to bring about good results even when doing so might betoken bad news.”[3]
Where in the Wikipedia EDT article is the reference to “I set”? Or in any textbook? Where are you getting your EDT procedure from? Can you show me a reference? EDT is about conditional expectations, not about “I set.”
One last question: what is P(this patient dies | I set A0=a0,A1=a1 for this patient, data) as a function of P(Y,A0,W,A1)? If you say “whatever p_empirical(Y | do(A0,A1)) is”, then you are a causal decision theorist, by definition.
I don’t strongly recall when I last read a textbook on decision theory, but I remember that it described agents using probabilities about the choices available in their own personal situation, not distributions describing historical data.
Pragmatically, when you build a robot to carry out actions according to some decision theory, the process is centered around the robot knowing where it is in the world, and making decisions with the awareness that it is making the decisions, not someone else. The only actions you have to choose are “I do this” or “I do that”.
I would submit that a CDT robot makes decisions on the basis of P(outcome | do(I do this or that), sensor data) while a hypothetical EDT robot would make decisions based on P(outcome | I do this or that, sensor data). How P(outcome | I do this or that, sensor data) is computed is a matter of personal epistemic taste, and nothing for a decision theory to have any say about.
(It might be argued that I am steel-manning the normal description of EDT, since most people talking about it seem to make the error of blindly using distributions describing historical data as P(outcome | I do this or that, sensor data), to the point where that got incorporated into the definition. In which case maybe I should be writing about my “new” alternative to CDT in philosophy journals.)
I think you steel-manned EDT so well, that you transformed it into CDT, which is a fairly reasonable decision theory in a world without counterfactually linked decisions.
I mean Pearl invented/popularized do(.) in the 1990s sometime. What do you suppose EDT did before do(.) was invented? Saying “ah, p(y | do(x)) is what we meant all along” after someone does the hard work to invent the theory for p(y | do(x)) doesn’t get you any points!
I disagree. The calculation of P(outcome | I do this or that, sensor data) does not require any use of do when there are no confounding covariates, and in the case of problems such as Newcomb’s, you get a different answer to CDT’s P(outcome | do(I do this or that), sensor data); the CDT solution throws away the information about Omega’s prediction.
CDT isn’t a catch-all term for “any calculation that might sometimes involve use of do”, it’s a specific decision theory that requires you to use P(outcome | do(action), data) for each of the available actions, whether or not that throws away useful information about correlations between yourself and stuff in the past.
EDIT: Obviously, before do() was invented, if you were using EDT you would do what everyone else would do: throw up your hands and say “I can’t calculate P(outcome | I do this or that, sensor data); I don’t know how to deal with these covariates!”. Unless there weren’t any, in which case you just go ahead and estimate your P from the data. I’ve already explained that the use of do() is only an inference tool.
I think you still don’t get it. The word “confounder” is causal. In order to define what a “confounding covariate” means, vs a “non-confounding covariate” you need to already have a causal model. I have a paper in Annals on this topic with someone, actually, because it is not so simple.
So the very statement of “EDT is fine without confounders” doesn’t even make sense within the EDT framework. EDT uses the framework of “probability theory.” Only statements expressible within probability theory are allowed. Personally, I think it is in very poor taste to silently adopt all the nice machinery causal folks have developed, but not acknowledge that the ontological character of the resulting decision theory is completely different from the terrible state it was before.
Incidentally the reason CDT fails on Newcomb, etc. is the same: it lacks the language powerful enough to talk about counterfactually linked decisions, similarly to how EDT lacks the language to talk about confounding. Note: this is an ontological issue, not an algorithmic issue. That is, it’s not that EDT doesn’t handle confounders properly, it’s that it doesn’t even have confounders in its universe of discourse. Similarly, CDT only has standard non-linked interventions, and so has no way to even talk about Newcomb’s problem.
The right answer here is to extend the language of CDT (which is what TDT et al essentially do).
I’m aware that “confounding covariates” is a causal notion. CDT does not have a monopoly on certain kinds of mathematics. That would be like saying “you’re not allowed to use the Pythagorean theorem when you calculate your probabilities, this is EDT, not Pythagorean Decision Theory”.
Do you disagree with my statement that EDT uses P(outcome | I do X, data) while CDT uses P(outcome | do(I do X), data)? If so, where?
Are you saying it’s impossible to write a paper that uses causal analysis to answer the purely epistemic question of whether a certain drug has an effect on cancer, without invoking causal decision theory, even if you have no intention of making an “intervention”, and don’t write down a utility function at any point?
I am simultaneously having a conversation with someone who doesn’t see why interventions cannot be modeled using conditional probabilities, and someone who doesn’t see why evidential decision theory can’t just use interventions for calculating what the right thing to do is.
Let it never be said that LW has a groupthink problem!
Yes, actually it does. If you use causal calculus, you are either using CDT or an extension of CDT. That’s what CDT means.
I don’t know what the event “I do X” is for you. If it satisfies the standard axioms of do(x) (consistency, effectiveness, etc.) then you are just using a different syntax for causal decision theory. If it doesn’t satisfy the standard axioms of do(x) it will give the wrong answers.
Papers on effects of treatments in medicine are either almost universally written using Neyman’s potential outcome framework (which is just another syntax for do(.)), or they don’t bother with special causal syntax because they did an RCT directly (in which case a standard statistical model has a causal interpretation).
Couldn’t you just be using some trivial decision theory that uses do() in a stupid way and doesn’t extend CDT?
“I do X” literally means the event where the agent (the one deciding upon a decision) takes the action X. I say “I do X” to distinguish this from “some agent in a data set did X”, because even without talking about causality, these are obviously different things.
The way you are talking about axioms and treating X as a fundamental entity suggests our disagreement is about the domain on which probability is being applied here. You seem to be conceiving of everything as referring to the empirical causal graph inferred from the data, in which case “X” can be considered to be synonymous to “an agent in the dataset did X”.
“Reflective” decision theories like TDT, and my favoured interpretation of EDT, require you to be able to talk about the agent itself, and infer a causal graph (although EDT, being “evidential”, doesn’t really need a causal graph, only a probability distribution) describing the causes and consequences of the agent taking their action. The inferred causal graph need not have any straightforward connection to the empirical distribution of the dataset. Hence my talk of P as opposed to P_empirical.
So, to summarize, “I do X” is not an operator, causal or otherwise, applied to the event X in the empirical causal graph. It is an event in an entirely separate causal graph describing the agent. Does that make sense?
Fine, but then:
(a) You are actually using causal graphs. Show me a single accepted definition of evidential decision theory that allows you to do that (or more precisely that defines, as a part of its decision rule definition, what a causal graph is).
(b) You have to somehow be able to make decisions in the real world. What sort of data do you need to be able to apply your decision rule, and what is the algorithm that gives you your rule given this data?
(a) Well, you don’t really need a causal graph; a probability distribution for the agent’s situation will do. Although it might be convenient to represent it as a causal graph. Where I have described the use of causal graphs above, they are merely a component of the reasoning used to infer your probability distribution within probability theory.
(b) That is, a set of hypotheses you might consider would include G = “the phenomenon I am looking at behaves in a manner described by graph G”. Then you calculate the posterior probability P(G | data) × the joint distribution over the variables of the agent’s situation given G, and integrate over G to get the posterior distribution for the agent’s situation.
Given that, you decide what to do based on expected utility with P(outcome | action, data). Obviously, the above calculation is highly nontrivial. In principle you could just use some universal prior (i.e. Solomonoff induction) to calculate the posterior distribution for the agent instead, but that’s even less practical.
In practice you can often approximate this whole process fairly well by assuming the only difference between our situation and the data to be that our decision is uncorrelated with whatever decision procedure was used in the data, and treating it as an “intervention” (which I think might correspond to just using the most likely G, and ignoring all other hypotheses).
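The averaging over graphs described above can be written out roughly as follows. This is only a sketch: V stands for the variables of the agent’s situation, U for the utility function, and a* for the chosen action, all of which are notation introduced here for illustration. The sum over G becomes an integral if the hypothesis space is continuous, and P(outcome | action, data) is assumed to be obtained from P(V | data) by ordinary conditioning.

```latex
P(V \mid \text{data}) \;=\; \sum_{G} P(V \mid G, \text{data})\, P(G \mid \text{data}),
\qquad
a^{*} \;=\; \arg\max_{a} \sum_{o} U(o)\, P(\text{outcome} = o \mid \text{action} = a, \text{data})
```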
Well, you have two problems here. The first (and bigger) problem is you are committing an ontological error (probability distributions are not about causality but about uncertainty. It doesn’t matter if you are B or F about it). The second (smaller, but still significant) problem is that probability distributions by themselves do not contain the information that you want. In other words, you don’t get identifiability of the causal effect in general if all you are given is a probability distribution. To use a metaphor Judea likes to use, if you have a complete surface description of how light reflection works on an item (say a cup), you can construct a computer graphics engine that can render the cup from any angle. But there is no information on how the cup is to be rendered under deformation (that is, if I smash the cup on the table, what will it look like?)
Observed joint probability distributions—surface information, interventional distributions—information after deformations. It might be informative to consider how your (Bayesian?) procedure would work in the cup example. The analogy is almost exact, the set of interventional densities is a much bigger set than the set of observed joint distributions.
I would be very interested in what you think the right decision rule is for my 5 node HAART example. In my example you don’t have to average over possible graphs, because my hypothetical is that we know what the correct graph is (and what the correct corresponding distribution is).
Presumably your answer will take the form of either [decision rule given some joint probability distribution that does not mention any causal language] or “not enough information for an answer.”
If your answer is the latter, your decision theory is not very good. If the former (and by some miracle the decision rule gives the right answer), I would be very interested in a (top level?) post that works out how you recover the correct properties of causal graphs from just probability distributions. If correct, you could easily publish this in any top statistics journal and revolutionize the field. My intuition is that 100 years of statistics is not in fact wrong, and as you start dealing with more and more complex problems (I can generate an inexhaustible list of these), there will come up a lot of “gotchas” that causal folks already dealt with. In order to deal with these “gotchas” you will have to modify and modify your proposal until effectively you just reinvent intervention calculus.
Your graph describes the data generation stochastic process. The agent needs a different one to model the situation it is facing. If it uses the right graph (or more generally, the right joint probability distribution, which doesn’t have to be factorizable), then it will get the right answer.
How to go from a set of data and a model of the data generation process to a model of the agent situation process is, of course, a non trivial problem, but it is not part of the agent decision problem.
Ok, these problems I am posing are not abstract, they are concrete problems in medical decision making. In light of http://lesswrong.com/lw/jco/examples_in_mathematics/, I am going to pose 4 of them, right here, then tell you what the right answer is, and what assumptions I used to get this answer. Whatever decision theory you are using needs to be able to correctly represent and solve these problems, using at most the information that my solutions use, or it is not a very good decision theory (in the sense that there exist known alternatives that do solve these problems correctly). In all problems our utility penalizes patient deaths, or patients getting a disease.
In particular, if you are a user of EDT, you need to give me a concrete algorithm that correctly solves all these problems, without using any causal vocabulary. You can’t just beg off on how it’s a “non trivial problem.” These are decision problems, and there exist solutions for these problems now, using CDT! Can EDT solve them or not? I have yet to see anyone try to seriously engage these (with the notable exception of Paul, who to his credit did try to give a Bayesian/non-causal account of problem 3, but ran out of time).
Note: I am assuming the correct graph, and lots of samples, so the effect of a prior is not significant (and thus talk about empirical frequencies, e.g. p(c)). If we wanted to, we could do a Bayesian problem over possible causal graphs, with a prior, and/or a Bayesian problem for estimation, where we could talk about, for example, the posterior distribution of case histories C. I skipped all that to simplify the examples.
Problem 1:
We perform a randomized controlled trial for a drug (half the patients get the drug, half the patients do not). Some of the patients die (in both groups). Let A be a random variable representing whether the patient in our RCT dataset got the drug, and Y be a random variable representing whether the patient in our RCT dataset died. A new patient comes in who is from the same cohort as those in our RCT. Should we give them the drug?
Solution: Give the drug if and only if E[Y = yes | A = yes] < E[Y = yes | A = no].
Intuition for why this is correct: since we randomized the drug, there are no possible confounders between drug use and death. So any dependence between the drug and death is causal. So we can just look at conditional correlations.
Assumptions used: we need the empirical p(A,Y) from our RCT, and the assumption that the correct causal graph is A → Y. No other assumptions needed.
Ideas here: you should be able to transfer information you learn from observed members in a group to other members of the same group. Otherwise, what is stats even doing?
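A minimal sketch of the Problem 1 rule, assuming the RCT data is available as a list of (a, y) pairs of booleans. The function names and the toy counts are hypothetical, made up purely for illustration, and come from no real trial.

```python
def death_rate(records, treated):
    """Empirical E[Y = yes | A = treated] from (a, y) pairs of booleans."""
    outcomes = [y for a, y in records if a == treated]
    if not outcomes:
        raise ValueError(f"no records with A = {treated}")
    return sum(outcomes) / len(outcomes)


def give_drug_rct(records):
    """Problem 1 rule: treat iff E[Y = yes | A = yes] < E[Y = yes | A = no]."""
    return death_rate(records, True) < death_rate(records, False)


# Hypothetical RCT: 50 treated (10 deaths), 50 untreated (20 deaths).
toy_rct = (
    [(True, True)] * 10 + [(True, False)] * 40
    + [(False, True)] * 20 + [(False, False)] * 30
)

print(death_rate(toy_rct, True), death_rate(toy_rct, False))
print(give_drug_rct(toy_rct))
```

Because the drug was randomized, plain conditional frequencies are all this rule needs; nothing here would be valid for the observational problems that follow.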
Problem 2:
We perform an observational study, where doctors assign (or not) a drug based on observed patient vitals recorded in their case history file. Some of the patients die. Let A be a random variable representing whether the patient in our dataset from our study got the drug, Y be a random variable representing whether the patient in our study died, and C be the random variable representing the patient vitals used by the doctors to decide whether to give the drug or not. A new patient comes in who is from the same cohort as those in our study. If we do not get any additional information on this patient, should we give them the drug?
Solution: Give the drug if and only if \sum_{c} E[Y = yes | A = yes, C = c] p(C = c) < \sum_{c} E[Y = yes | A = no, C = c] p(C = c)
Intuition for why this is correct: we have not randomized the drug, but we recorded all the info doctors used to decide on whether to give the drug. Since case history C represents all possible confounding between A and Y, conditional on knowing C, any dependence between A and Y is causal. In other words, E[Y | A, C] gives a causal dependence of A and Y, conditional on C. But since we are not allowed to measure anything about the incoming patient, we have to average over the possible case histories the patient might have. Since the patient is postulated to have come from the same dataset as those in our study, it is reasonable to average over the observed case histories in our study. This recovers the above formula.
Assumptions used: we need the empirical p(A,C,Y) from our study, and the assumption that the correct causal graph is C → A → Y, C → Y. No other assumptions needed.
Ideas here: this is isomorphic to the smoking lesion problem. The idea here is you can’t use observed correlations if there are confounders, you have to adjust for confounders properly using the g-formula (the formula in the answer).
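A small sketch of the adjustment in Problem 2, next to the naive conditional frequency, assuming records of the form (c, a, y). The helper names and the toy counts are hypothetical; the counts were chosen so that sick patients are treated more often, which makes the naive comparison point the wrong way.

```python
from collections import Counter


def naive_death_rate(records, a):
    """Unadjusted E[Y = yes | A = a]; ignores the confounder C."""
    outcomes = [y for _, ai, y in records if ai == a]
    return sum(outcomes) / len(outcomes)


def adjusted_death_rate(records, a):
    """Problem 2 formula: sum over c of E[Y = yes | A = a, C = c] * p(C = c)."""
    n = len(records)
    total = 0.0
    for c, n_c in Counter(c for c, _, _ in records).items():
        outcomes = [y for ci, ai, y in records if ci == c and ai == a]
        if not outcomes:  # every stratum must contain both treatment arms
            raise ValueError(f"no records with C = {c!r}, A = {a}")
        total += (sum(outcomes) / len(outcomes)) * (n_c / n)
    return total


def rows(c, a, deaths, survivors):
    """Build (c, a, y) records for one cell of the hypothetical study."""
    return [(c, a, True)] * deaths + [(c, a, False)] * survivors


# Hypothetical observational study: doctors mostly treat the sick.
toy_study = (
    rows("sick", True, 40, 40) + rows("sick", False, 16, 4)
    + rows("healthy", True, 1, 19) + rows("healthy", False, 8, 72)
)

for a in (True, False):
    print(a, naive_death_rate(toy_study, a), adjusted_death_rate(toy_study, a))
```

On these made-up counts the naive rates make the drug look harmful, while the adjusted rates favour giving it, which is the reversal the smoking-lesion-style setup is meant to illustrate.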
Problem 3:
We perform a partially randomized and partially observational longitudinal study, where patients are randomly assigned (or not) a drug at time 0, then their vitals at time 1 are recorded in a file, and based on those vitals, and the treatment assignment history at time 0, doctors may (or not), decide to give them more of the drug. Afterwards, at time 2, some patients die (or not). Let A0 be a random variable representing whether the patient in our dataset from our study got the drug at time 0, A1 be a random variable representing whether the patient in our dataset from our study got the drug at time 1, Y be a random variable representing whether the patient in our study died, and C be the random variable representing the case history used by the doctors to decide whether to give the drug or not at time 1. A new patient comes in who is from the same cohort as those in our study. If we do not get any additional information on this patient, should we give them the drug, and if so at what time points?
Solution: Use the drug assignment policy (a0,a1) that minimizes \sum_{c} E[Y = yes | A1 = a1, C = c, A0 = a0] p(C = c | A0 = a0).
Intuition for why this is correct: we have randomized A0, but have not randomized A1, and we are interested in the joint effect of both A0 and A1 on Y. We know C is a confounder for A1, so we have to adjust for it somehow as in Problem 2, otherwise an observed dependence of A1 and Y will contain a non-causal component through C. However, C is not a confounder for the relationship of A0 and Y. Conditional on A0 and C, the relationship between A1 and Y is entirely causal, so E[Y | A1, C, A0] is a causal quantity. However, for the incoming patient, we are not allowed to measure C, so we have to average over C as before in problem 2. However, in our case C is an effect of A0, which means we can’t just average the base rates for case histories, we have to take into account what happened at time 0, in other words the causal effect of A0 on C. Because in our graph, there are no confounders between A0 and C, the causal relationship can be represented by p(C | A0) (no confounders means correlation equals causation). Since A0 also has no confounders for Y, E[Y | A1, C, A0], weighted by p(C | A0) gives us the right causal relationship between {A0,A1} and Y.
Assumptions used: we need the empirical p(A0,C,A1,Y) from our study, and the assumption that the correct causal graph is A0 → C → A1 → Y, A0 → A1, A0 → Y, and we possibly allow that there is an unrestricted hidden variable U that is a parent of both C and Y. No other assumptions needed.
Ideas here: simply knowing that you have confounders is not enough, you have to pay attention to the precise causal relationships to figure out what the right thing to do is. In this case, C is a ‘time-varying confounder,’ and requires a more complicated adjustment that takes into account that the confounder is also an effect of an earlier treatment.
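The Problem 3 rule can be sketched in the same style, assuming records of the form (a0, c, a1, y). Again the function names and counts are hypothetical, and the sketch assumes every (a0, c, a1) cell is populated in the data.

```python
from itertools import product


def policy_death_rate(records, a0, a1):
    """Sum over c of E[Y = yes | A1 = a1, C = c, A0 = a0] * p(C = c | A0 = a0)."""
    arm = [r for r in records if r[0] == a0]
    if not arm:
        raise ValueError(f"no records with A0 = {a0}")
    total = 0.0
    for c in {r[1] for r in arm}:
        stratum = [r for r in arm if r[1] == c]
        outcomes = [r[3] for r in stratum if r[2] == a1]
        if not outcomes:
            raise ValueError(f"no records with A0 = {a0}, C = {c!r}, A1 = {a1}")
        # Weight by p(c | a0), not p(c): C is itself an effect of A0.
        total += (sum(outcomes) / len(outcomes)) * (len(stratum) / len(arm))
    return total


def best_policy(records):
    """Return the (a0, a1) pair with the lowest estimated death rate."""
    return min(product([True, False], repeat=2),
               key=lambda p: policy_death_rate(records, *p))


def rows(a0, c, a1, deaths, survivors):
    """Build (a0, c, a1, y) records for one cell of the hypothetical study."""
    return [(a0, c, a1, True)] * deaths + [(a0, c, a1, False)] * survivors


toy_longitudinal = (
    rows(True, "good", True, 3, 27) + rows(True, "good", False, 6, 24)
    + rows(True, "bad", True, 12, 18) + rows(True, "bad", False, 6, 4)
    + rows(False, "good", True, 4, 16) + rows(False, "good", False, 6, 14)
    + rows(False, "bad", True, 20, 20) + rows(False, "bad", False, 16, 4)
)

print(best_policy(toy_longitudinal))
```

The only structural difference from the Problem 2 sketch is the weighting by p(c | a0) within the time-0 arm, which is where the time-varying nature of the confounder shows up.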
Problem 4:
We consider a (hypothetical) observational study of coprophagic treatment of stomach cancer. It is known (for the purposes of this hypothetical example) that coprophagia’s protective effect vs cancer is due to the presence of certain types of intestinal flora in feces. At the same time, people who engage in coprophagic behavior naturally are not a random sampling of the population, and therefore may be more likely than average to end up with stomach cancer. Let A be a random variable representing whether those in our study engaged in coprophagic behavior, let W be the random variable representing the presence of beneficial intestinal flora, let Y be the random variable representing the presence of stomach cancer, and let U be some unrestricted hidden variable which may influence both coprophagia and stomach cancer. A new patient at risk for stomach cancer comes in who is from the same cohort as those in our study. If we do not get any additional information on this patient, should we give them the coprophagic treatment as a preventative measure?
Solution: Yes, if and only if \sum_{w} p(W = w | A = yes) \sum_{a} E[Y = yes | W = w, A = a] p(A = a) < \sum_{w} p(W = w | A = no) \sum_{a} E[Y = yes | W = w, A = a] p(A = a)
Intuition for why this is correct: since W is independent of confounders for A and Y, and A only affects Y through W, the effect of A on Y decomposes/factorizes into an effect of A on W, and an effect of W on Y, averaged over possible values W could take. The effect of A on W is not confounded by anything, and so is equal to p(W | A). The effect of W on Y is confounded by A, but given our assumptions, conditioning on A is sufficient to remove all confounding for the effect, which gives us \sum_{a} p(Y | W, A = a) p(A = a). This gives the above formula.
Assumptions used: we need the empirical p(A,W,Y) from our study, and the assumption that the correct causal graph is A → W → Y, and there is an unrestricted hidden variable U that is a parent of both A and Y. No other assumptions needed.
Ideas here: sometimes your independences let you factorize effects into other effects, similarly to how Bayesian networks factorize. This lets you solve problems that might seem unsolvable due to the presence of unobserved confounding.
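A sketch of the Problem 4 factorization (often called front-door adjustment), assuming records of the form (a, w, y) and that every (a, w) cell is populated. The function names and counts are hypothetical and chosen only so that the naive comparison and the factorized estimate disagree.

```python
def front_door_rate(records, a):
    """Sum over w of p(w | a) * sum over a2 of E[Y = yes | w, a2] * p(a2)."""
    n = len(records)
    with_a = [r for r in records if r[0] == a]
    total = 0.0
    for w in {r[1] for r in records}:
        p_w_given_a = sum(1 for r in with_a if r[1] == w) / len(with_a)
        inner = 0.0
        for a2 in {r[0] for r in records}:
            cell = [r[2] for r in records if r[0] == a2 and r[1] == w]
            if not cell:
                raise ValueError(f"no records with A = {a2}, W = {w}")
            p_a2 = sum(1 for r in records if r[0] == a2) / n
            inner += (sum(cell) / len(cell)) * p_a2
        total += p_w_given_a * inner
    return total


def naive_rate(records, a):
    """Unadjusted E[Y = yes | A = a]; confounded by the hidden U."""
    outcomes = [y for ai, _, y in records if ai == a]
    return sum(outcomes) / len(outcomes)


def rows(a, w, cases, non_cases):
    """Build (a, w, y) records for one cell of the hypothetical study."""
    return [(a, w, True)] * cases + [(a, w, False)] * non_cases


toy_front_door = (
    rows(True, True, 20, 60) + rows(True, False, 10, 10)
    + rows(False, True, 2, 18) + rows(False, False, 16, 64)
)

for a in (True, False):
    print(a, naive_rate(toy_front_door, a), front_door_rate(toy_front_door, a))
```

Note that U never appears in the code: the estimate uses only the observed p(A,W,Y), which is the point of the factorization.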
I don’t know what you mean by this. Probability distributions can be about whatever you want — it makes perfect sense to speak of “the probability that the cause of X is Y, given some evidence”.