Correlation!=causation: returning to my old theme (latest example: is exercise/mortality entirely confounded by genetics?), what is the right way to model various comparisons?

By which I mean, consider a paper like “Evaluating non-randomised intervention studies”, Deeks et al 2003, which does this:
In the systematic reviews, 8 studies compared results of randomised and non-randomised studies across multiple interventions using metaepidemiological techniques. A total of 194 tools were identified that could be or had been used to assess non-randomised studies. 60 tools covered at least 5 of 6 pre-specified internal validity domains. 14 tools covered 3 of 4 core items of particular importance for non-randomised studies. 6 tools were thought suitable for use in systematic reviews. Of 511 systematic reviews that included nonrandomised studies, only 169 (33%) assessed study quality. 69 reviews investigated the impact of quality on study results in a quantitative manner. The new empirical studies estimated the bias associated with non-random allocation and found that the bias could lead to consistent over- or underestimations of treatment effects, also the bias increased variation in results for both historical and concurrent controls, owing to haphazard differences in case-mix between groups. The biases were large enough to lead studies falsely to conclude significant findings of benefit or harm.
…Conclusions: Results of non-randomised studies sometimes, but not always, differ from results of randomised studies of the same intervention. Nonrandomised studies may still give seriously misleading results when treated and control groups appear similar in key prognostic factors. Standard methods of case-mix adjustment do not guarantee removal of bias. Residual confounding may be high even when good prognostic data are available, and in some situations adjusted results may appear more biased than unadjusted results.
So we get pairs of studies, more or less testing the same thing except one is randomized and the other is correlational. Presumably this sort of study-pair dataset is exactly the kind of dataset we would like to have if we wanted to learn how much we can infer causality from correlational data.
But how, exactly, do we interpret these pairs? If one study finds a CI of 0 to 0.5 and the counterpart finds 0.45 to 1.0, is that confirmation or rejection? If one study finds −0.5 to 0.1 and the other 0 to 0.5, is that confirmation or rejection? What if they are very well powered and the pair looks like 0.2 to 0.3 and 0.4 to 0.5? A criterion of overlapping confidence intervals is not what we want.
We could try to get around it by making a very strict criterion: ‘what fraction of pairs have confidence intervals excluding zero for both studies, and the studies are opposite-signed?’ This seems good: if one study ‘proves’ that X is helpful and the other study ‘proves’ that X is harmful, then that’s as clearcut a case of correlation!=causation as one could hope for. With a pair of studies like −0.5 to −0.1 and +0.1 to +0.5, that is certainly a big problem.
The problem with that is that it is so strict that we would hardly ever conclude a particular case was correlation!=causation (few of the known examples are so well-powered and clearcut), leading to systematic overoptimism, and it inherits the typical problems of NHST like generally ignoring costs (if exercise reduces mortality by 50% in correlational studies and 5% in randomized studies, then to some extent correlation=causation, but the massive overestimate could easily tip exercise from being worthwhile to not being worthwhile).
We also can’t simply do a two-group comparison and get a result like ‘correlational studies always double the effect on average, so to correct, just halve the effect and then see if that is still statistically-significant’ (which is something you can do with, say, blinding or publication bias), because it turns out not to be that conveniently simple—it’s not an issue of researchers predictably biasing ratings toward the desired higher outcome or publishing only the results/studies which show the desired results. The randomized experiments seem to turn in larger, smaller, or opposite-signed results at, well, random.
This is a similar problem to the one with the Reproducibility Project: we would like the replications of the original psychology studies to tell us, in some sense, how ‘trustworthy’ we can consider psychology studies in general. But most of the methods seem to diagnose lack of power as much as anything (the replications were generally powered at 80%+, IIRC, which still means that a lot will not be statistically-significant even if the effect is real). Using Bayes factors is helpful in getting us away from p-values but still not the answer.
It might help to think about what is going on in a generative sense. What do I think creates these results? I would have to say that the results are generally being driven by a complex causal network of genes, biochemistry, ethnicity, SES, varying treatment methods etc. which throws up an even more complex & enormous set of multivariate correlations (which can be either positive or negative), while effective interventions are few & rare (and likewise can be either positive or negative) but drive the occasional correlation as well. When a correlation is presented by a researcher as an effective intervention, it might be drawn from the large set of pure correlations or it might have come from the set of causal effects. It is unlabeled and we are ignorant of which group it came from. There is no oracle which will tell us that a particular correlation is or is not causal (that would make life too easy), but (in this case) we can test it, and get a (usually small) amount of data about what it does in a randomized setting. How do we analyze this?
I would say that what we have here is something quite specific: a mixture model. Each intervention has been drawn from a mixture of two distributions, all-correlation (with a wide distribution allowing for many large negative & positive values) and causal effects (narrow distribution around zero with a few large values), but it’s unknown which of the two it was drawn from and we are also unsure what the probability of drawing from one or the other is. (The problem is similar to my earlier noisy polls: modeling potentially falsified poll data.)
So when we run a study-pair through this, then if they are not very discrepant, the posterior estimate shifts towards having drawn from the causal group in that case—and also slightly increases the overall estimate of the probability of drawing from the causal group; and vice-versa if they are heavily discrepant, in which case it becomes much more probable that there was a draw from the correlational group, and slightly more probable that draws from the correlation group are more common.
At the end of doing this for all the study-pairs, we get estimates of causal/correlation posterior probability for each particular study-pair (which automatically adjusts for power etc. and can be further used for decision-theory, like ‘does this reduce the expected value of the specific treatment of exercise to <=$0?’), but we also get an overall estimate of the switching probability—which tells us in general how often we can expect tested correlations like these to be causal.
I think this gives us everything we want. Working with distributions avoids the power issues, for any specific treatment we can give estimates of being causal, we get an overall estimate as a clear unambiguous probability, etc.
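To make the shape of the model concrete, here is a minimal JAGS sketch of the two-component mixture; the data layout (one row per study-pair, with effect estimates and standard errors) and all priors and variable names are assumptions for illustration, not anything fitted to the actual study-pairs:

```r
## Minimal sketch of the two-component mixture described above (illustrative only).
## Assumed data: for each study-pair i, an RCT estimate y_rct[i] with standard error
## se_rct[i], and a correlational estimate y_corr[i] with standard error se_corr[i].
library(rjags)

model_string <- "
model {
  p_causal  ~ dbeta(1, 1)            # overall switching probability: P(a correlation is causal)
  sd_causal ~ dunif(0, 1)            # narrow distribution of true causal effects
  sd_conf   ~ dunif(0, 3)            # wide distribution of confounded 'effects'
  tau_causal <- pow(sd_causal, -2)
  tau_conf   <- pow(sd_conf, -2)
  for (i in 1:N) {
    z[i]     ~ dbern(p_causal)                          # does pair i's correlation track the causal effect?
    theta[i] ~ dnorm(0, tau_causal)                     # true causal effect of intervention i
    phi[i]   ~ dnorm(0, tau_conf)                       # confounded effect, used when z[i] == 0
    mu_corr[i] <- z[i] * theta[i] + (1 - z[i]) * phi[i]
    y_rct[i]  ~ dnorm(theta[i],   pow(se_rct[i],  -2))  # randomized estimate measures theta
    y_corr[i] ~ dnorm(mu_corr[i], pow(se_corr[i], -2))  # correlational estimate measures theta or phi
  }
}"

# jags_data <- list(N = ..., y_rct = ..., se_rct = ..., y_corr = ..., se_corr = ...)
# fit      <- jags.model(textConnection(model_string), data = jags_data, n.chains = 4)
# samples  <- coda.samples(fit, c("p_causal", "z"), n.iter = 20000)
# summary(samples)  # posterior switching probability, plus P(causal) for each study-pair
```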
So we get pairs of studies, more or less testing the same thing except one is randomized and the other is correlational.
If I got such data I would (a) be very happy, (b) use the RCT to inform policy, and (c) use the pair to point out how correct causal inference methods can recover the RCT result if assumptions hold (hopefully they hold in the observational study). We can try to combine strength of two studies, but then the results live or die by assumptions on how treatments were assigned in the observational study.
I am also not a fan of classifying biases like they do (it’s a common silly practice). For example, it’s really not informative to say “confounding bias”; in reality you can have a lot of types of confounding, with different solutions necessary depending on the type (I like to draw pictures to understand this).
I think Robins et al (?Hernan?) at some point recovered the result of an RCT via his g methods from observational data.
The paper you are referring to is “Observational Studies Analyzed Like Randomized Experiments: An Application to Postmenopausal Hormone Therapy and Coronary Heart Disease” by Hernan et al. It is available at https://cdn1.sph.harvard.edu/wp-content/uploads/sites/343/2013/03/observational-studies.pdf

The controversy about hormone replacement therapy is fascinating as a case study. Until 2002, essentially all women who reached menopause got medical advice to start taking pills containing horse estrogen. It was very widely believed that this would reduce their risk of having a heart attack. This belief was primarily based on biological plausibility: Estrogen is known to reduce cholesterol, and cholesterol is believed to increase the risk of heart disease. Also, there were many observational studies that seemingly suggested that women who took hormone replacement therapy (HRT) had less risk of heart disease. (In my view, this was not surprising: Observational studies always show what the investigators expect to find.)
In 2002, the Women’s Health Initiative randomized trial was stopped early because it showed that estrogen replacement therapy actually substantially increased the risk of having a heart attack. Overnight, the medical establishment stopped recommending estrogen for menopausal women. But a perhaps more important consequence was that many clinicians stopped trusting observational studies altogether. In my opinion, this was mostly a good thing.
The largest observational study to show a protective effect of estrogen was the Nurses Health Study. In 2008, my thesis advisor Miguel Hernan re-analyzed this dataset using Jamie Robins’ g-methods (which are equivalent to Pearl’s), and was essentially able to reproduce the results of the WHI trial. Miguel’s paper uses valid methods and gets the correct results. In my view, this shows that the new methods might work, but the paper would have meant much more if it had been published prior to the randomized trials.
Miguel and Jamie’s paper sparked off a very interesting methodological debate with the original investigators at the Nurses Health Study, who are still clinging to their original analysis. See http://www.ncbi.nlm.nih.gov/pubmed/18813017 .
Many people still believe that Estrogen/HRT is beneficial. The most popular theory is that WHI recruited too many old women (sometimes in their 90s!) and that estrogen is harmful if given that long after menopause. A new randomized trial which is limited to women at menopause is currently being conducted. A second theory is that the results in the trial were due to differences in statin usage. I analyzed the second theory for my doctoral thesis, but found that this had negligible impact on the results.
It is also interesting to note that while it is true that the trial found that estrogen increased the risk of heart disease, it also showed a (non-significant) reduction in all-cause mortality. So the increased risk of cardiovascular disease didn’t even result in more deaths. Presumably, people care more about all-cause mortality than heart attacks. However, since it was “non-significant”, not even the most dedicated proponents of estrogen treatment ever point out this fact.
A side question, prompted by an amusing factoid in the Hernan paper: “...we restricted the population to women who had reported plausible energy intakes (2510–14,640 kJ/d)”.
In the statistical analysis in this paper, and also as a general practice in medical publications based on questionnaire data, are there adjustments for uncertainty in the questionnaire responses?
When you have a data point that says, for example, that person #12345 reports her caloric intake as 4,000 calories/day, do you take it as a hard precise number, or do you take it as an imprecise estimate with its own error which propagates into the model uncertainty, etc.?
Keyword is “measurement error.” People think hard about this. Anders_H knows this paper in a lot more detail than I do, but I expect these particular authors to be careful.
This issue is also related to “missing data.” What you see might be different from the underlying truth in systematic ways, e.g. you get systematic bias in your data, and you need to deal with that. This is also related to that causal inference stuff I keep going on about.
Keyword is “measurement error.” People think hard about this.
People like engineers and physicists think a lot about this. I am not sure that medical researchers think a lot about this. The usual (easy) way is to throw out unreasonable-looking responses during the data cleaning and then take what remains as rock-solid. Accepting that your independent variables are uncertain leads to a lot of inconvenient problems (starting with the OLS regression not being a theoretically-correct form any more).
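As a quick illustration of why this matters (all numbers invented for the example), classical measurement error in a predictor silently attenuates an OLS slope toward zero:

```r
# Invented toy: classical measurement error in a predictor attenuates the OLS slope
# toward zero by the factor var(x) / (var(x) + var(noise)).
set.seed(1)
n      <- 1e5
x_true <- rnorm(n)
y      <- 2 * x_true + rnorm(n)         # true slope is 2
x_obs  <- x_true + rnorm(n, sd = 1)     # what the questionnaire actually records
coef(lm(y ~ x_obs))["x_obs"]            # ~1, i.e. attenuated by 1 / (1 + 1) = 0.5
```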
What you see might be different from the underlying truth in systematic ways, e.g. you get systematic bias in your data, and you need to deal with that.
Yes, that’s another can of worms. In some areas (e.g. self-reported food intake) the problem is so blatant and overwhelming that you have to deal with it, but if it looks minor not many people want to bother.

I am not sure that medical researchers think a lot about this.

Clinicians do not; “methodology people” (who often partner up with “domain experts” to do data analysis) absolutely do.
Just putting the idea out for comment in case there’s some way this fails to deliver what I want it to deliver. Excerpting out all the comparisons and writing up the mixture model in JAGS would be a lot of work; just reading the papers takes long enough as it is.
If I got such data I would (a) be very happy, (b) use the RCT to inform policy, and (c) use the pair to point out how correct causal inference methods can recover the RCT result if assumptions hold (hopefully they hold in the observational study)
Indeed. You can imagine that when I stumbled across Deeks and the rest of them in Google Scholar (my notes), I was overjoyed by their obvious utility (and because it meant I didn’t have to do it myself, as I was musing about doing so using FDA trials) but also completely baffled: why had I never heard of these papers before?
I am not following your mixture model idea. For every data point you know if it comes from the RCT or observational study. You don’t need uncertainty about treatment assignment. What you need is figuring out how to massage observational data to get causal conclusions (e.g. what I think about all day long).
If you have specific observational data you want to look at, email me if you want to chat more.
For every data point you know if it comes from the RCT or observational study. You don’t need uncertainty about treatment assignment.
No, the uncertainty here isn’t about which of the two studies a datapoint came from, but about whether (for a specific treatment/intervention) the correlational study datapoint was drawn from the same distribution as the randomized study datapoint or a different one, and (over all treatments/interventions) what the probability of being drawn from the same distribution is. Maybe it’ll be a little clearer if I narrate how the model might go.
So say you start off with a prior probability of 50-50 about which group a result is drawn from, a switching probability that will be tweaked as you look at data. (If you are studying turtles which could be from a large or a small species, then if you find 2 larger turtles and 8 smaller, you’re probably going to update from P=0.5 to a mixture probability more like P>0.20, since it’s most likely—but not certain—that 1 or 2 of the larger turtles came from the large species and the 8 smaller ones came from the small species.)
For your first datapoint, you have a pair of results: xyzcillin reduces all-cause mortality to RR=0.5 from a correlational study (cohort, cross-sectional, case-control, whatever), and the randomized study of xyzcillin has RR=1.1. What does this mean? Now, of course you know that 0.5 is the correlational result and 1.1 is the randomized result, but we can imagine two relatively distinct scenarios here: ‘xyzcillin actually works but the causal effect is really more like RR=0.7 and the randomized trial was underpowered’, or, ‘xyzcillin has no causal effect whatsoever on mortality and it’s just a bunch of powerful confounds producing results like RR=0.6-0.8’. We observe that 1.1 supports the latter more, and we update towards ‘xyzcillin has 0 effect’ and now give ‘non-causal scenarios are 55% likely’, but not too much because the xyzcillin studies were small and underpowered and so they don’t support the latter scenario that much.
Then for the next datapoint, ‘abcmycin reduces lung cancer’, we get a pair looking like 0.9 and 0.7, and we observe these large trials are very consistent with each other and so they highly support the former theory instead and we update towards ‘abcmycin causally reduces lung cancer’ and ‘noncausal scenarios are 39% likely’.
Then for the third datapoint about defracic surgery for back pain, we again get consistency, like d=0.7 and d=0.5, and we update the probability that ‘defracic surgery reduces back pain’ and also push even further to ‘noncausal scenarios are 36% likely’ because their sample sizes were decent.
And we do this update for each pair we finish, and after bouncing back and forth with each pair, we wind up with an estimate that Nature draws from the non-causal scenario 37% of the time (i.e. the switching probability of the mixture is p=0.37). And now we can use that as a prior in evaluating any new medicine or surgery.
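To make a single pair’s update concrete, here is a toy calculation for the xyzcillin example, treating the component distributions as known and with every number (standard errors, spreads, prior) assumed purely for illustration:

```r
# Toy single-pair update (all numbers hypothetical), on the log-RR scale.
library(mvtnorm)

y <- c(log(1.1), log(0.5))       # (RCT estimate, correlational estimate)
se_rct <- 0.25; se_corr <- 0.30  # assumed standard errors
sd_causal <- 0.2; sd_conf <- 0.6 # assumed spreads of causal vs confounded effects
p_causal <- 0.5                  # current estimate of the switching probability

# If the correlational result tracks the causal effect, both estimates share one
# theta ~ N(0, sd_causal^2), so they are positively correlated:
sigma_causal <- matrix(c(sd_causal^2 + se_rct^2, sd_causal^2,
                         sd_causal^2,            sd_causal^2 + se_corr^2), nrow = 2)
# If not, the correlational estimate is an independent draw from the wide confounded distribution:
sigma_noncausal <- diag(c(sd_causal^2 + se_rct^2, sd_conf^2 + se_corr^2))

lik_causal    <- dmvnorm(y, sigma = sigma_causal)
lik_noncausal <- dmvnorm(y, sigma = sigma_noncausal)
post_causal   <- p_causal * lik_causal /
                 (p_causal * lik_causal + (1 - p_causal) * lik_noncausal)
post_causal  # posterior probability that this particular pair came from the causal component
```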
If you have specific observational data you want to look at, email me if you want to chat more.
If you want to look at specific study-pairs, they’re all listed & properly cited in the papers I’ve collated & provided fulltext links for. I suspect that the more advanced methods will require individual-level patient data, which sadly only a very few studies will release, but perhaps you can still find enough of those to make it worth your while to analyze, if Robins et al can get a publishable paper out of just 1 RCT.
If I understood you correctly, there are two separate issues here.
The first is what people call “transportability” (how to sensibly combine results of multiple studies if units in those studies aren’t the same). People try all sorts of things (Gelman does random effects models, I think?). Pearl’s student Elias Bareinboim (now at Purdue) thinks about that stuff using graphs.
I wish I could help, but I don’t know as much about this subject as I want. Maybe I should think about it more.
The second issue is that in addition to units in two studies “not being the same” one study is observational (has weird treatment assignment) and one is randomized properly. That part I know a lot about, that’s classical causal inference—how to massage observational data to make it look like an RCT.
I would advise thinking about these problems separately, that is start trying to solve combining two RCTs.
edit: I know you are trying to describe things to me on the level of individual points to help me understand. But I think a more helpful way to go is to ignore sampling variability entirely, and just start with two joint distributions P1 and P2 that represent variables in your two studies (in other words you assume infinite sample size, so you get the distributions exactly). How do we combine them into a single conclusion (let’s say the “average causal effect”: difference in outcome means under treatment vs placebo)? Even this is not so easy to work out.
I would advise thinking about these problems separately, that is start trying to solve combining two RCTs.
I think when you break it into two separate problems like that, you miss the point. Combining two RCTs is reasonably well-solved by multilevel random effects models. I’m also not trying to solve the problem of inferring from a correlational dataset to specific causal models, which seems well in hand by Pearlean approaches. I’m trying to bridge between the two: assume a specific generative model for correlation vs causation and then infer the distribution.
How do we combine them into a single conclusion (let’s say the “average causal effect”: difference in outcome means under treatment vs placebo)?
But this is exactly the problem! Apparently, there is no meaningful ‘average causal effect’ between correlational and causational studies. In one study, it was much larger; in the next, it was a little smaller; in the next, it was much smaller; in the one after that, the sign reversed… If you create a regular multilevel meta-analysis of a bunch of randomized and correlational studies, say, and you toss in a fixed-effect covariate and regress ‘Y ~ Randomized’, you get an estimate of ~0. The actual effect in each case may be quite large, but the average over all the studies is a wash.
This is different from other methodological problems. With placebos, there is a predictable systematic effect which gives you a large positive bias. Likewise, publication bias skews effects up. Likewise, non-blinding of raters. And so on and so forth. You can easily estimate with an additive fixed-effect / linear model and correct for particular biases. But with randomized vs correlational, it seems that there’s no particular direction the effects head in; you just know that whatever they are, they’ll be different from your correlational results. So you need to do something more imaginative in modeling.
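For contrast, the ‘additive correction’ approach that does work for those biases is just a meta-regression with a design covariate, something like this hypothetical metafor sketch (the data frame and column names are made up):

```r
# Hypothetical sketch of the additive-bias approach that works for blinding or
# publication bias but fails here: regress effect sizes on a design indicator.
library(metafor)
# 'studies' is assumed to have one row per estimate: yi (effect size),
# vi (sampling variance), and randomized (0/1 indicator).
fit <- rma(yi, vi, mods = ~ randomized, data = studies)
summary(fit)  # in the scenario described above, the 'randomized' coefficient averages out to ~0
```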
But I think a more helpful way to go is to ignore sampling variability entirely, and just start with two joint distributions P1 and P2 that represent variables in your two studies (in other words you assume infinite sample size, so you get the distributions exactly).
OK, let’s imagine all our studies are infinite sized. I collect 5 study-pairs, correlational vs randomized, d effect size:
0.5 vs 0.1 (difference: 0.4)
−0.22 vs −0.22 (difference: 0)
−0.8 vs 0.2 (difference: −1.0)
0.3 vs 0.3 (difference: 0)
0.5 vs −0.1 (difference: 0.6)
I apply my mixture model strategy.
We see that in studies #2 and #4 the correlational and causal effects are identical, 100% confidence, and thus both were drawn from the causal distribution. With two datapoints, −0.22 and 0.3, we begin to infer that the distribution of causal effects is probably fairly narrow around 0, and we update our normal distribution appropriately to be skeptical about any claims of large causal effects.
We see in studies #1, #3, and #5 that the correlational and causal effects differ, 100% confidence, and thus we know that the correlational effect for that particular treatment was drawn from the general correlational distribution. The correlational effects are 0.5, −0.8, 0.5: all quite large, and so we infer that correlational effects tend to be quite large and their distribution has a large standard deviation (or whatever).
We then note that in 2⁄5 of the pairs, the correlational effect was the causal effect, and so we estimate that the probability of a correlational effect having been drawn from the causal distribution rather than the correlation distribution is P=2/5. Or in other words, correlation=causality 40% of the time. However, if we had tried to calculate an additive variable like in a meta-regression, we would find that the Randomized covariate was estimated at exactly 0 (mean(c(0.4, 0, -1.0, 0, 0.6)) ~> [1] 0) and certainly is not statistically-significant.
Now when someone comes to us with an infinite-sized correlational trial that purified Egyptian mummy reduces allergy symptoms by d=0.5, we feed it into our mixture model and we get a useful posterior distribution which exhibits a bimodal pattern where it is heavily peaked at 0 (reflecting the more-likely-than-not scenario that mummy is mummery) but also peaked at d=0.4 or so, reflecting shrinkage of the scenario that mummy is munificent, which will predict better than if we naively tried to just shift the d=0.5 posterior distribution up or down some units.
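A toy version of that final calculation, with all the component parameters assumed rather than estimated and a small standard error standing in for the ‘infinite-sized’ study:

```r
# Toy calculation of the bimodal posterior for a new correlational estimate d = 0.5
# (all parameters assumed for illustration): with probability post_causal the estimate
# tracks the causal effect (a peak near the shrunken 0.5); otherwise the causal effect
# is just a draw from the skeptical narrow-around-zero component (a peak near 0).
p_causal  <- 0.4                  # switching probability estimated from the study-pairs above
d_obs     <- 0.5; se_obs <- 0.05  # the new correlational result and its (assumed) standard error
sd_causal <- 0.3; sd_corr <- 1.0  # assumed spreads of the causal and correlational components

lik_causal <- dnorm(d_obs, 0, sqrt(sd_causal^2 + se_obs^2))  # marginal likelihood if causal
lik_corr   <- dnorm(d_obs, 0, sqrt(sd_corr^2   + se_obs^2))  # marginal likelihood if confounded
post_causal <- p_causal * lik_causal /
               (p_causal * lik_causal + (1 - p_causal) * lik_corr)

shrink <- sd_causal^2 / (sd_causal^2 + se_obs^2)  # how far d_obs is shrunk toward 0 if causal
c(P_causal = post_causal, peak_if_causal = shrink * d_obs)
```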
The problem with real studies is that they are not infinitely sized, so when the point-estimates disagree and we get 0.45 vs 0.5, obviously we cannot strongly conclude which distribution in the mixture it was drawn from, and so we need to propagate that uncertainty through the whole model and all its parameters.
I think when you break it into two separate problems like that, you miss the point.
I am pretty sure I am not, but let’s see. What you are basically saying is “analysis ⇒ synthesis doesn’t work.”
Combining two RCTs is reasonably well-solved by multilevel random effects models.
Hierarchical models are a particular parametric modeling approach for data drawn from multiple sources. People use this type of stuff to good effect, but saying it “solves the problem” here is sort of like saying linear regression “solves” RCTs. What if the modeling assumptions are wrong? What if you are not sure what the model should be?
I’m also not trying to solve the problem of inferring from a correlational dataset to specific causal models, which seems well in hand by Pearlean approaches.
Let’s call them “interventionist approaches.” Pearl is just the guy people here read. People have been doing causal analysis from observational data since at least the 70s, probably earlier in certain special cases.
I’m trying to bridge between the two: assume a specific generative model for correlation vs causation and then infer the distribution.
Ok.
But this is exactly the problem! Apparently, there is no meaningful ‘average causal effect’ between correlational and causational studies.
This is what we should talk about.
If there is one RCT, we have a treatment A (with two levels a, and a’) and outcome Y. Of interest is outcome under hypothetical treatment assignment to a value, which we write Y(a) or Y(a’). “Average causal effect” is E[Y(a)] - E[Y(a’)]. So far so good.
If there is one observational study, say A is assigned based on C, and C affects Y, what is of interest is still Y(a) or Y(a’). Interventionist methods would give you a formula for E[Y(a)] - E[Y(a’)] in terms of p(A,C,Y). You can then construct an estimator for that formula, and life is good. So far so good.
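To make that concrete for this simple setup (A assigned based on a fully measured C), the formula in question would be the usual back-door / g-formula adjustment:

E[Y(a)] = Σ_c E[Y | A=a, C=c] · P(C=c), so E[Y(a)] − E[Y(a’)] = Σ_c (E[Y | A=a, C=c] − E[Y | A=a’, C=c]) · P(C=c).

Everything on the right-hand side is an ordinary feature of the observational joint distribution p(A,C,Y).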
Note that so far I made no modeling assumptions on the relationship of A and Y at all. It’s all completely unrestricted by choice of statistical model. I can do crazy non-parametric random forest to model the relationship of A and Y if I wanted. I can do linear regression. I can do whatever. This is important—people often smuggle in modeling assumptions “too soon.” When we are talking about prediction problems like in machine learning, that’s ok. We don’t care about the model too much; we just want good predictive performance. When we care about effects, the model is important. This is because if the effect is not strong and your model is garbage, it can mislead you.
If there are two RCTs, we have two sets of outcomes: Y1(a), Y1(a’) and Y2(a), Y2(a’). Even here, there is no one causal effect so far. We need to make some sort of assumption on how to combine these. For example, we may try to generalize regression models, and say that a lot of the way A affects Y is the same regression across the two studies, but some of the regression terms are allowed to differ to model population heterogeneity. This is what hierarchical models do. In general we have E[f(Y1(a), Y2(a))] - E[f(Y1(a’),Y2(a’))], for some f(.,.) that we should justify. At this level, things are completely non-parametric. We can model the relationship of A and Y1,Y2 however we want. We can model f however we want.
If we have one RCT and one observational study, we still have Y1(a), Y1(a’) for the RCT, and Y2(a), Y2(a’) for the observational study. To determine the latter we use “interventionist approaches” to express them in terms of observational data. We then combine things using f(.,.) as before. As before we should justify all the modeling we are doing.
I am pretty sure Bareinboim thought about this stuff (but he doesn’t do statistical inference, just the general setup).
What you are basically saying is “analysis ⇒ synthesis doesn’t work.”
I am pretty sure it is not going to let you take an effect size and a standard error from a correlational study and get out an accurate posterior distribution of the causal effect without doing something similar to what I’m proposing.
If there are two RCTs, we have two sets of outcomes: Y1(a), Y1(a’) and Y2(a), Y2(a’). Even here, there is no one causal effect so far. We need to make some sort of assumption on how to combine these. For example, we may try to generalize regression models, and say that a lot of the way A affects Y is the same regression across the two studies, but some of the regression terms are allowed to differ to model population heterogeneity. This is what hierarchical models do. In general we have E[f(Y1(a), Y2(a))] - E[f(Y1(a’),Y2(a’))], for some f(.,.) that we should justify. At this level, things are completely non-parametric. We can model the relationship of A and Y1,Y2 however we want. We can model f however we want.
Ok, and how do we model them? I am proposing a multilevel mixture model to compare them.
If we have one RCT and one observational study, we still have Y1(a), Y1(a’) for the RCT, and Y2(a), Y2(a’) for the observational study. To determine the latter we use “interventionist approaches” to express them in terms of observational data. We then combine things using f(.,.) as before. As before we should justify all the modeling we are doing.
Which is not going to work, since in most, if not all, of these studies the original patient-level data is not going to be available and you’re not even going to get a correlation matrix out of the published paper; and I haven’t seen any intervention-style algorithms which work with just the effect sizes, which is what is on offer.
To work with the sparse data that is available, you are going to have to do something in between a meta-analysis and an interventionist analysis.
I am proposing a multilevel mixture model to compare them.
Ok. You can use whatever statistical model you want, as long as we are clear what the underlying object is you are dealing with. The difficulty here isn’t the statistical modeling, but being clear about what it is that is being estimated (in other words the interpretation of the parameters of the model). This is why I don’t talk about statistical modeling at first.
haven’t seen any intervention-style algorithms which work with just the effect sizes which is what is on offer.
If all you have is reported effect sizes you won’t get anything good out. You need the data they used.
Depends on what you want. It doesn’t matter “who has priority” when it comes to learning the subject. Pearl’s book is good, but one big disadvantage of reading just Pearl is Pearl does not deal with the statistical inference end of causal inference very much (by choice). Actually, I heard Pearl has a new book in the works, more suitable for teaching.
But ultimately we must draw causal conclusions from actual data, so statistical inference is important. Some big names that combine causal and statistical inference: Jamie Robins, Miguel Hernan, Eric Tchetgen Tchetgen, Tyler VanderWeele (Harvard causal group), Mark van der Laan (Berkeley), Donald Rubin et al (Harvard), Frangakis, Rosenblum, Scharfstein, etc. (Johns Hopkins causal group), Andrea Rotnitzky (Harvard), Susan Murphy (Michigan), Thomas Richardson (UW), Phillip Dawid (Cambridge, but retired, incidentally the inventor of conditional independence notation). Lots of others.
You’re using correlation in what I would consider a weird way. Randomization is intended to control for selection effects to reduce confounds, but when somebody says correlational study I get in my head that they mean an observational study in which no attempt was made to determine predictive causation. When an effect shows up in a nonrandomized study, it’s not that you can’t determine whether the effect was causative; it’s that it’s more difficult to determine whether the causation was due to the independent variable or an extraneous variable unrelated to the independent variable. It’s not a question of whether the effect is due to correlation or causation, but whether the relationship between the independent and dependent variable even exists at all.
(1) Observational studies are almost always attempts to determine causation. Sometimes the investigators try to pretend that they aren’t, but they aren’t fooling anyone, least of all the general public. I know they are attempting to determine causation because nobody would be interested in the results of the study unless they were interested in causation. Moreover, I know they are attempting to determine causation because they do things like “control for confounding”. This procedure is undefined unless the goal is to estimate a causal effect.
(2) What do you mean by the sentence “the study was causative”? Of course nobody is suggesting that the study itself had an effect on the dependent variable?
(3) Assuming that the statistics were done correctly and that the investigators have accounted for sampling variability, the relationship between the independent and dependent variable definitely exists. The correlation is real, even if it is due to confounding. It just doesn’t represent a causal effect.
You are assuming a couple of things which are almost always true in your (medical) field, but are not necessarily true in general. For example,
Observational studies are almost always attempts to determine causation
Nope. Another very common reason is to create a predictive model without caring about actual causation. If you can’t do interventions but would like to forecast the future, that’s all you need.
Assuming that the statistics were done correctly and that the investigators have accounted for sampling variability, the relationship between the independent and dependent variable definitely exists.
That further assumes your underlying process is stable and is not subject to drift, regime changes, etc. Sometimes you can make that assumption, sometimes you cannot.
Another very common reason is to create a predictive model without caring about actual causation. If you can’t do interventions but would like to forecast the future, that’s all you need.
You’d also like a guarantee that others can’t do interventions, or else your measure could be gamed. (But if there’s an actual causal relationship, then ‘gaming’ isn’t really possible.)
(1) I just think calling a nonrandomized study a correlational study is weird.
(2) I meant to say effect; not study; fixed
(3) If something is caused by a confounding variable, then the independent variable may have no relationship with the dependent variable. You seem to be using correlation to mean the result of an analysis, but I’m thinking of it as the actual real relationship which is distinct from causation. So y=x does not mean y causes x or that x causes y.
I don’t understand what you mean by “real relationship”. I suggest tabooing the terms “real relationship” and “no relationship”.
I am using the word “correlation” to discuss whether the observed variable X predicts the observed variable Y in the (hypothetical?) superpopulation from which the sample was drawn. Such a correlation can exist even if neither variable causes the other.
If X predicts Y in the superpopulation (regardless of causality), the correlation will indeed be real. The only possible definition I can think of for a “false” correlation is one that does not exist in the superpopulation, but which appears in your sample due to sampling variability. Statistical methodology is in general more than adequate to discuss whether the appearance of correlation in your sample is due to real correlation in the superpopulation. You do not need causal inference to reason about this question. Moreover, confounding is not relevant.
Confounding and causal inference are only relevant if you want to know whether the correlation in the superpopulation is due to the causal effect of X on Y. You can certainly define the causal effect as the “actual real relationship”, but then I don’t understand how it is distinct from causation.
The only possible definition I can think of for a “false” correlation is one that does not exist in the superpopulation, but which appears in your sample due to sampling variability.
Right. Which is the problem randomization attempts to correct for, which I think of as a separate problem from causation.
Intersample variability is a type of confound. Increasing sample size is another method for reducing confounding due to intersample variability. Maybe you meant intrasample variability, but that doesn’t make much sense to me in context. Maybe you think of intersample variability as sampling error? Or maybe you have a weird definition of confounding?
Either way, confounding is a separate problem from causation. You can isolate the confounding variables from the independent variable to determine the correlation between x and y without determining a causal relationship. You can also determine the presence of a causal relationship without isolating the independent variable from possible confounding variables.
The nonrandomized studies are determining causality; they’re just doing a worse job at isolating the independent variable, which is what gwern appears to be talking about here.
Or maybe you have a weird definition of confounding?
I use the standard definition of confounding based on whether E(Y| X=x) = E(Y| Do(X=x)), and think about it in terms of whether there exists a backdoor path between X and Y.
Either way, confounding is a separate problem from causation.
The concept of confounding is defined relative to the causal query of interest. If you don’t believe me, try to come up with a coherent definition of confounding that does not depend on the causal question.
You can isolate the confounding variables from the independent variable to determine the correlation between x and y without determining a causal relationship.
With standard statistical techniques you will be able to determine the correlation between X and Y. You will also be able to determine the correlation between X and Y conditional on Z. These are both valid questions and they are both true correlations. Whether either of those correlations is interesting depends on your causal question and on whether Z is a confounder for that particular query.
You can also determine the presence of a causal relationship without isolating the independent variable from possible confounding variables.
No you can’t. (Unless you have an instrumental variable, in which case you have to make the assumption that the instrument is unconfounded instead of the treatment of interest)
(re: last sentence, also have to assume no direct effect of instrument, but I am sure you knew that, just emphasizing the confounding assumption since discussion is about confounding).
Grandparent’s attitude is precisely what is wrong with LW culture’s complete and utter lack of epistemic/social humility (which I think they inherited from Yudkowsky and his planet-sized ego). Him telling you of all people that you are using a weird definition of confounding is incredibly amusing.
Just putting the idea out for comment in case there’s some way this fails to deliver what I want it to deliver. Excerpting out all the comparisons and writing up the mixture model in JAGS would be a lot of work; just reading the papers takes long enough as it is.
Indeed. You can imagine that when I stumbled across Deeks and the rest of then in Google Scholar (my notes), I was overjoyed by their obvious utility (and because it meant I didn’t have to do it myself, as I was musing about doing using FDA trials) but also completely baffled: why had I never heard of these papers before?
I am not following your mixture model idea. For every data point you know if it comes from the RCT or observational study. You don’t need uncertainty about treatment assignment. What you need is figuring out how to massage observational data to get causal conclusions (e.g. what I think about all day long).
If you have specific observational data you want to look at, email me if you want to chat more.
No, the uncertainty here isn’t about which of the two studies a datapoint came from, but about whether (for a specific treatment/intervention) the correlational study datapoint was drawn from the same distribution as the randomized study datapoint or a different one, and (over all treatments/interventions) what the probability of being drawn from the same distribution is. Maybe it’ll be a little clearer if I narrate how the model might go.
So say you start off with a prior probability of 50-50 about which group a result is drawn from, a switching probability that will be tweaked as you look at data. (If you are studying turtles which could be from a large or a small species, then if you find 2 larger turtles and 8 smaller, you’re probably going to update from P=0.5 to a mixture probability more like P>0.20, since it’s most likely—but not certain—that 1 or 2 of the larger turtles came from the large species and the 8 smaller ones came from the small species.)
For your first datapoint, you have a pair of results: xyzcillin reduces all-cause mortality to RR=0.5 from a correlational study (cohort, cross-sectional, case-control, whatever), and the randomized study of xyzcillin has RR=1.1. What does this mean? Now, of course you know that 0.5 is the correlational result and 1.1 is the randomized result, but we can imagine two relatively distinct scenarios here: ‘xyzcillin actually works but the causal effect is really more like RR=0.7 and the randomized trial was underpowered’, or, ‘xyzcillin has no causal effect whatsoever on mortality and it’s just a bunch of powerful confounds producing results like RR=0.6-0.8’. We observe that 1.1 supports the latter more, and we update towards ‘xyzcillin has 0 effect’ and now give ‘non-causal scenarios are 55% likely’, but not too much because the xyzcillin studies were small and underpowered and so they don’t support the latter scenario that much.
Then for the next datapoint, ‘abcmycin reduces lung cancer’, we get a pair looking like 0.9 and 0.7, and we observe these large trials are very consistent with each other and so they highly support the former theory instead and we update towards ‘abcmycin causally reduces lung cancer’ and ‘noncausal scenarios are 39% likely’.
Then for the third datapoint about defracic surgery for backpain, we again get consistency like d=0.7 and d=0.5 and we update the probability that ‘defracic surgery reduces back pain’ and also push even further ’noncausal scenarios are 36% likely” because their sample sizes were decent.
And we do update for each pair we finish, and after bouncing back and forth with each pair, we wind up with an estimate that Nature draws from the non-causal scenario 37% of the time (ie the switching probability of the mixture is p=0.37). And now we can use that as a prior in evaluating any new medicine or surgery.
If you want to look at specific study-pairs, they’re all listed & properly cited in the papers I’ve collated & provided fulltext links for. I suspect that the more advanced methods will require individual level patient data, which sadly only a very few studies will release, but perhaps you can still find enough of those to make it worth your while and analyze if Robins et al can get a publishable paper out of just 1 RCT.
If I understood you correctly, there are two separate issues here.
The first is what people call “transportability” (how to sensibly combine results of multiple studies if units in those studies aren’t the same). People try all sorts of things (Gelman does random effects models I think?) Pearl’s student Elias Barenboim (now at Purdue) thinks about that stuff using graphs.
I wish I could help, but I don’t know as much about this subject as I want. Maybe I should think about it more.
The second issue is that in addition to units in two studies “not being the same” one study is observational (has weird treatment assignment) and one is randomized properly. That part I know a lot about, that’s classical causal inference—how to massage observational data to make it look like an RCT.
I would advise thinking about these problems separately, that is start trying to solve combining two RCTs.
edit: I know you are trying to describe things to me on the level of individual points to help me understand. But I think a more helpful way to go is to ignore sampling variability entirely, and just start with two joint distributions P1 and P2 that represent variables in your two studies (in other words you assume infinite sample size, so you get the distributions exactly). How do we combine them into a single conclusion (let’s say the “average causal effect”: difference in outcome means under treatment vs placebo)? Even this is not so easy to work out.
I think when you break it into two separate problems like that, you miss the point. Combining two RCTs is reasonably well-solved by multilevel random effects models. I’m also not trying to solve the problem of inferring from a correlational dataset to specific causal models, which seems well in hand by Pearlean approaches. I’m trying to bridge between the two: assume a specific generative model for correlation vs causation and then infer the distribution.
But this is exactly the problem! Apparently, there is no meaningful ‘average causal effect’ between correlational and causational studies. In one study, it was much larger; in the next, it was a little smaller; in the next, it was much smaller; in the one after that, the sign reversed… If you create a regular multilevel meta-analysis of a bunch of randomized and correlational studies, say, and you toss in a fixed-effect covariate and regress ‘Y ~ Randomized’, you get an estimate of ~0. The actual effect in each case may be quite large, but the average over all the studies is a wash.
This is different from other methodological problems. With placebo effects, there is a predictable systematic bias which inflates your estimate; likewise, publication bias skews effects upwards; likewise, non-blinding of raters; and so on and so forth. You can easily estimate such biases with an additive fixed-effect / linear model and correct for them. But with randomized vs correlational, it seems that there’s no particular direction the effects head in; you just know that whatever they are, they’ll be different from your correlational results. So you need to do something more imaginative in modeling.
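A quick self-contained R illustration of that point, with made-up numbers: when the correlational estimates differ from the randomized ones by large amounts in randomly-varying directions, an additive ‘Randomized’ term in a regression washes out to ~0 even though every individual discrepancy is substantial.

    set.seed(2)
    n      <- 200
    d.rand <- rnorm(n, 0, 0.15)                                                 # randomized effect sizes
    d.corr <- d.rand + sample(c(-1, 1), n, replace=TRUE) * runif(n, 0.2, 0.8)   # big, sign-random gaps
    d          <- c(d.corr, d.rand)
    randomized <- rep(0:1, each = n)
    coef(summary(lm(d ~ randomized)))["randomized", ]   # the 'randomized' coefficient is near 0
    mean(abs(d.corr - d.rand))                          # yet the average absolute discrepancy is ~0.5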
OK, let’s imagine all our studies are infinitely sized. I collect 5 study-pairs, correlational vs randomized, with d effect sizes:
0.5 vs 0.1 (difference: 0.4)
-0.22 vs −0.22 (difference: 0)
−0.8 vs 0.2 (difference: −1.0)
0.3 vs 0.3 (difference: 0)
0.5 vs −0.1 (difference: 0.6)
I apply my mixture model strategy.
We see that in studies #2 and #4, the correlational and causal effects are identical, 100% confidence, and thus both correlational results were drawn from the causal (randomized) distribution. With the two datapoints −0.22 and 0.3, we begin to infer that the distribution of causal effects is probably fairly narrow around 0, and we update our normal distribution appropriately to be skeptical about any claims of large causal effects.
We see in studies #1, #3, and #5 that the correlational and causal effects differ, 100% confidence, and thus we know that the correlational effect for each of those treatments was drawn from the general correlational distribution. Those correlational effects are 0.5, −0.8, and 0.5, all quite large in magnitude, and so we infer that correlational effects tend to be quite large and their distribution has a large standard deviation (or whatever).
We then note that in 2⁄5 of the pairs, the correlational effect was the causal effect, and so we estimate that the probability of a correlational effect having been drawn from the causal distribution rather than the correlational distribution is P=2/5; or in other words, correlation=causality 40% of the time. However, if we had tried to estimate an additive adjustment like in a meta-regression, we would find that the Randomized covariate was estimated at exactly 0 (
mean(c(0.4, 0, -1.0, 0, 0.6)) ~> [1] 0
) and certainly is not statistically-significant. Now when someone comes to us with an infinite-sized correlational trial showing that purified Egyptian mummy reduces allergy symptoms by d=0.5, we feed it into our mixture model and we get a useful posterior distribution exhibiting a bimodal pattern: heavily peaked at 0 (reflecting the more-likely-than-not scenario that mummy is mummery) but also peaked at d=0.4 or so (reflecting shrinkage under the scenario that mummy is munificent). This will predict better than if we had naively tried to shift the d=0.5 posterior distribution up or down by some fixed amount.
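A rough R sketch of that bimodal posterior, under made-up numbers: a N(0, 0.15) prior on true causal effects, the observed correlational d=0.5 with a SE of 0.05, and the mixture weight held fixed at the prior 60% non-causal implied by the 2⁄5 estimate above (a full analysis would also update that weight by each component’s marginal likelihood):

    d.obs  <- 0.5; se.obs <- 0.05     # the hypothetical mummy study
    p.noncausal <- 1 - 2/5            # prior probability the correlational result is confounded
    grid <- seq(-1, 1, by = 0.001)
    ## component 1: the correlation is confounded, so the causal effect is just a draw
    ## from the (assumed) narrow prior over true causal effects:
    comp.confounded <- dnorm(grid, 0, 0.15)
    ## component 2: correlation = causation, so the causal effect is the observed d,
    ## shrunk toward the prior (standard normal-normal conjugate update):
    post.var  <- 1 / (1/0.15^2 + 1/se.obs^2)
    post.mean <- post.var * d.obs / se.obs^2
    comp.real <- dnorm(grid, post.mean, sqrt(post.var))
    posterior <- p.noncausal * comp.confounded + (1 - p.noncausal) * comp.real
    plot(grid, posterior, type = "l")  # bimodal: one peak at 0, another at ~0.45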
The problem with real studies is that they are not infinitely sized, so when the point-estimates disagree and we get 0.45 vs 0.5, obviously we cannot strongly conclude which distribution in the mixture it was drawn from, and so we need to propagate that uncertainty through the whole model and all its parameters.
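One way to write down the finite-sample version of such a mixture, purely as a sketch (the particular normal forms and the notation are illustrative, not settled): for each study-pair i there is a latent true causal effect θ_i and a latent indicator z_i of whether the correlational study was confounded,

    \begin{align*}
    \theta_i &\sim \mathcal{N}(\mu_{\text{causal}}, \sigma_{\text{causal}}^2) \\
    z_i &\sim \mathrm{Bernoulli}(p) \\
    d_i^{\text{rand}} &\sim \mathcal{N}(\theta_i, \mathrm{SE}_{i,\text{rand}}^2) \\
    d_i^{\text{corr}} \mid z_i = 0 &\sim \mathcal{N}(\theta_i, \mathrm{SE}_{i,\text{corr}}^2) \\
    d_i^{\text{corr}} \mid z_i = 1 &\sim \mathcal{N}(\mu_{\text{conf}}, \sigma_{\text{conf}}^2 + \mathrm{SE}_{i,\text{corr}}^2)
    \end{align*}

Fitting this hierarchically, with priors on (p, μ, σ) and the z_i marginalized out, propagates each pair’s classification uncertainty into the posterior for the switching probability p.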
I am pretty sure I am not, but let’s see. What you are basically saying is “analysis ⇒ synthesis doesn’t work.”
Hierarchical models are a particular parametric modeling approach for data drawn from multiple sources. People use this type of stuff to good effect, but saying it “solves the problem” here is sort of like saying linear regression “solves” RCTs. What if the modeling assumptions are wrong? What if you are not sure what the model should be?
Let’s call them “interventionist approaches.” Pearl is just the guy people here read. People have been doing causal analysis from observational data since at least the 70s, probably earlier in certain special cases.
Ok.
This is what we should talk about.
If there is one RCT, we have a treatment A (with two levels a and a’) and an outcome Y. Of interest is the outcome under hypothetical treatment assignment to a value, which we write Y(a) or Y(a’). The “average causal effect” is E[Y(a)] - E[Y(a’)]. So far so good.
If there is one observational study, say A is assigned based on C, and C affects Y, what is of interest is still Y(a) or Y(a’). Interventionist methods would give you a formula for E[Y(a)] - E[Y(a’)] in terms of p(A,C,Y). You can then construct an estimator for that formula, and life is good. So far so good.
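For instance, in the simple case where C captures all the confounding (ie blocks every backdoor path from A to Y), one standard such formula is the backdoor adjustment

    E[Y(a)] \;=\; \sum_{c} E[Y \mid A = a, C = c] \, p(C = c),

so that E[Y(a)] - E[Y(a’)] is expressible, and hence estimable, entirely in terms of the observational joint p(A,C,Y).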
Note that so far I have made no modeling assumptions about the relationship of A and Y at all. It’s all completely unrestricted by choice of statistical model. I can do a crazy non-parametric random forest to model the relationship of A and Y if I wanted. I can do linear regression. I can do whatever. This is important—people often smuggle in modeling assumptions “too soon.” When we are talking about prediction problems, like in machine learning, that’s ok: we don’t care about the model too much, we just want good predictive performance. When we care about effects, the model is important, because if the effect is not strong and your model is garbage, it can mislead you.
If there are two RCTs, we have two sets of outcomes: Y1(a), Y1(a’) and Y2(a), Y2(a’). Even here, there is no one causal effect so far. We need to make some sort of assumption on how to combine these. For example, we may try to generalize regression models, and say that a lot of the way A affects Y is the same regression across the two studies, but some of the regression terms are allowed to differ to model population heterogeneity. This is what hierarchical models do.
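A minimal sketch of one such hierarchical model, assuming patient-level data from both trials and using the lme4 package (the variable names, the dataset dat, and the random-slope structure are illustrative choices, not anything prescribed above):

    library(lme4)
    ## Y: outcome, A: treatment (0/1), study: which RCT the patient came from.
    ## The fixed effect of A is the part of the treatment effect shared across studies;
    ## the random slope (A | study) lets each study's effect deviate from that shared part.
    fit <- lmer(Y ~ A + (1 + A | study), data = dat)
    fixef(fit)["A"]    # pooled average treatment effect across the studies
    ranef(fit)$study   # study-specific deviations (the heterogeneity)
    ## (with only 2 studies the variance components are barely identified;
    ## this is just to show the structure)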
In general we have E[f(Y1(a), Y2(a))] - E[f(Y1(a’), Y2(a’))], for some f(.,.) that we should justify. At this level, things are completely non-parametric. We can model the relationship of A and Y1, Y2 however we want. We can model f however we want.
If we have one RCT and one observational study, we still have Y1(a), Y1(a’) for the RCT, and Y2(a), Y2(a’) for the observational study. To determine the latter we use “interventionist approaches” to express them in terms of observational data. We then combine things using f(.,.) as before. As before we should justify all the modeling we are doing.
I am pretty sure Bareinboim has thought about this stuff (but he doesn’t do statistical inference, just the general setup).
I am pretty sure it is not going to let you take an effect size and a standard error from a correlational study and get out an accurate posterior distribution of the causal effect without doing something similar to what I’m proposing.
Ok, and how do we model them? I am proposing a multilevel mixture model to compare them.
Which is not going to work, since in most, if not all, of these studies the original patient-level data is not going to be available, and you’re not even going to get a correlation matrix out of the published paper; and I haven’t seen any interventionist-style algorithms which work with just the effect sizes, which is what is on offer.
To work with the sparse data that is available, you are going to have to do something in between a meta-analysis and an interventionist analysis.
Ok. You can use whatever statistical model you want, as long as we are clear what the underlying object is you are dealing with. The difficulty here isn’t the statistical modeling, but being clear about what it is that is being estimated (in other words the interpretation of the parameters of the model). This is why I don’t talk about statistical modeling at first.
If all you have is reported effect sizes you won’t get anything good out. You need the data they used.
Is there anyone you would recommend studying in addition?
Depends on what you want. It doesn’t matter “who has priority” when it comes to learning the subject. Pearl’s book is good, but one big disadvantage of reading just Pearl is Pearl does not deal with the statistical inference end of causal inference very much (by choice). Actually, I heard Pearl has a new book in the works, more suitable for teaching.
But ultimately we must draw causal conclusions from actual data, so statistical inference is important. Some big names that combine causal and statistical inference: Jamie Robins, Miguel Hernan, Eric Tchetgen Tchetgen, Tyler VanderWeele (Harvard causal group), Mark van der Laan (Berkeley), Donald Rubin et al (Harvard), Frangakis, Rosenblum, Scharfstein, etc. (Johns Hopkins causal group), Andrea Rotnitzky (Harvard), Susan Murphy (Michigan), Thomas Richardson (UW), Philip Dawid (Cambridge, but retired, incidentally the inventor of conditional independence notation). Lots of others.
I believe Stephen Cole posts here, and he does this stuff also (http://sph.unc.edu/adv_profile/stephen-r-cole-phd/).
Miguel Hernan and Jamie Robins are working on a new causal inference book that is more statistical, might be worth a look. Drafts available online:
http://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
You specialise in identifying the determinants of biases in causal inference? Just curious :) Interesting
And how to make those biases go away, yes.
You’re using correlation in what I would consider a weird way. Randomization is intended to control for selection effects to reduce confounds, but when somebody says correlational study I get in my head that they mean an observational study in which no attempt was made to determine predictive causation. When an effect shows up in a nonrandomized study, it’s not that you can’t determine whether the effect was causative; it’s that it’s more difficult to determine whether the causation was due to the independent variable or an extraneous variable unrelated to the independent variable. It’s not a question of whether the effect is due to correlation or causation, but whether the relationship between the independent and dependent variable even exists at all.
(1) Observational studies are almost always attempts to determine causation. Sometimes the investigators try to pretend that they aren’t, but they aren’t fooling anyone, least of all the general public. I know they are attempting to determine causation because nobody would be interested in the results of the study unless they were interested in causation. Moreover, I know they are attempting to determine causation because they do things like “control for confounding”. This procedure is undefined unless the goal is to estimate a causal effect
(2) What do you mean by the sentence “the study was causative”? Of course nobody is suggesting that the study itself had an effect on the dependent variable?
(3) Assuming that the statistics were done correctly and that the investigators have accounted for sampling variability, the relationship between the independent and dependent variable definitely exists. The correlation is real, even if it is due to confounding. It just doesn’t represent a causal effect
You are assuming a couple of things which are almost always true in your (medical) field, but are not necessarily true in general. For example,
Nope. Another very common reason is to create a predictive model without caring about actual causation. If you can’t do interventions but would like to forecast the future, that’s all you need.
That further assumes your underlying process is stable and is not subject to drift, regime changes, etc. Sometimes you can make that assumption, sometimes you cannot.
You’d also like a guarantee that others can’t do interventions, or else your measure could be gamed. (But if there’s an actual causal relationship, then ‘gaming’ isn’t really possible.)
(1) I just think calling a nonrandomized study a correlational study is weird.
(2) I meant to say effect; not study; fixed
(3) If something is caused by a confounding variable, then the independent variable may have no relationship with the dependent variable. You seem to be using correlation to mean the result of an analysis, but I’m thinking of it as the actual real relationship which is distinct from causation. So y=x does not mean y causes x or that x causes y.
I don’t understand what you mean by “real relationship”. I suggest tabooing the terms “real relationship” and “no relationship”.
I am using the word “correlation” to discuss whether the observed variable X predicts the observed variable Y in the (hypothetical?) superpopulation from which the sample was drawn. Such a correlation can exist even if neither variable causes the other.
If X predicts Y in the superpopulation (regardless of causality), the correlation will indeed be real. The only possible definition I can think of for a “false” correlation is one that does not exist in the superpopulation, but which appears in your sample due to sampling variability. Statistical methodology is in general more than adequate to discuss whether the appearance of correlation in your sample is due to real correlation in the superpopulation. You do not need causal inference to reason about this question. Moreover, confounding is not relevant.
Confounding and causal inference are only relevant if you want to know whether the correlation in the superpopulation is due to the causal effect of X on Y. You can certainly define the causal effect as the “actual real relationship”, but then I don’t understand how it is distinct from causation.
Right. Which is the problem randomization attempts to correct for, which I think of as a separate problem from causation.
No. Randomization abolishes confounding, not sampling variability
If your problem is sampling variability, the answer is to increase the power.
If your problem is confounding, the ideal answer is randomization and the second best answer is modern causality theory.
Statisticians study the first problem, causal inference people study the second problem
Intersample variability is a type of confound. Increasing sample size is another method for reducing confounding due to intersample variability. Maybe you meant intrasample variability, but that doesn’t make much sense to me in context. Maybe you think of intersample variability as sampling error? Or maybe you have a weird definition of confounding?
Either way, confounding is a separate problem from causation. You can isolate the confounding variables from the independent variable to determine the correlation between x and y without determining a causal relationship. You can also determine the presence of a causal relationship without isolating the independent variable from possible confounding variables.
The nonrandomized studies are determining causality; they’re just doing a worse job at isolating the independent variable, which is what gwern appears to be talking about here.
No it isn’t
I use the standard definition of confounding based on whether E(Y| X=x) = E(Y| Do(X=x)), and think about it in terms of whether there exists a backdoor path between X and Y.
The concept of confounding is defined relative to the causal query of interest. If you don’t believe me, try to come up with a coherent definition of confounding that does not depend on the causal question.
With standard statistical techniques you will be able to determine the correlation between X and Y. You will also be able to determine the correlation between X and Y conditional on Z. These are both valid questions, and they are both true correlations. Whether either of those correlations is interesting depends on your causal question and on whether Z is a confounder for that particular query.
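A toy R simulation of exactly this situation, with illustrative variables Z, X, Y: Z confounds X and Y, and X has no causal effect on Y. The marginal correlation between X and Y is real and nonzero; the correlation conditional on Z is ~0; which of the two is interesting depends on the causal query.

    set.seed(1)
    n <- 1e5
    Z <- rnorm(n)
    X <- Z + rnorm(n)          # Z -> X
    Y <- Z + rnorm(n)          # Z -> Y; no arrow from X to Y
    cor(X, Y)                  # ~0.5: a true correlation in the superpopulation
    coef(lm(Y ~ X + Z))["X"]   # ~0: no X-Y association once Z is conditioned on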
No you can’t. (Unless you have an instrumental variable, in which case you have to make the assumption that the instrument is unconfounded instead of the treatment of interest)
Anders_H, you are much more patient than I am!
(re: last sentence, also have to assume no direct effect of instrument, but I am sure you knew that, just emphasizing the confounding assumption since discussion is about confounding).
Grand parent’s attitude is precisely what is wrong with LW culture’s complete and utter lack of epistemic/social humility (which I think they inherited from Yudkowsky and his planet-sized ego). Him telling you of all people that you are using a weird definition of confounding is incredibly amusing.
I just realized the randomized-nonrandomized study was just an example and not what you were talking about.