So we get pairs of studies, more or less testing the same thing except one is randomized and the other is correlational.
If I got such data I would (a) be very happy, (b) use the RCT to inform policy, and (c) use the pair to point out how correct causal inference methods can recover the RCT result if assumptions hold (hopefully they hold in the observational study). We can try to combine the strengths of the two studies, but then the results live or die by assumptions on how treatments were assigned in the observational study.
I am also not a fan of classifying biases like they do (it’s a common silly practice). For example, it’s really not informative to say “confounding bias”; in reality you can have a lot of types of confounding, with different solutions necessary depending on the type (I like to draw pictures to understand this).
I think Robins et al (?Hernan?) at some point recovered the result of an RCT via his g methods from observational data.
The controversy about hormone replacement therapy is fascinating as a case study. Until 2002, essentially all women who reached menopause got medical advice to start taking pills containing horse estrogen. It was very widely believed that this would reduce their risk of having a heart attack. This belief was primarily based on biological plausibility: Estrogen is known to reduce cholesterol, and cholesterol is believed to increase the risk of heart disease. Also, there were many observational studies that seemingly suggested that women who took hormone replacement therapy (HRT) had less risk of heart disease. (In my view, this was not surprising: Observational studies always show what the investigators expect to find.)
In 2002, the Women’s Health Initiative randomized trial was stopped early because it showed that estrogen replacement therapy actually substantially increased the risk of having a heart attack. Overnight, the medical establishment stopped recommending estrogen for menopausal women. But a perhaps more important consequence was that many clinicians stopped trusting observational studies altogether. In my opinion, this was mostly a good thing.
The largest observational study to show a protective effect of estrogen was the Nurses’ Health Study. In 2008, my thesis advisor Miguel Hernan re-analyzed this dataset using Jamie Robins’ g-methods (which are equivalent to Pearl’s methods), and was essentially able to reproduce the results of the WHI trial. Miguel’s paper uses valid methods and gets the correct results. In my view, this shows that the new methods might work, but the paper would have meant much more if it had been published prior to the randomized trials.
Miguel and Jamie’s paper sparked off a very interesting methodological debate with the original investigators at the Nurses Health Study, who are still clinging to their original analysis. See http://www.ncbi.nlm.nih.gov/pubmed/18813017 .
Many people still believe that Estrogen/HRT is beneficial. The most popular theory is that WHI recruited too many old women (sometimes in their 90s!) and that estrogen is harmful if given that long after menopause. A new randomized trial which is limited to women at menopause is currently being conducted. A second theory is that the results in the trial were due to differences in statin usage. I analyzed the second theory for my doctoral thesis, but found that this had negligible impact on the results.
It is also interesting to note that while it is true that the trial found that estrogen increased the risk of heart disease, it also showed a (non-significant) reduction in all-cause mortality. So the increased risk of cardiovascular disease didn’t even result in more deaths. Presumably, people care more about all-cause mortality than heart attacks. However, since it was “non-significant”, not even the most dedicated proponents of estrogen treatment ever point out this fact.
A side question, prompted by an amusing factoid in the Hernan paper: ”...we restricted the population to women who had reported plausible energy intakes (2510 –14,640 kJ/d)”.
In the statistical analysis in this paper, and also as a general practice in medical publications based on questionnaire data, are there adjustments for uncertainty in the questionnaire responses?
When you have a data point that says, for example, that person #12345 reports her caloric intake as 4,000 calories/day, do you take it as a hard precise number, or do you take it as an imprecise estimate with its own error which propagates into the model uncertainty, etc.?
Keyword is “measurement error.” People think hard about this. Anders_H knows this paper in a lot more detail than I do, but I expect these particular authors to be careful.
This issue is also related to “missing data.” What you see might be different from the underlying truth in systematic ways, e.g. you get systematic bias in your data, and you need to deal with that. This is also related to that causal inference stuff I keep going on about.
Keyword is “measurement error.” People think hard about this.
People like engineers and physicists think a lot about this. I am not sure that medical researchers think a lot about this. The usual (easy) way is to throw out unreasonable-looking responses during the data cleaning and then take what remains as rock-solid. Accepting that your independent variables are uncertain leads to a lot of inconvenient problems (starting with the OLS regression not being a theoretically-correct form any more).
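To make the point about uncertain independent variables concrete, here is a minimal simulation sketch of classical measurement error. Everything in it is an assumption for illustration (the true slope of 2, the normal distributions, reporting noise with the same variance as the true value); none of it comes from the Hernan paper or from any real questionnaire data.

```python
import random


def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(0)   # fixed seed so the sketch is reproducible
n = 20000
true_slope = 2.0         # hypothetical true effect of x on y

# True (unobserved) exposure and an outcome that depends on it.
x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [true_slope * x + rng.gauss(0.0, 1.0) for x in x_true]

# Reported exposure = truth + independent reporting noise (assumed variance 1).
x_reported = [x + rng.gauss(0.0, 1.0) for x in x_true]

slope_clean = ols_slope(x_true, y)      # regression on the true exposure
slope_noisy = ols_slope(x_reported, y)  # regression on the noisy self-report

print(f"slope using true x:     {slope_clean:.3f}")
print(f"slope using reported x: {slope_noisy:.3f}")
```

Under these assumptions the regression on the reported values is pulled toward zero by a factor of about var(x) / (var(x) + var(noise)) = 0.5, which is the sense in which plain OLS stops being the theoretically correct form once the regressors are noisy.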
What you see might be different from the underlying truth in systematic ways, e.g. you get systematic bias in your data, and you need to deal with that.
Yes, that’s another can of worms. In some areas (e.g. self-reported food intake) the problem is so blatant and overwhelming that you have to deal with it, but if it looks minor not many people want to bother.
Just putting the idea out for comment in case there’s some way this fails to deliver what I want it to deliver. Excerpting out all the comparisons and writing up the mixture model in JAGS would be a lot of work; just reading the papers takes long enough as it is.
If I got such data I would (a) be very happy, (b) use the RCT to inform policy, and (c) use the pair to point out how correct causal inference methods can recover the RCT result if assumptions hold (hopefully they hold in the observational study)
Indeed. You can imagine that when I stumbled across Deeks and the rest of them in Google Scholar (my notes), I was overjoyed by their obvious utility (and because it meant I didn’t have to do it myself, as I was musing about doing using FDA trials) but also completely baffled: why had I never heard of these papers before?
I am not following your mixture model idea. For every data point you know if it comes from the RCT or observational study. You don’t need uncertainty about treatment assignment. What you need is figuring out how to massage observational data to get causal conclusions (e.g. what I think about all day long).
If you have specific observational data you want to look at, email me if you want to chat more.
For every data point you know if it comes from the RCT or observational study. You don’t need uncertainty about treatment assignment.
No, the uncertainty here isn’t about which of the two studies a datapoint came from, but about whether (for a specific treatment/intervention) the correlational study datapoint was drawn from the same distribution as the randomized study datapoint or a different one, and (over all treatments/interventions) what the probability of being drawn from the same distribution is. Maybe it’ll be a little clearer if I narrate how the model might go.
So say you start off with a prior probability of 50-50 about which group a result is drawn from, a switching probability that will be tweaked as you look at data. (If you are studying turtles which could be from a large or a small species, then if you find 2 larger turtles and 8 smaller, you’re probably going to update from P=0.5 to a mixture probability more like P>0.20, since it’s most likely—but not certain—that 1 or 2 of the larger turtles came from the large species and the 8 smaller ones came from the small species.)
For your first datapoint, you have a pair of results: xyzcillin reduces all-cause mortality to RR=0.5 from a correlational study (cohort, cross-sectional, case-control, whatever), and the randomized study of xyzcillin has RR=1.1. What does this mean? Now, of course you know that 0.5 is the correlational result and 1.1 is the randomized result, but we can imagine two relatively distinct scenarios here: ‘xyzcillin actually works but the causal effect is really more like RR=0.7 and the randomized trial was underpowered’, or, ‘xyzcillin has no causal effect whatsoever on mortality and it’s just a bunch of powerful confounds producing results like RR=0.6-0.8’. We observe that 1.1 supports the latter more, and we update towards ‘xyzcillin has 0 effect’ and now give ‘non-causal scenarios are 55% likely’, but not too much because the xyzcillin studies were small and underpowered and so they don’t support the latter scenario that much.
Then for the next datapoint, ‘abcmycin reduces lung cancer’, we get a pair looking like 0.9 and 0.7, and we observe these large trials are very consistent with each other and so they highly support the former theory instead and we update towards ‘abcmycin causally reduces lung cancer’ and ‘noncausal scenarios are 39% likely’.
Then for the third datapoint about defracic surgery for back pain, we again get consistency like d=0.7 and d=0.5 and we update the probability that ‘defracic surgery reduces back pain’ and also push even further to ‘noncausal scenarios are 36% likely’ because their sample sizes were decent.
And we do update for each pair we finish, and after bouncing back and forth with each pair, we wind up with an estimate that Nature draws from the non-causal scenario 37% of the time (ie the switching probability of the mixture is p=0.37). And now we can use that as a prior in evaluating any new medicine or surgery.
If you have specific observational data you want to look at, email me if you want to chat more.
If you want to look at specific study-pairs, they’re all listed & properly cited in the papers I’ve collated & provided fulltext links for. I suspect that the more advanced methods will require individual level patient data, which sadly only a very few studies will release, but perhaps you can still find enough of those to make it worth your while and analyze if Robins et al can get a publishable paper out of just 1 RCT.
If I understood you correctly, there are two separate issues here.
The first is what people call “transportability” (how to sensibly combine results of multiple studies if units in those studies aren’t the same). People try all sorts of things (Gelman does random effects models I think?). Pearl’s student Elias Bareinboim (now at Purdue) thinks about that stuff using graphs.
I wish I could help, but I don’t know as much about this subject as I want. Maybe I should think about it more.
The second issue is that in addition to units in two studies “not being the same” one study is observational (has weird treatment assignment) and one is randomized properly. That part I know a lot about, that’s classical causal inference—how to massage observational data to make it look like an RCT.
I would advise thinking about these problems separately, that is start trying to solve combining two RCTs.
edit: I know you are trying to describe things to me on the level of individual points to help me understand. But I think a more helpful way to go is to ignore sampling variability entirely, and just start with two joint distributions P1 and P2 that represent variables in your two studies (in other words you assume infinite sample size, so you get the distributions exactly). How do we combine them into a single conclusion (let’s say the “average causal effect”: difference in outcome means under treatment vs placebo)? Even this is not so easy to work out.
I would advise thinking about these problems separately, that is start trying to solve combining two RCTs.
I think when you break it into two separate problems like that, you miss the point. Combining two RCTs is reasonably well-solved by multilevel random effects models. I’m also not trying to solve the problem of inferring from a correlational dataset to specific causal models, which seems well in hand by Pearlean approaches. I’m trying to bridge between the two: assume a specific generative model for correlation vs causation and then infer the distribution.
How do we combine them into a single conclusion (let’s say the “average causal effect”: difference in outcome means under treatment vs placebo)?
But this is exactly the problem! Apparently, there is no meaningful ‘average causal effect’ between correlational and causational studies. In one study, it was much larger; in the next, it was a little smaller; in the next, it was much smaller; in the one after that, the sign reversed… If you create a regular multilevel meta-analysis of a bunch of randomized and correlational studies, say, and you toss in a fixed-effect covariate and regress ‘Y ~ Randomized’, you get an estimate of ~0. The actual effect in each case may be quite large, but the average over all the studies is a wash.
This is different from other methodological problems. With placebos, there is a predictable systematic bias which gives you a large positive bias. Likewise, publication bias skews effects up. Likewise, non-blinding of raters. And so on and so forth. You can easily estimate with an additive fixed-effect / linear model and correct for particular biases. But with random vs correlation, it seems that there’s no particular direction the effects head in, you just know that whatever they are, they’ll be different from your correlational results. So you need to do something more imaginative in modeling.
But I think a more helpful way to go is to ignore sampling variability entirely, and just start with two joint distributions P1 and P2 that represent variables in your two studies (in other words you assume infinite sample size, so you get the distributions exactly).
OK, let’s imagine all our studies are infinite sized. I collect 5 study-pairs, correlational vs randomized, d effect size:
0.5 vs 0.1 (difference: 0.4)
-0.22 vs −0.22 (difference: 0)
−0.8 vs 0.2 (difference: −1.0)
0.3 vs 0.3 (difference: 0)
0.5 vs −0.1 (difference: 0.6)
I apply my mixture model strategy.
We see that in study #2 and #4, the correlational and causal effects are identical, 100% confidence, and thus both were drawn from the randomized distribution. With two datapoints −0.22 and 0.3, we begin to infer that the distribution of causal effects is probably fairly narrow around 0 and we update our normal distribution appropriately to be skeptical about any claims of large causal effects.
We see in study #1, #3, and #5, that the correlational and causal effects differ, 100% confidence, and thus we know that the correlational effect for that particular treatment was drawn from the general correlational distribution. The correlational effects are .5, -.8, and .5, all quite large, and so we infer that correlational effects tend to be quite large and their distribution has a large standard deviation (or whatever).
We then note that in 2⁄5 of the pairs, the correlational effect was the causal effect, and so we estimate that the probability of a correlational effect having been drawn from the causal distribution rather than the correlation distribution is P=2/5. Or in other words, correlation=causality 40% of the time. However, if we had tried to calculate an additive variable like in a meta-regression, we would find that the Randomized covariate was estimated at exactly 0 (mean(c(0.4, 0, -1.0, 0, 0.6)) ~> [1] 0) and certainly is not statistically-significant.
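The arithmetic in this toy example can be checked with a few lines of Python. This is only a sketch of the bookkeeping, not of the mixture model itself; the third pair is entered as −0.8 vs 0.2, which is what the stated difference of −1.0 and the listed correlational effects imply, and the tolerance `tol` is an illustrative choice.

```python
from statistics import mean

# (correlational d, randomized d) for the five infinite-sized study pairs.
pairs = [(0.5, 0.1), (-0.22, -0.22), (-0.8, 0.2), (0.3, 0.3), (0.5, -0.1)]

differences = [corr - rct for corr, rct in pairs]

tol = 1e-9  # treat differences below this as "identical" (floating-point slack)

# Share of pairs where the correlational effect equals the causal effect.
share_identical = sum(abs(d) < tol for d in differences) / len(pairs)

# What an additive "Randomized" covariate would pick up: the average shift.
average_shift = mean(differences)

# The typical size of the disagreement, ignoring its sign.
average_abs_shift = mean(abs(d) for d in differences)

print(share_identical, average_shift, average_abs_shift)
```

The average signed shift is essentially zero even though the typical absolute disagreement is not, which is why an additive correction has nothing to estimate here while the 2/5 mixing proportion does.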
Now when someone comes to us with an infinite-sized correlational trial that purified Egyptian mummy reduces allergy symptoms by d=0.5, we feed it into our mixture model and we get a useful posterior distribution which exhibits a bimodal pattern where it is heavily peaked at 0 (reflecting the more-likely-than-not scenario that mummy is mummery) but also peaked at d=0.4 or so, reflecting shrinkage of the scenario that mummy is munificent, which will predict better than if we naively tried to just shift the d=0.5 posterior distribution up or down some units.
The problem with real studies is that they are not infinitely sized, so when the point-estimates disagree and we get 0.45 vs 0.5, obviously we cannot strongly conclude which distribution in the mixture it was drawn from, and so we need to propagate that uncertainty through the whole model and all its parameters.
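A rough sketch of what propagating that uncertainty could look like, using a grid approximation instead of JAGS. The standard errors, the confounding spread `tau`, the flat prior, and the choice to model only the correlational-minus-randomized difference are all assumptions made for illustration; a full version would also estimate `tau` and the two effect-size distributions rather than fixing them.

```python
import math


def normal_pdf(x, sd):
    """Density of a zero-mean normal with standard deviation sd at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


# Correlational-minus-randomized differences from the five toy pairs.
differences = [0.4, 0.0, -1.0, 0.0, 0.6]

se_corr = 0.05  # hypothetical standard error of each correlational estimate
se_rct = 0.05   # hypothetical standard error of each randomized estimate
tau = 0.5       # assumed spread of the confounding bias when it is present

# "Same distribution": the difference is pure sampling noise.
sd_same = math.sqrt(se_corr ** 2 + se_rct ** 2)
# "Different distribution": sampling noise plus confounding bias.
sd_confounded = math.sqrt(se_corr ** 2 + se_rct ** 2 + tau ** 2)

# Grid over p = probability that a correlational result is confounded,
# with a flat prior on p.
grid = [i / 1000 for i in range(1, 1000)]
weights = []
for p in grid:
    likelihood = 1.0
    for d in differences:
        likelihood *= (p * normal_pdf(d, sd_confounded)
                       + (1.0 - p) * normal_pdf(d, sd_same))
    weights.append(likelihood)

total = sum(weights)
posterior = [w / total for w in weights]
posterior_mean_p = sum(p * w for p, w in zip(grid, posterior))

print(f"posterior mean P(confounded) = {posterior_mean_p:.2f}")
```

With finite standard errors, a difference of exactly 0 no longer proves that a pair came from the same distribution, so the posterior for the switching probability is a spread-out curve rather than the point value 2/5.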
I think when you break it into two separate problems like that, you miss the point.
I am pretty sure I am not, but let’s see. What you are basically saying is “analysis ⇒ synthesis doesn’t work.”
Combining two RCTs is reasonably well-solved by multilevel random effects models.
Hierarchical models are a particular parametric modeling approach for data drawn from multiple sources. People use this type of stuff to good effect, but saying it “solves the problem” here is sort of like saying linear regression “solves” RCTs. What if the modeling assumptions are wrong? What if you are not sure what the model should be?
I’m also not trying to solve the problem of inferring from a correlational dataset to specific causal models, which seems well in hand by Pearlean approaches.
Let’s call them “interventionist approaches.” Pearl is just the guy people here read. People have been doing causal analysis from observational data since at least the 70s, probably earlier in certain special cases.
I’m trying to bridge between the two: assume a specific generative model for correlation vs causation and then infer the distribution.
Ok.
But this is exactly the problem! Apparently, there is no meaningful ‘average causal effect’ between correlational and causational studies.
This is what we should talk about.
If there is one RCT, we have a treatment A (with two levels, a and a’) and outcome Y. Of interest is the outcome under hypothetical treatment assignment to a value, which we write Y(a) or Y(a’). The “average causal effect” is E[Y(a)] - E[Y(a’)]. So far so good.
If there is one observational study, say A is assigned based on C, and C affects Y, what is of interest is still Y(a) or Y(a’). Interventionist methods would give you a formula for E[Y(a)] - E[Y(a’)] in terms of p(A,C,Y). You can then construct an estimator for that formula, and life is good. So far so good.
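For the simplest case described here (a single measured C that drives treatment assignment and also affects Y), the formula is the adjustment formula E[Y(a)] = sum over c of E[Y | A=a, C=c] p(c). The simulation below is a sketch under assumed numbers: a binary C, a true effect of 1.0, and a made-up assignment mechanism. It compares the naive difference in means with the plug-in estimator of that formula.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible
n = 40000
true_effect = 1.0       # hypothetical E[Y(a)] - E[Y(a')]

rows = []
for _ in range(n):
    c = 1 if rng.random() < 0.5 else 0                   # confounder
    a = 1 if rng.random() < (0.8 if c else 0.2) else 0   # A is assigned based on C
    y = true_effect * a + 2.0 * c + rng.gauss(0.0, 1.0)  # C also affects Y
    rows.append((a, c, y))


def mean_y(a, c=None):
    """Sample mean of Y among units with A == a (and C == c, if given)."""
    ys = [y for (ai, ci, y) in rows if ai == a and (c is None or ci == c)]
    return sum(ys) / len(ys)


# Naive contrast: E[Y | A=1] - E[Y | A=0], which mixes in the effect of C.
naive = mean_y(1) - mean_y(0)

# Adjustment formula: sum_c (E[Y | A=1, c] - E[Y | A=0, c]) * p(c).
p_c = {c: sum(1 for (_, ci, _) in rows if ci == c) / n for c in (0, 1)}
adjusted = sum((mean_y(1, c) - mean_y(0, c)) * p_c[c] for c in (0, 1))

print(f"naive:    {naive:.2f}")
print(f"adjusted: {adjusted:.2f}")
```

The within-stratum means here are just one choice of estimator; the formula itself says nothing about how E[Y | A, C] is modeled, and with time-varying treatments or unmeasured confounding a different identification formula would be needed.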
Note that so far I made no modeling assumptions on the relationship of A and Y at all. It’s all completely unrestricted by choice of statistical model. I could use a crazy non-parametric random forest to model the relationship of A and Y if I wanted. I could use linear regression. I could use whatever. This is important: people often smuggle in modeling assumptions “too soon.” When we are talking about prediction problems like in machine learning, that’s ok. We don’t care about modeling too much; we just want good predictive performance. When we care about effects, the model is important. This is because if the effect is not strong and your model is garbage, it can mislead you.
If there are two RCTs, we have two sets of outcomes: Y1(a), Y1(a’) and Y2(a), Y2(a’). Even here, there is no one causal effect so far. We need to make some sort of assumption on how to combine these. For example, we may try to generalize regression models, and say that a lot of the way A affects Y is the same regression across the two studies, but some of the regression terms are allowed to differ to model population heterogeneity. This is what hierarchical models do.
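As a concrete instance of such a parametric choice, here is a sketch of the DerSimonian-Laird random-effects estimate, a standard summary-level method that works from effect sizes and their variances rather than from the regression terms themselves. The two effect sizes and variances are made-up numbers, and with only two studies the between-study variance is very poorly estimated, so this is illustrative only.

```python
import math


def random_effects_pool(effects, variances):
    """DerSimonian-Laird pooled effect, its standard error, and tau^2."""
    k = len(effects)
    w = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)

    # Cochran's Q measures how much the studies disagree beyond sampling noise.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-study variance, floored at 0

    # Random-effects weights add tau^2 to each study's own variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2


# Two hypothetical RCTs of the same treatment: effect sizes and their variances.
pooled, se, tau2 = random_effects_pool([0.5, 0.1], [0.01, 0.01])
print(pooled, se, tau2)
```

The modeling assumption being smuggled in is that the study-specific effects are draws from a single normal distribution with variance tau^2, which is exactly the kind of choice that needs justifying.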
In general we have E[f(Y1(a), Y2(a))] - E[f(Y1(a’),Y2(a’))], for some f(.,.) that we should justify. At this level, things are completely non-parametric. We can model the relationship of A and Y1,Y2 however we want. We can model f however we want.
If we have one RCT and one observational study, we still have Y1(a), Y1(a’) for the RCT, and Y2(a), Y2(a’) for the observational study. To determine the latter we use “interventionist approaches” to express them in terms of observational data. We then combine things using f(.,.) as before. As before we should justify all the modeling we are doing.
I am pretty sure Bareinboim thought about this stuff (but he doesn’t do statistical inference, just the general setup).
What you are basically saying is “analysis ⇒ synthesis doesn’t work.”
I am pretty sure it is not going to let you take an effect size and a standard error from a correlation study and get out an accurate posterior distribution of the causal effect without doing something similar to what I’m proposing.
If there are two RCTs, we have two sets of outcomes: Y1(a), Y1(a’) and Y2(a), Y2(a’). Even here, there is no one causal effect so far. We need to make some sort of assumption on how to combine these. For example, we may try to generalize regression models, and say that a lot of the way A affects Y is the same regression across the two studies, but some of the regression terms are allowed to differ to model population heterogeneity. This is what hierarchical models do. In general we have E[f(Y1(a), Y2(a))] - E[f(Y1(a’),Y2(a’))], for some f(.,.) that we should justify. At this level, things are completely non-parametric. We can model the relationship of A and Y1,Y2 however we want. We can model f however we want.
Ok, and how do we model them? I am proposing a multilevel mixture model to compare them.
If we have one RCT and one observational study, we still have Y1(a), Y1(a’) for the RCT, and Y2(a), Y2(a’) for the observational study. To determine the latter we use “interventionist approaches” to express them in terms of observational data. We then combine things using f(.,.) as before. As before we should justify all the modeling we are doing.
Which is not going to work since in most, if not all, of these studies, the original patient-level data is not going to be available and you’re not even going to get a correlation matrix out of the published paper, and I haven’t seen any intervention-style algorithms which work with just the effect sizes which is what is on offer.
To work with the sparse data that is available, you are going to have to do something in between a meta-analysis and an interventionist analysis.
I am proposing a multilevel mixture model to compare them.
Ok. You can use whatever statistical model you want, as long as we are clear what the underlying object is you are dealing with. The difficulty here isn’t the statistical modeling, but being clear about what it is that is being estimated (in other words the interpretation of the parameters of the model). This is why I don’t talk about statistical modeling at first.
haven’t seen any intervention-style algorithms which work with just the effect sizes which is what is on offer.
If all you have is reported effect sizes you won’t get anything good out. You need the data they used.
Depends on what you want. It doesn’t matter “who has priority” when it comes to learning the subject. Pearl’s book is good, but one big disadvantage of reading just Pearl is Pearl does not deal with the statistical inference end of causal inference very much (by choice). Actually, I heard Pearl has a new book in the works, more suitable for teaching.
But ultimately we must draw causal conclusions from actual data, so statistical inference is important. Some big names that combine causal and statistical inference: Jamie Robins, Miguel Hernan, Eric Tchetgen Tchetgen, Tyler VanderWeele (Harvard causal group), Mark van der Laan (Berkeley), Donald Rubin et al (Harvard), Frangakis, Rosenblum, Scharfstein, etc. (Johns Hopkins causal group), Andrea Rotnitzky (Harvard), Susan Murphy (Michigan), Thomas Richardson (UW), Philip Dawid (Cambridge, but retired, incidentally the inventor of conditional independence notation). Lots of others.
Hi.
I am not sure I understand your question.
The paper you are referring to is “Observational Studies Analyzed Like Randomized Experiments: An application to Postmenopausal Hormone Therapy and Coronary Heart Disease” by Hernan et al. It is available at https://cdn1.sph.harvard.edu/wp-content/uploads/sites/343/2013/03/observational-studies.pdf
Clinicians do not; “methodology people” (who often partner up with “domain experts” to do data analysis) absolutely do think about measurement error.
Yes, I was told the full gory details of this story (not going to repeat it here). Thanks for sharing this!
By the way, are you at Stanford now? I should find a way to drop by, Jacob’s there too.
Just putting the idea out for comment in case there’s some way this fails to deliver what I want it to deliver. Excerpting out all the comparisons and writing up the mixture model in JAGS would be a lot of work; just reading the papers takes long enough as it is.
Indeed. You can imagine that when I stumbled across Deeks and the rest of then in Google Scholar (my notes), I was overjoyed by their obvious utility (and because it meant I didn’t have to do it myself, as I was musing about doing using FDA trials) but also completely baffled: why had I never heard of these papers before?
I am not following your mixture model idea. For every data point you know if it comes from the RCT or observational study. You don’t need uncertainty about treatment assignment. What you need is figuring out how to massage observational data to get causal conclusions (e.g. what I think about all day long).
If you have specific observational data you want to look at, email me if you want to chat more.
No, the uncertainty here isn’t about which of the two studies a datapoint came from, but about whether (for a specific treatment/intervention) the correlational study datapoint was drawn from the same distribution as the randomized study datapoint or a different one, and (over all treatments/interventions) what the probability of being drawn from the same distribution is. Maybe it’ll be a little clearer if I narrate how the model might go.
So say you start off with a prior probability of 50-50 about which group a result is drawn from, a switching probability that will be tweaked as you look at data. (If you are studying turtles which could be from a large or a small species, then if you find 2 larger turtles and 8 smaller, you’re probably going to update from P=0.5 to a mixture probability more like P>0.20, since it’s most likely—but not certain—that 1 or 2 of the larger turtles came from the large species and the 8 smaller ones came from the small species.)
For your first datapoint, you have a pair of results: xyzcillin reduces all-cause mortality to RR=0.5 from a correlational study (cohort, cross-sectional, case-control, whatever), and the randomized study of xyzcillin has RR=1.1. What does this mean? Now, of course you know that 0.5 is the correlational result and 1.1 is the randomized result, but we can imagine two relatively distinct scenarios here: ‘xyzcillin actually works but the causal effect is really more like RR=0.7 and the randomized trial was underpowered’, or, ‘xyzcillin has no causal effect whatsoever on mortality and it’s just a bunch of powerful confounds producing results like RR=0.6-0.8’. We observe that 1.1 supports the latter more, and we update towards ‘xyzcillin has 0 effect’ and now give ‘non-causal scenarios are 55% likely’, but not too much because the xyzcillin studies were small and underpowered and so they don’t support the latter scenario that much.
Then for the next datapoint, ‘abcmycin reduces lung cancer’, we get a pair looking like 0.9 and 0.7, and we observe these large trials are very consistent with each other and so they highly support the former theory instead and we update towards ‘abcmycin causally reduces lung cancer’ and ‘noncausal scenarios are 39% likely’.
Then for the third datapoint about defracic surgery for back pain, we again get consistency like d=0.7 and d=0.5 and we update the probability that ‘defracic surgery reduces back pain’ and also push even further to ‘noncausal scenarios are 36% likely’ because their sample sizes were decent.
And we do an update like this for each pair we finish, and after bouncing back and forth with each pair, we wind up with an estimate that Nature draws from the non-causal scenario 37% of the time (ie the switching probability of the mixture is p=0.37). And now we can use that as a prior in evaluating any new medicine or surgery.
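To make the bouncing back and forth concrete, here is a minimal sketch of that updating, under assumptions that are mine and not part of the proposal above: each pair is reduced to a difference (correlational minus randomized, on some common scale) with a standard error, the "same distribution" component is a normal centered on 0 with that standard error, and the "different distribution" component adds an assumed extra spread `TAU` for confounding. The three pairs are made-up numbers chosen only to show the direction of the updates; a real version would estimate `TAU` and the effect distributions too (e.g. in JAGS).

```python
import math

def normal_pdf(x, sd):
    """Density at x of a zero-mean normal with standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def update(weights, grid, diff, se, tau):
    """One Bayesian update of a grid posterior over p = P(non-causal scenario)."""
    same = normal_pdf(diff, se)                                   # correlational = causal
    different = normal_pdf(diff, math.sqrt(se ** 2 + tau ** 2))   # confounded (assumed form)
    posterior = [w * (p * different + (1 - p) * same) for w, p in zip(weights, grid)]
    total = sum(posterior)
    return [w / total for w in posterior]

GRID = [i / 100 for i in range(101)]   # candidate values of the switching probability
TAU = 0.5                              # assumed spread of confounding bias (hypothetical)

# Hypothetical (difference, standard error) pairs: one small discordant pair,
# then two larger and more consistent ones.
pairs = [(-0.8, 0.5), (0.05, 0.1), (0.1, 0.15)]

weights = [1 / len(GRID)] * len(GRID)  # flat prior, so the prior mean is 0.5
means = []
for diff, se in pairs:
    weights = update(weights, GRID, diff, se, TAU)
    means.append(sum(w * p for w, p in zip(weights, GRID)))

print([round(m, 2) for m in means])
```

The small discordant pair nudges the posterior mean of p a little above 0.5, and the two consistent pairs pull it back down, which is the qualitative pattern narrated above.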
If you want to look at specific study-pairs, they’re all listed & properly cited in the papers I’ve collated & provided fulltext links for. I suspect that the more advanced methods will require individual level patient data, which sadly only a very few studies will release, but perhaps you can still find enough of those to make it worth your while and analyze if Robins et al can get a publishable paper out of just 1 RCT.
If I understood you correctly, there are two separate issues here.
The first is what people call “transportability” (how to sensibly combine results of multiple studies if units in those studies aren’t the same). People try all sorts of things (Gelman does random effects models, I think?). Pearl’s student Elias Bareinboim (now at Purdue) thinks about that stuff using graphs.
I wish I could help, but I don’t know as much about this subject as I want. Maybe I should think about it more.
The second issue is that in addition to units in two studies “not being the same” one study is observational (has weird treatment assignment) and one is randomized properly. That part I know a lot about, that’s classical causal inference—how to massage observational data to make it look like an RCT.
I would advise thinking about these problems separately, that is start trying to solve combining two RCTs.
edit: I know you are trying to describe things to me on the level of individual points to help me understand. But I think a more helpful way to go is to ignore sampling variability entirely, and just start with two joint distributions P1 and P2 that represent variables in your two studies (in other words you assume infinite sample size, so you get the distributions exactly). How do we combine them into a single conclusion (let’s say the “average causal effect”: difference in outcome means under treatment vs placebo)? Even this is not so easy to work out.
I think when you break it into two separate problems like that, you miss the point. Combining two RCTs is reasonably well-solved by multilevel random effects models. I’m also not trying to solve the problem of inferring from a correlational dataset to specific causal models, which seems well in hand by Pearlean approaches. I’m trying to bridge between the two: assume a specific generative model for correlation vs causation and then infer the distribution.
But this is exactly the problem! Apparently, there is no meaningful ‘average causal effect’ between correlational and causational studies. In one study, it was much larger; in the next, it was a little smaller; in the next, it was much smaller; in the one after that, the sign reversed… If you create a regular multilevel meta-analysis of a bunch of randomized and correlational studies, say, and you toss in a fixed-effect covariate and regress ‘Y ~ Randomized’, you get an estimate of ~0. The actual effect in each case may be quite large, but the average over all the studies is a wash.
This is different from other methodological problems. With placebos, there is a predictable systematic bias which gives you a large positive bias. Likewise, publication bias skews effects up. Likewise, non-blinding of raters. And so on and so forth. You can easily estimate with an additive fixed-effect / linear model and correct for particular biases. But with random vs correlation, it seems that there’s no particular direction the effects head in, you just know that whatever they are, they’ll be different from your correlational results. So you need to do something more imaginative in modeling.
OK, let’s imagine all our studies are infinite sized. I collect 5 study-pairs, correlational vs randomized, d effect size:
0.5 vs 0.1 (difference: 0.4)
-0.22 vs −0.22 (difference: 0)
−0.8 vs 0.2 (difference: −1.0)
0.3 vs 0.3 (difference: 0)
0.5 vs −0.1 (difference: 0.6)
I apply my mixture model strategy.
We see that in study #2 and #4, the correlational and causal effects are identical, 100% confidence, and thus both were drawn from the randomized distribution. With two datapoints −0.22 and 0.3, we begin to infer that the distribution of causal effects is probably fairly narrow around 0 and we update our normal distribution appropriately to be skeptical about any claims of large causal effects.
We see in study #1, #3, and #5, that the correlational and causal effects differ, 100% confidence, and thus we know that the correlational effect for that particular treatment was drawn from the general correlational distribution. The correlational effects are 0.5, −0.8, and 0.5, all quite large, and so we infer that correlational effects tend to be quite large and their distribution has a large standard deviation (or whatever).
We then note that in 2⁄5 of the pairs, the correlational effect was the causal effect, and so we estimate that the probability of a correlational effect having been drawn from the causal distribution rather than the correlation distribution is P=2/5. Or in other words, correlation=causality 40% of the time. However, if we had tried to calculate an additive variable like in a meta-regression, we would find that the Randomized covariate was estimated at exactly 0 (
mean(c(0.4, 0, -1.0, 0, 0.6)) ~> [1] 0
) and certainly is not statistically significant.
Now when someone comes to us with an infinite-sized correlational trial that purified Egyptian mummy reduces allergy symptoms by d=0.5, we feed it into our mixture model and we get a useful posterior distribution which exhibits a bimodal pattern where it is heavily peaked at 0 (reflecting the more-likely-than-not scenario that mummy is mummery) but also peaked at d=0.4 or so, reflecting shrinkage of the scenario that mummy is munificent, which will predict better than if we naively tried to just shift the d=0.5 posterior distribution up or down some units.
The problem with real studies is that they are not infinitely sized, so when the point-estimates disagree and we get 0.45 vs 0.5, obviously we cannot strongly conclude which distribution in the mixture it was drawn from, and so we need to propagate that uncertainty through the whole model and all its parameters.
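The arithmetic of the infinite-sample toy example can be checked in a few lines of Python. This is only a sketch of the two summaries being contrasted, not the mixture model itself, and it takes the third pair as −0.8 (correlational) vs 0.2 (randomized), which is what the stated difference of −1.0 and the list of correlational effects imply.

```python
# Five study-pairs, d effect sizes, assumed to be measured without sampling error.
correlational = [0.5, -0.22, -0.8, 0.3, 0.5]
randomized = [0.1, -0.22, 0.2, 0.3, -0.1]

# Additive view: the average shift between correlational and randomized results.
differences = [c - r for c, r in zip(correlational, randomized)]
mean_difference = sum(differences) / len(differences)

# Mixture view: how often the correlational effect simply equals the causal one.
share_identical = sum(c == r for c, r in zip(correlational, randomized)) / len(differences)

print(round(mean_difference, 6), share_identical)
```

The average difference is zero (up to floating-point rounding) even though three of the five pairs disagree substantially, while the share of identical pairs recovers the 2/5 switching probability.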
I am pretty sure I am not, but let’s see. What you are basically saying is “analysis ⇒ synthesis doesn’t work.”
Hierarchical models are a particular parametric modeling approach for data drawn from multiple sources. People use this type of stuff to good effect, but saying it “solves the problem” here is sort of like saying linear regression “solves” RCTs. What if the modeling assumptions are wrong? What if you are not sure what the model should be?
Let’s call them “interventionist approaches.” Pearl is just the guy people here read. People have been doing causal analysis from observational data since at least the 70s, probably earlier in certain special cases.
Ok.
This is what we should talk about.
If there is one RCT, we have a treatment A (with two levels a, and a’) and outcome Y. Of interest is outcome under hypothetical treatment assignment to a value, which we write Y(a) or Y(a’). “Average causal effect” is E[Y(a)] - E[Y(a’)]. So far so good.
If there is one observational study, say A is assigned based on C, and C affects Y, what is of interest is still Y(a) or Y(a’). Interventionist methods would give you a formula for E[Y(a)] - E[Y(a’)] in terms of p(A,C,Y). You can then construct an estimator for that formula, and life is good. So far so good.
Note that so far I made no modeling assumptions on the relationship of A and Y at all. It’s all completely unrestricted by choice of statistical model. I can do crazy non-parametric random forest to model the relationship of A and Y if I wanted. I can do linear regression. I can do whatever. This is important: people often smuggle in modeling assumptions “too soon.” When we are talking about prediction problems like in machine learning, that’s ok. We don’t care about modeling too much; we just want good predictive performance. When we care about effects, the model is important. This is because if the effect is not strong and your model is garbage, it can mislead you.
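For the single observational study, where A is assigned based on C and C is the only thing confounding A and Y, the formula in question is the standard adjustment formula: E[Y(a)] - E[Y(a’)] = sum over c of (E[Y | a, c] - E[Y | a’, c]) p(c). Below is a minimal simulated sketch of it. The data-generating numbers (a binary C, treatment probabilities of 0.8 and 0.2, a true effect of 0) are made up for illustration, and plain stratification is used as the estimator only because C is binary here.

```python
import random

random.seed(0)

def simulate(n=20000, true_effect=0.0):
    """Observational data: C raises both the chance of treatment and the outcome."""
    rows = []
    for _ in range(n):
        c = 1 if random.random() < 0.5 else 0
        a = 1 if random.random() < (0.8 if c else 0.2) else 0
        y = true_effect * a + 1.0 * c + random.gauss(0, 1)
        rows.append((a, c, y))
    return rows

def mean(xs):
    return sum(xs) / len(xs)

def naive_difference(rows):
    """E[Y | A=1] - E[Y | A=0], ignoring C."""
    return mean([y for a, _, y in rows if a == 1]) - mean([y for a, _, y in rows if a == 0])

def adjusted_difference(rows):
    """sum_c (E[Y | A=1, C=c] - E[Y | A=0, C=c]) * p(C=c)."""
    total = 0.0
    for c in (0, 1):
        stratum = [r for r in rows if r[1] == c]
        p_c = len(stratum) / len(rows)
        treated = mean([y for a, _, y in stratum if a == 1])
        control = mean([y for a, _, y in stratum if a == 0])
        total += (treated - control) * p_c
    return total

rows = simulate()
naive = naive_difference(rows)
adjusted = adjusted_difference(rows)
print(round(naive, 2), round(adjusted, 2))
```

The naive contrast picks up the confounding through C, while the adjusted one lands near the true effect of 0. Nothing here restricts how E[Y | a, c] is modeled; stratified means just happen to be the simplest choice for this toy setup.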
If there are two RCTs, we have two sets of outcomes: Y1(a), Y1(a’) and Y2(a), Y2(a’). Even here, there is no one causal effect so far. We need to make some sort of assumption on how to combine these. For example, we may try to generalize regression models, and say that a lot of the way A affects Y is the same regression across the two studies, but some of the regression terms are allowed to differ to model population heterogeneity. This is what hierarchical models do.
In general we have E[f(Y1(a), Y2(a))] - E[f(Y1(a’),Y2(a’))], for some f(.,.) that we should justify. At this level, things are completely non-parametric. We can model the relationship of A and Y1,Y2 however we want. We can model f however we want.
If we have one RCT and one observational study, we still have Y1(a), Y1(a’) for the RCT, and Y2(a), Y2(a’) for the observational study. To determine the latter we use “interventionist approaches” to express them in terms of observational data. We then combine things using f(.,.) as before. As before we should justify all the modeling we are doing.
I am pretty sure Bareinboim thought about this stuff (but he doesn’t do statistical inference, just the general setup).
I am pretty sure it is not going to let you take an effect size and a standard error from a correlational study and get out an accurate posterior distribution of the causal effect without doing something similar to what I’m proposing.
Ok, and how do we model them? I am proposing a multilevel mixture model to compare them.
Which is not going to work since in most, if not all, of these studies, the original patient-level data is not going to be available and you’re not even going to get a correlation matrix out of the published paper, and I haven’t seen any intervention-style algorithms which work with just the effect sizes which is what is on offer.
To work with the sparse data that is available, you are going to have to do something in between a meta-analysis and an interventionist analysis.
Ok. You can use whatever statistical model you want, as long as we are clear what the underlying object is you are dealing with. The difficulty here isn’t the statistical modeling, but being clear about what it is that is being estimated (in other words the interpretation of the parameters of the model). This is why I don’t talk about statistical modeling at first.
If all you have is reported effect sizes you won’t get anything good out. You need the data they used.
Is there anyone you would recommend studying in addition?
Depends on what you want. It doesn’t matter “who has priority” when it comes to learning the subject. Pearl’s book is good, but one big disadvantage of reading just Pearl is Pearl does not deal with the statistical inference end of causal inference very much (by choice). Actually, I heard Pearl has a new book in the works, more suitable for teaching.
But ultimately we must draw causal conclusions from actual data, so statistical inference is important. Some big names that combine causal and statistical inference: Jamie Robins, Miguel Hernan, Eric Tchetgen Tchetgen, Tyler VanderWeele (Harvard causal group), Mark van der Laan (Berkeley), Donald Rubin et al (Harvard), Frangakis, Rosenblum, Scharfstein, etc. (Johns Hopkins causal group), Andrea Rotnitzky (Harvard), Susan Murphy (Michigan), Thomas Richardson (UW), Philip Dawid (Cambridge, but retired, incidentally the inventor of conditional independence notation). Lots of others.
I believe Stephen Cole posts here, and he does this stuff also (http://sph.unc.edu/adv_profile/stephen-r-cole-phd/).
Miguel Hernan and Jamie Robins are working on a new causal inference book that is more statistical, might be worth a look. Drafts available online:
http://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
You specialise in identifying the determinants of biases in causal inference? Just curious :) Interesting
And how to make those biases go away, yes.