gwern comments on Open thread, Dec. 21 - Dec. 27, 2015

gwern 22 Dec 2015 23:08 UTC
2 points

I am not sure I understand your question.

Just putting the idea out for comment in case there’s some way this fails to deliver what I want it to deliver. Excerpting out all the comparisons and writing up the mixture model in JAGS would be a lot of work; just reading the papers takes long enough as it is.

If I got such data I would (a) be very happy, (b) use the RCT to inform policy, and (c) use the pair to point out how correct causal inference methods can recover the RCT result if assumptions hold (hopefully they hold in the observational study)

Indeed. You can imagine that when I stumbled across Deeks and the rest of them in Google Scholar (my notes), I was overjoyed by their obvious utility (and because it meant I didn’t have to do it myself, as I was musing about doing using FDA trials) but also completely baffled: why had I never heard of these papers before?
- IlyaShpitser 22 Dec 2015 23:27 UTC
  1 point
  Parent
  I am not following your mixture model idea. For every data point you know if it comes from the RCT or observational study. You don’t need uncertainty about treatment assignment. What you need is figuring out how to massage observational data to get causal conclusions (e.g. what I think about all day long).
  
  If you have specific observational data you want to look at, email me if you want to chat more.
  - gwern 23 Dec 2015 2:43 UTC
    3 points
    Parent
    
    For every data point you know if it comes from the RCT or observational study. You don’t need uncertainty about treatment assignment.
    
    No, the uncertainty here isn’t about which of the two studies a datapoint came from, but about whether (for a specific treatment/intervention) the correlational study datapoint was drawn from the same distribution as the randomized study datapoint or a different one, and (over all treatments/interventions) what the probability of being drawn from the same distribution is. Maybe it’ll be a little clearer if I narrate how the model might go.
    
    So say you start off with a prior probability of 50-50 about which group a result is drawn from, a switching probability that will be tweaked as you look at data. (If you are studying turtles which could be from a large or a small species, then if you find 2 larger turtles and 8 smaller, you’re probably going to update from P=0.5 to a mixture probability more like P=0.20, since it’s most likely, but not certain, that 1 or 2 of the larger turtles came from the large species and the 8 smaller ones came from the small species.)
    
    For your first datapoint, you have a pair of results: xyzcillin reduces all-cause mortality to RR=0.5 from a correlational study (cohort, cross-sectional, case-control, whatever), and the randomized study of xyzcillin has RR=1.1. What does this mean? Now, of course you know that 0.5 is the correlational result and 1.1 is the randomized result, but we can imagine two relatively distinct scenarios here: ‘xyzcillin actually works but the causal effect is really more like RR=0.7 and the randomized trial was underpowered’, or, ‘xyzcillin has no causal effect whatsoever on mortality and it’s just a bunch of powerful confounds producing results like RR=0.6-0.8’. We observe that 1.1 supports the latter more, and we update towards ‘xyzcillin has 0 effect’ and now give ‘non-causal scenarios are 55% likely’, but not too much because the xyzcillin studies were small and underpowered and so they don’t support the latter scenario that much.
    
    Then for the next datapoint, ‘abcmycin reduces lung cancer’, we get a pair looking like 0.9 and 0.7, and we observe these large trials are very consistent with each other and so they highly support the former theory instead and we update towards ‘abcmycin causally reduces lung cancer’ and ‘noncausal scenarios are 39% likely’.
    
    Then for the third datapoint about defracic surgery for back pain, we again get consistency like d=0.7 and d=0.5 and we update the probability that ‘defracic surgery reduces back pain’ and also push even further to ‘noncausal scenarios are 36% likely’ because their sample sizes were decent.
    
    And we do this update for each pair we finish, and after bouncing back and forth with each pair, we wind up with an estimate that Nature draws from the non-causal scenario 37% of the time (ie the switching probability of the mixture is p=0.37). And now we can use that as a prior in evaluating any new medicine or surgery.
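A minimal sketch of this pair-by-pair updating, assuming a flat prior on the switching probability (prior mean 0.5) and a grid approximation. The per-pair likelihoods are made-up placeholders chosen only to mimic the direction of each update in the narration; a full model (for example in JAGS) would infer them jointly with the two effect distributions, and the names `posterior_over_p` and `posterior_mean` are hypothetical.

```python
def posterior_over_p(pair_likelihoods, grid_size=1001):
    """Grid posterior for p = P(a pair comes from the non-causal scenario).

    pair_likelihoods: list of (likelihood_if_noncausal, likelihood_if_causal),
    one tuple per study pair. These are assumed inputs, not estimated here.
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    weights = [1.0] * grid_size  # flat prior over p, so the prior mean is 0.5
    for lik_noncausal, lik_causal in pair_likelihoods:
        # Mixture likelihood of one pair at each candidate value of p.
        weights = [
            w * (p * lik_noncausal + (1 - p) * lik_causal)
            for w, p in zip(weights, grid)
        ]
    total = sum(weights)
    return grid, [w / total for w in weights]


def posterior_mean(grid, probs):
    """Posterior mean of the switching probability."""
    return sum(p * q for p, q in zip(grid, probs))


# Illustrative (invented) likelihoods for the three pairs in the narration.
pairs = [
    (1.2, 1.0),  # xyzcillin: small trials, weakly favours non-causal
    (0.3, 1.0),  # abcmycin: large, consistent trials favour causal
    (0.6, 1.0),  # defracic surgery: decent sample sizes, favours causal
]
grid, probs = posterior_over_p(pairs)
mean_p = posterior_mean(grid, probs)
print(f"posterior mean of switching probability: {mean_p:.2f}")
```

The order of the pairs does not matter for the final posterior; only the intermediate estimates bounce around as each pair is added.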
    
    If you have specific observational data you want to look at, email me if you want to chat more.
    
    If you want to look at specific study-pairs, they’re all listed & properly cited in the papers I’ve collated & provided fulltext links for. I suspect that the more advanced methods will require individual level patient data, which sadly only a very few studies will release, but perhaps you can still find enough of those to make it worth your while and analyze if Robins et al can get a publishable paper out of just 1 RCT.
    - IlyaShpitser 23 Dec 2015 19:33 UTC
      1 point
      Parent
      If I understood you correctly, there are two separate issues here.
      
      The first is what people call “transportability” (how to sensibly combine results of multiple studies if units in those studies aren’t the same). People try all sorts of things (Gelman does random effects models I think?). Pearl’s student Elias Bareinboim (now at Purdue) thinks about that stuff using graphs.
      
      I wish I could help, but I don’t know as much about this subject as I want. Maybe I should think about it more.
      
      The second issue is that in addition to units in two studies “not being the same” one study is observational (has weird treatment assignment) and one is randomized properly. That part I know a lot about, that’s classical causal inference—how to massage observational data to make it look like an RCT.
      
      I would advise thinking about these problems separately, that is start trying to solve combining two RCTs.
      
      edit: I know you are trying to describe things to me on the level of individual points to help me understand. But I think a more helpful way to go is to ignore sampling variability entirely, and just start with two joint distributions P1 and P2 that represent variables in your two studies (in other words you assume infinite sample size, so you get the distributions exactly). How do we combine them into a single conclusion (let’s say the “average causal effect”: difference in outcome means under treatment vs placebo)? Even this is not so easy to work out.
      - gwern 24 Dec 2015 22:51 UTC
        1 point
        Parent
        
        I would advise thinking about these problems separately, that is start trying to solve combining two RCTs.
        
        I think when you break it into two separate problems like that, you miss the point. Combining two RCTs is reasonably well-solved by multilevel random effects models. I’m also not trying to solve the problem of inferring from a correlational dataset to specific causal models, which seems well in hand by Pearlean approaches. I’m trying to bridge between the two: assume a specific generative model for correlation vs causation and then infer the distribution.
        
        How do we combine them into a single conclusion (let’s say the “average causal effect”: difference in outcome means under treatment vs placebo)?
        
        But this is exactly the problem! Apparently, there is no meaningful ‘average causal effect’ between correlational and causational studies. In one study, it was much larger; in the next, it was a little smaller; in the next, it was much smaller; in the one after that, the sign reversed… If you create a regular multilevel meta-analysis of a bunch of randomized and correlational studies, say, and you toss in a fixed-effect covariate and regress ‘Y ~ Randomized’, you get an estimate of ~0. The actual effect in each case may be quite large, but the average over all the studies is a wash.
        
        This is different from other methodological problems. With placebos, there is a predictable systematic bias which gives you a large positive bias. Likewise, publication bias skews effects up. Likewise, non-blinding of raters. And so on and so forth. You can easily estimate with an additive fixed-effect / linear model and correct for particular biases. But with random vs correlation, it seems that there’s no particular direction the effects head in, you just know that whatever they are, they’ll be different from your correlational results. So you need to do something more imaginative in modeling.
        
        But I think a more helpful way to go is to ignore sampling variability entirely, and just start with two joint distributions P1 and P2 that represent variables in your two studies (in other words you assume infinite sample size, so you get the distributions exactly).
        
        OK, let’s imagine all our studies are infinite sized. I collect 5 study-pairs, correlational vs randomized, d effect size:
        
        0.5 vs 0.1 (difference: 0.4)
        −0.22 vs −0.22 (difference: 0)
        −0.8 vs 0.2 (difference: −1.0)
        0.3 vs 0.3 (difference: 0)
        0.5 vs −0.1 (difference: 0.6)
        
        I apply my mixture model strategy.
        
        We see that in study #2 and #4, the correlational and causal effects are identical, 100% confidence, and thus both were drawn from the randomized distribution. With two datapoints −0.22 and 0.3, we begin to infer that the distribution of causal effects is probably fairly narrow around 0 and we update our normal distribution appropriately to be skeptical about any claims of large causal effects.
        
        We see in study #1, #3, and #5, that the correlational and causal effects differ, 100% confidence, and thus we know that the correlational effect for that particular treatment was drawn from the general correlational distribution. The correlational effects are 0.5, −0.8, and 0.5, all quite large, and so we infer that correlational effects tend to be quite large and that their distribution has a large standard deviation (or whatever).
        
        We then note that in ²⁄₅ of the pairs, the correlational effect was the causal effect, and so we estimate that the probability of a correlational effect having been drawn from the causal distribution rather than the correlational distribution is P=2/5. In other words, correlation=causality 40% of the time. However, if we had tried to estimate an additive term as in a meta-regression, we would find that the Randomized covariate was estimated at exactly 0 (mean(c(0.4, 0, -1.0, 0, 0.6)) ~> [1] 0) and certainly not statistically significant.
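The arithmetic of this infinite-sample toy example can be checked with a short sketch. The third pair is taken as −0.8 vs 0.2, which matches the stated difference of −1.0 and the correlational effects listed above; the variable names are illustrative.

```python
from statistics import mean

# (correlational d, randomized d) for the five hypothetical study pairs.
pairs = [(0.5, 0.1), (-0.22, -0.22), (-0.8, 0.2), (0.3, 0.3), (0.5, -0.1)]

# Correlational minus randomized, rounded to hide floating-point noise.
differences = [round(corr - rct, 10) for corr, rct in pairs]

# With infinite samples, "same distribution" is just exact equality.
same = [corr == rct for corr, rct in pairs]
p_causal = sum(same) / len(pairs)          # share of pairs that agree

# What an additive "Randomized" covariate would pick up: the mean shift.
mean_shift = mean(differences)

# The typical size of the disagreement, which the mean shift hides.
mean_abs_shift = mean(abs(d) for d in differences)

print(differences)
print(p_causal, mean_shift, mean_abs_shift)
```

The mean shift is zero even though the pairs that disagree do so by 0.4 to 1.0, which is why the additive correction carries no information here while the mixing proportion does.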
        
        Now when someone comes to us with an infinite-sized correlational trial that purified Egyptian mummy reduces allergy symptoms by d=0.5, we feed it into our mixture model and we get a useful posterior distribution which exhibits a bimodal pattern where it is heavily peaked at 0 (reflecting the more-likely-than-not scenario that mummy is mummery) but also peaked at d=0.4 or so, reflecting shrinkage of the scenario that mummy is munificent, which will predict better than if we naively tried to just shift the d=0.5 posterior distribution up or down some units.
        
        The problem with real studies is that they are not infinitely sized, so when the point-estimates disagree and we get 0.45 vs 0.5, obviously we cannot strongly conclude which distribution in the mixture it was drawn from, and so we need to propagate that uncertainty through the whole model and all its parameters.
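One way to sketch that uncertainty for a single pair, under strong simplifying assumptions: both estimates are normal with known standard errors; if the pair shares one true effect, their difference is pure sampling noise; otherwise the correlational effect carries an extra bias drawn from a zero-mean normal with standard deviation `tau`. The values of `prior_same` and `tau` are assumed here rather than estimated, and `prob_same_distribution` is a hypothetical helper, not part of any library.

```python
import math


def normal_pdf(x, sd):
    """Density of a zero-mean normal with standard deviation sd at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def prob_same_distribution(d_corr, se_corr, d_rct, se_rct,
                           prior_same=0.4, tau=0.5):
    """Posterior probability that both estimates share one true effect.

    prior_same and tau are assumptions; in a full multilevel mixture model
    they would be parameters learned from all pairs at once.
    """
    diff = d_corr - d_rct
    sd_same = math.sqrt(se_corr ** 2 + se_rct ** 2)           # noise only
    sd_diff = math.sqrt(se_corr ** 2 + se_rct ** 2 + tau ** 2)  # noise + bias
    weight_same = prior_same * normal_pdf(diff, sd_same)
    weight_diff = (1 - prior_same) * normal_pdf(diff, sd_diff)
    return weight_same / (weight_same + weight_diff)


# Modest standard errors: 0.45 vs 0.5 favours "same" but is far from certain.
p_noisy = prob_same_distribution(0.45, 0.1, 0.5, 0.1)
# Near-infinite samples: any disagreement at all rules out "same".
p_precise = prob_same_distribution(0.45, 0.001, 0.5, 0.001)
print(round(p_noisy, 2), p_precise)
```

With standard errors of 0.1 the posterior comes out at roughly 0.7, so the pair counts only partially toward either component; as the standard errors shrink, the answer collapses to the 100% confidence case described earlier.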
        IlyaShpitser 29 Dec 2015 20:00 UTC
        3 points
        Parent
        
        I think when you break it into two separate problems like that, you miss the point.
        
        I am pretty sure I am not, but let’s see. What you are basically saying is “analysis ⇒ synthesis doesn’t work.”
        
        Combining two RCTs is reasonably well-solved by multilevel random effects models.
        
        Hierarchical models are a particular parametric modeling approach for data drawn from multiple sources. People use this type of stuff to good effect, but saying it “solves the problem” here is sort of like saying linear regression “solves” RCTs. What if the modeling assumptions are wrong? What if you are not sure what the model should be?
        
        I’m also not trying to solve the problem of inferring from a correlational dataset to specific causal models, which seems well in hand by Pearlean approaches.
        
        Let’s call them “interventionist approaches.” Pearl is just the guy people here read. People have been doing causal analysis from observational data since at least the 70s, probably earlier in certain special cases.
        
        I’m trying to bridge between the two: assume a specific generative model for correlation vs causation and then infer the distribution.
        
        Ok.
        
        But this is exactly the problem! Apparently, there is no meaningful ‘average causal effect’ between correlational and causational studies.
        
        This is what we should talk about.
        
        If there is one RCT, we have a treatment A (with two levels, a and a’) and outcome Y. Of interest is the outcome under hypothetical treatment assignment to a value, which we write Y(a) or Y(a’). “Average causal effect” is E[Y(a)] - E[Y(a’)]. So far so good.
        
        If there is one observational study, say A is assigned based on C, and C affects Y, what is of interest is still Y(a) or Y(a’). Interventionist methods would give you a formula for E[Y(a)] - E[Y(a’)] in terms of p(A,C,Y). You can then construct an estimator for that formula, and life is good. So far so good.
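For this simplest case, where C is the only common cause of A and Y and is fully observed, the formula is the standard adjustment formula, E[Y(a)] = Σ_c E[Y | A=a, C=c] p(c). A small sketch with an invented discrete distribution (all numbers are hypothetical, chosen so that A has no causal effect on Y) shows how the naive contrast and the adjusted contrast can differ:

```python
# Hypothetical population: binary confounder C, binary treatment A.
p_c = {0: 0.5, 1: 0.5}               # p(C = c)
p_a1_given_c = {0: 0.2, 1: 0.8}      # p(A = 1 | C = c): C drives treatment
# E[Y | A = a, C = c]: Y depends on C only, so the true effect of A is 0.
mean_y = {(a, c): 1.0 * c + 0.0 * a for a in (0, 1) for c in (0, 1)}


def p_a_given_c(a, c):
    """p(A = a | C = c) for binary A."""
    return p_a1_given_c[c] if a == 1 else 1 - p_a1_given_c[c]


def naive_mean(a):
    """E[Y | A = a]: averages over p(c | a), so it inherits confounding."""
    p_a = sum(p_a_given_c(a, c) * p_c[c] for c in p_c)
    return sum(mean_y[(a, c)] * p_a_given_c(a, c) * p_c[c] for c in p_c) / p_a


def adjusted_mean(a):
    """E[Y(a)] via adjustment: averages over the marginal p(c)."""
    return sum(mean_y[(a, c)] * p_c[c] for c in p_c)


naive_effect = naive_mean(1) - naive_mean(0)
adjusted_effect = adjusted_mean(1) - adjusted_mean(0)
print(round(naive_effect, 3), round(adjusted_effect, 3))
```

Here the unadjusted contrast is 0.6 while the adjusted contrast is 0. The sketch works directly from p(A,C,Y), which is exactly what is unavailable when only published effect sizes are on hand.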
        
        Note that so far I made no modeling assumptions on the relationship of A and Y at all. It’s all completely unrestricted by choice of statistical model. I could use a crazy non-parametric random forest to model the relationship of A and Y if I wanted. I can do linear regression. I can do whatever. This is important: people often smuggle in modeling assumptions “too soon.” When we are talking about prediction problems like in machine learning, that’s ok. We don’t care about modeling too much; we just want good predictive performance. When we care about effects, the model is important. This is because if the effect is not strong and your model is garbage, it can mislead you.
        
        If there are two RCTs, we have two sets of outcomes: Y1(a), Y1(a’) and Y2(a), Y2(a’). Even here, there is no one causal effect so far. We need to make some sort of assumption on how to combine these. For example, we may try to generalize regression models, and say that a lot of the way A affects Y is the same regression across the two studies, but some of the regression terms are allowed to differ to model population heterogeneity. This is what hierarchical models do.
        
        In general we have E[f(Y1(a), Y2(a))] - E[f(Y1(a’),Y2(a’))], for some f(.,.) that we should justify. At this level, things are completely non-parametric. We can model the relationship of A and Y1,Y2 however we want. We can model f however we want.
        
        If we have one RCT and one observational study, we still have Y1(a), Y1(a’) for the RCT, and Y2(a), Y2(a’) for the observational study. To determine the latter we use “interventionist approaches” to express them in terms of observational data. We then combine things using f(.,.) as before. As before we should justify all the modeling we are doing.
        
        I am pretty sure Bareinboim thought about this stuff (but he doesn’t do statistical inference, just the general setup).
        gwern 30 Dec 2015 15:51 UTC
        1 point
        Parent
        
        What you are basically saying is “analysis ⇒ synthesis doesn’t work.”
        
        I am pretty sure it is not going to let you take an effect size and a standard error from a correlational study and get out an accurate posterior distribution of the causal effect without doing something similar to what I’m proposing.
        
        If there are two RCTs, we have two sets of outcomes: Y1(a), Y1(a’) and Y2(a), Y2(a’). Even here, there is no one causal effect so far. We need to make some sort of assumption on how to combine these. For example, we may try to generalize regression models, and say that a lot of the way A affects Y is the same regression across the two studies, but some of the regression terms are allowed to differ to model population heterogeneity. This is what hierarchical models do. In general we have E[f(Y1(a), Y2(a))] - E[f(Y1(a’),Y2(a’))], for some f(.,.) that we should justify. At this level, things are completely non-parametric. We can model the relationship of A and Y1,Y2 however we want. We can model f however we want.
        
        Ok, and how do we model them? I am proposing a multilevel mixture model to compare them.
        
        If we have one RCT and one observational study, we still have Y1(a), Y1(a’) for the RCT, and Y2(a), Y2(a’) for the observational study. To determine the latter we use “interventionist approaches” to express them in terms of observational data. We then combine things using f(.,.) as before. As before we should justify all the modeling we are doing.
        
        Which is not going to work since in most, if not all, of these studies, the original patient-level data is not going to be available and you’re not even going to get a correlation matrix out of the published paper, and I haven’t seen any intervention-style algorithms which work with just the effect sizes which is what is on offer.
        
        To work with the sparse data that is available, you are going to have to do something in between a meta-analysis and an interventionist analysis.
        IlyaShpitser 30 Dec 2015 19:38 UTC
        1 point
        Parent
        
        I am proposing a multilevel mixture model to compare them.
        
        Ok. You can use whatever statistical model you want, as long as we are clear what the underlying object is you are dealing with. The difficulty here isn’t the statistical modeling, but being clear about what it is that is being estimated (in other words the interpretation of the parameters of the model). This is why I don’t talk about statistical modeling at first.
        
        haven’t seen any intervention-style algorithms which work with just the effect sizes which is what is on offer.
        
        If all you have is reported effect sizes you won’t get anything good out. You need the data they used.
        Richard_Kennaway 30 Dec 2015 9:28 UTC
        1 point
        Parent
        
        Pearl is just the guy people here read.
        
        Is there anyone you would recommend studying in addition?
        IlyaShpitser 31 Dec 2015 20:47 UTC
        1 point
        Parent
        Depends on what you want. It doesn’t matter “who has priority” when it comes to learning the subject. Pearl’s book is good, but one big disadvantage of reading just Pearl is Pearl does not deal with the statistical inference end of causal inference very much (by choice). Actually, I heard Pearl has a new book in the works, more suitable for teaching.
        
        But ultimately we must draw causal conclusions from actual data, so statistical inference is important. Some big names that combine causal and statistical inference: Jamie Robins, Miguel Hernan, Eric Tchetgen Tchetgen, Tyler VanderWeele (Harvard causal group), Mark van der Laan (Berkeley), Donald Rubin et al (Harvard), Frangakis, Rosenblum, Scharfstein, etc. (Johns Hopkins causal group), Andrea Rotnitzky (Harvard), Susan Murphy (Michigan), Thomas Richardson (UW), Philip Dawid (Cambridge, but retired, incidentally the inventor of conditional independence notation). Lots of others.
        
        I believe Stephen Cole posts here, and he does this stuff also (http://sph.unc.edu/adv_profile/stephen-r-cole-phd/).
        
        Miguel Hernan and Jamie Robins are working on a new causal inference book that is more statistical, might be worth a look. Drafts available online:
        
        http://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
  - [deleted] 24 Dec 2015 12:37 UTC
    0 points
    Parent
    
    what I think about all day long
    
    You specialise in identifying the determinants of biases in causal inference? Just curious :) Interesting
    - IlyaShpitser 24 Dec 2015 17:27 UTC
      2 points
      Parent
      And how to make those biases go away, yes.