It may appear that the partitioned data always give a better answer than the segregated data. Unfortunately, this just isn’t true.
Good post, thanks. One comment:
First, I assume you mean “aggregated”; otherwise this statement doesn’t make sense.
Second, I don’t believe you. I say it’s always smarter to use the partitioned data than the aggregate data. If you have a data set that includes the gender of the subject, you’re always better off building two models (one for each gender) instead of one big model. Why throw away information?
There is a nugget of truth to your claim, which is that sometimes the partitioning strategy becomes impractical. To see why, consider what happens when you first partition on gender, then on history of heart disease. The number of partitions jumps from two to four, meaning there are fewer data samples in each partition. When you add a couple more variables, you will have more partitions than data samples, meaning that most partitions will be empty.
So you don’t always want to do as much partitioning as you plausibly could. Instead, you want to figure out how to combine the single-partition statistics corresponding to each condition (gender, history, etc.) into one large predictive model. This can be attacked with techniques like AdaBoost or MaxEnt.
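A quick sketch of that blow-up (the sample size, the uniform cell assignment, and the function name are all my own invention for illustration, not anything from the thread):

```python
import random

def occupied_cells(n_samples, k, seed=0):
    """Assign n_samples subjects to the 2**k cells you get by partitioning
    on k binary variables, and count how many cells end up non-empty."""
    rng = random.Random(seed)
    cells = {tuple(rng.randint(0, 1) for _ in range(k)) for _ in range(n_samples)}
    return len(cells)

# With 1000 subjects, two cells (gender alone) are both well populated, but
# by 15 binary variables there are 32768 cells and at most 1000 of them can
# be occupied, so most cells are necessarily empty.
for k in (1, 2, 5, 10, 15):
    print(k, "variables:", 2 ** k, "cells,", occupied_cells(1000, k), "occupied")
```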
Because, as von Neumann is supposed to have said, “with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” Unless your data are good enough to support the existence of the other factors, or you have other data available that do so, a model you fit to the finest-grained partitions is likely to capture more noise than reality.
Right, so the challenge is to incorporate as much auxiliary information as possible without overfitting. That’s what AdaBoost does—if you run it for T rounds, the complexity of the model you get is linear in T, not exponential as you would get from fitting the model to the finest partitions.
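Not anything from the thread itself, but a minimal from-scratch AdaBoost over decision stumps makes the “linear in T” point concrete: the learned model is literally a list of T weighted stumps, so adding rounds grows it additively, never combinatorially.

```python
import math

def stump_predict(j, thr, pol, x):
    """A decision stump: predict +pol if feature j exceeds thr, else -pol."""
    return pol if x[j] > thr else -pol

def adaboost(X, y, T):
    """Minimal AdaBoost over decision stumps. The returned model is a list of
    T (feature, threshold, polarity, alpha) tuples: linear in T."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(T):
        # exhaustively pick the stump with the lowest weighted error
        best = None
        for j in range(len(X[0])):
            for thr in (-0.5, 0.0, 0.5):
                for pol in (1, -1):
                    err = sum(wi for wi, xi, yi in zip(w, X, y)
                              if stump_predict(j, thr, pol, xi) != yi)
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((j, thr, pol, alpha))
        # reweight: misclassified samples get more weight next round
        w = [wi * math.exp(-alpha * yi * stump_predict(j, thr, pol, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    score = sum(alpha * stump_predict(j, thr, pol, x)
                for j, thr, pol, alpha in model)
    return 1 if score >= 0 else -1

# Toy data (invented): the label depends on the first feature only.
X = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
y = [-1, -1, 1, 1]
model = adaboost(X, y, 3)
```

Running it for larger T just appends more stumps to the list, which is exactly the contrast with fitting a separate statistic to each of the exponentially many partitions.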
This is in general one of the advantages of Bayesian statistics: you can split the difference between aggregated and fully partitioned data with techniques that automatically include partial pooling and information sharing between the levels of the analysis. (See pretty much anything written by Andrew Gelman; Bayesian Data Analysis is a great book covering Gelman’s whole perspective.)
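A toy version of the partial pooling being described (the shrinkage rule is the standard normal-model heuristic; the `prior_strength` parameter and the data are invented for the sketch):

```python
def partial_pool(groups, prior_strength=5.0):
    """Shrink each group's mean toward the grand mean; small groups are
    shrunk hard, large groups mostly keep their own estimate."""
    all_obs = [x for obs in groups.values() for x in obs]
    grand_mean = sum(all_obs) / len(all_obs)
    pooled = {}
    for name, obs in groups.items():
        n = len(obs)
        group_mean = sum(obs) / n
        weight = n / (n + prior_strength)  # more data -> trust the group more
        pooled[name] = weight * group_mean + (1 - weight) * grand_mean
    return pooled

# A well-sampled group and a one-observation group: the lone observation is
# pulled strongly toward the grand mean instead of being taken at face value.
groups = {"men": [0.6, 0.7, 0.8, 0.7, 0.6, 0.7], "women": [0.2]}
estimates = partial_pool(groups)
print(estimates)
```

This is the sense in which hierarchical models sit between the fully aggregated model (weight 0 everywhere) and the fully partitioned one (weight 1 everywhere).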
The OP’s assertion is true. Stratifying on certain variables can introduce bias.
Suppose you have a cohort of initially healthy men, and you are trying to quantify the causal relationship between an exposure (e.g. eating hamburgers) and an outcome (e.g. death). You have also measured a third variable: angina pectoris (cardiovascular disease).
Assume that the true underlying causal structure, which you are unaware of, is that hamburgers cause cardiovascular disease, which subsequently causes death.
Now look at what happens if you stratify on cardiovascular disease: in the stratum consisting of men who don’t have cardiovascular disease, you will find no cases of death. This will lead you to conclude that in men who don’t have cardiovascular disease, eating hamburgers does not cause death. This is false, as eating hamburgers will cause them to develop cardiovascular disease and then die.
What you have done in this situation is stratify on a mediator, thereby “blocking” the causal pathway running through it. There are also many other situations in which adjusting for a variable introduces bias, but it gets more complicated from here.
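The scenario is easy to simulate. All the probabilities below are invented; the only thing that matters is the causal chain hamburgers → cardiovascular disease → death:

```python
import random

random.seed(1)

def simulate(n=100_000):
    """Generate (burgers, cvd, death) triples from the chain
    burgers -> cardiovascular disease -> death (made-up probabilities)."""
    people = []
    for _ in range(n):
        burgers = random.random() < 0.5
        # burgers raise the chance of cardiovascular disease
        cvd = random.random() < (0.6 if burgers else 0.1)
        # in this toy world, only cvd causes death
        death = cvd and random.random() < 0.5
        people.append((burgers, cvd, death))
    return people

def p_death(people, burgers=None, cvd=None):
    """Empirical death rate in the subgroup matching the given conditions."""
    sel = [d for b, c, d in people
           if (burgers is None or b == burgers) and (cvd is None or c == cvd)]
    return sum(sel) / len(sel)

people = simulate()
# Marginally, hamburger eaters die far more often...
print(p_death(people, burgers=True), p_death(people, burgers=False))
# ...but within the cvd=False stratum nobody dies at all, so the exposure
# looks harmless once you stratify on the mediator.
print(p_death(people, burgers=True, cvd=False))
```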
For further information on this I suggest reading an upcoming book called “Causal Inference”, by James Robins and Miguel Hernan, who taught me this material. The first ten chapters are available for free online at http://www.hsph.harvard.edu/faculty/miguel-hernan/files/hernanrobins_v1.10.9.pdf .
If you believe the OP’s assertion that “for just about any given set of data, you can find some partition which reverses the apparent correlation”, then it is demonstrably false that your strategy always improves matters. Why do you believe that your strategy is better?
Partitioning may reverse the correlation or it may not; either way, it provides a more accurate model.
Let’s do this formally. Let R, G, and T be the three variables of interest in the OP’s example, corresponding to Recovery, Gender, and Treatment. Then the goal is to obtain a model of the probability of R, given T and maybe G. My assertion is that a model of the form P(R|G,T) is always going to be more accurate than a model of the form P(R|T) alone—you can’t gain anything by throwing away the G variable. The accuracy can be measured in terms of the log-likelihood of the data given the model.
It is actually tautologically true that P(R|G,T) will provide at least as high a log-likelihood on the training data as P(R|T). The issue raised by RobinZ is that P(R|G,T) might overfit the data: the accuracy improvement achieved by including G might not justify the increase in model complexity. That will certainly happen if naive modeling methods are used, but there are ways to incorporate multiple information sources without overfitting.
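The tautology is easy to check numerically: fit the empirical conditional distributions and compare training log-likelihoods. The eight (G, T, R) data points below are invented for the sketch:

```python
import math
from collections import Counter

# Invented (G, T, R) triples: gender, treatment, recovery.
data = [
    ("m", 1, 1), ("m", 1, 1), ("m", 1, 0), ("m", 0, 0),
    ("f", 1, 0), ("f", 1, 0), ("f", 0, 1), ("f", 0, 1),
]

def log_likelihood(data, condition):
    """Training log-likelihood of R under the empirical (maximum-likelihood)
    conditional distribution, conditioning on whatever `condition` selects."""
    counts = Counter()
    totals = Counter()
    for g, t, r in data:
        key = condition(g, t)
        counts[(key, r)] += 1
        totals[key] += 1
    return sum(math.log(counts[(condition(g, t), r)] / totals[condition(g, t)])
               for g, t, r in data)

ll_fine = log_likelihood(data, lambda g, t: (g, t))  # P(R | G, T)
ll_coarse = log_likelihood(data, lambda g, t: t)     # P(R | T)
print(ll_fine, ll_coarse)  # the finer conditioning never does worse
```

On held-out data the ordering can of course flip, which is exactly the overfitting objection.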
Usually. But partitioning reduces the number of samples within each partition, and can thus amplify the effects of chance. This is even worse if you have a lot of variables floating around that you can partition on. At some point it becomes easy to choose a partition that, purely by coincidence, appears very predictive on this data set but actually has no causal role.
Exactly.
That’s all true (modulo the objection about overfitting). However, there is the case where T affects G, which in turn affects R. (Presumably this doesn’t apply when T = treatment and G = gender.) If what we’re interested in is the effect of T on R (irrespective of which other variables ‘transmit’ the causal influence), then conditioning on G may obscure the pattern we’re trying to detect.
(Apologies for not writing the above paragraph using rigorous language, but hopefully the point is obvious enough.)
Let’s say the only data we’d collected were gender and whether or not the patient’s birthday was a Tuesday. Do you really think there is something to be gained from building four separate models now?
More seriously, if you collect enough information, then purely by chance there will be some partitioning of the data which gives the wrong conclusion.
I don’t think we disagree on anything important here. The main point is that you need to be careful when choosing which partitions of the data you use; arbitrarily partitioning along every available divide is not optimal.
PS—thanks for the typo correction, I really need to learn to proofread...
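The “collect enough information” failure mode is easy to demonstrate by brute force. The sample size, feature count, and agreement measure below are all invented for the sketch:

```python
import random

random.seed(0)

n, k = 40, 200  # 40 patients, 200 coin-flip "covariates" (both made up)
outcome = [random.randint(0, 1) for _ in range(n)]
features = [[random.randint(0, 1) for _ in range(n)] for _ in range(k)]

def agreement(feature, outcome):
    """Fraction of patients where a (meaningless) feature matches the outcome."""
    return sum(f == o for f, o in zip(feature, outcome)) / len(outcome)

# Every feature is pure noise, yet searching over enough of them turns up
# one whose agreement with the outcome looks impressive by chance alone.
best = max(agreement(f, outcome) for f in features)
print(best)  # well above the 0.5 a genuinely useless feature "should" score
```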
Don’t partition on things that are caused by the factor you want to analyze.