Second, I don’t believe you. I say it’s always smarter to use the partitioned data than the aggregate data. If you have a data set that includes the gender of the subject, you’re always better off building two models (one for each gender) instead of one big model. Why throw away information?
If you believe the OP’s assertion
Similarly, for just about any given set of data, you can find some partition which reverses the apparent correlation
then it is demonstrably false that your strategy always improves matters. Why do you believe that your strategy is better?
Partitioning may reverse the correlation or it may not; either way, it provides a more accurate model.
Let’s do this formally. Let R, G, and T be the three variables of interest in the OP’s example, corresponding to Recovery, Gender, and Treatment. Then the goal is to obtain a model of the probability of R, given T and maybe G. My assertion is that a model of the form P(R|G,T) is always going to be more accurate than a model of the form P(R|T) alone—you can’t gain anything by throwing away the G variable. The accuracy can be measured in terms of the log-likelihood of the data given the model.
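Concretely, writing N(r,g,t) for the number of subjects with recovery status r, gender g, and treatment t (my notation, just to pin down the criterion):

```latex
% Log-likelihood of the data under each model; N(r,g,t) is the number of subjects
% with recovery status r, gender g, and treatment t.
\mathcal{L}_{T}  \;=\; \sum_{r,g,t} N(r,g,t)\,\log \hat{P}(r \mid t),
\qquad
\mathcal{L}_{GT} \;=\; \sum_{r,g,t} N(r,g,t)\,\log \hat{P}(r \mid g,t).

% With empirical (maximum-likelihood) conditionals, \hat{P}(r \mid g,t) maximizes the
% contribution of each (g,t) cell separately (Gibbs' inequality), so on the data used
% to fit the models, \mathcal{L}_{GT} \ge \mathcal{L}_{T}.
```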
It is actually tautologically true that P(R|G,T) will provide at least as high a log-likelihood as P(R|T) on the data used to fit the models. The issue raised by RobinZ is that P(R|G,T) might overfit the data: the accuracy improvement achieved by including G might not justify the increase in model complexity. That will certainly happen if naive modeling methods are used, but there are ways to incorporate multiple information sources without overfitting.
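A quick toy check of both halves of that claim, with made-up numbers in which G is pure noise (so the richer model can only win on the training data):

```python
# Toy check (made-up example): the G-aware model never loses on the training data,
# but can lose on held-out data when G is uninformative, unless smoothing is applied.
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """G is pure noise; T really shifts P(R = 1) from 0.4 to 0.6."""
    g = rng.integers(0, 2, n)
    t = rng.integers(0, 2, n)
    r = (rng.random(n) < np.where(t == 1, 0.6, 0.4)).astype(int)
    return r, g, t

def fit(r, cond, alpha=0.0):
    """Per-cell estimate of P(R = 1 | cond), with optional Laplace smoothing alpha."""
    probs = {}
    for key in set(map(tuple, cond)):
        mask = np.all(cond == key, axis=1)
        probs[key] = (r[mask].sum() + alpha) / (mask.sum() + 2 * alpha)
    return probs

def loglik(r, cond, probs):
    p1 = np.array([probs.get(tuple(row), 0.5) for row in cond])
    p = np.where(r == 1, p1, 1.0 - p1)
    return float(np.log(np.clip(p, 1e-12, None)).sum())

r, g, t = simulate(80)                       # small training set
r_test, g_test, t_test = simulate(10_000)    # large held-out set

cond_t, cond_gt = t[:, None], np.column_stack([g, t])
model_t, model_gt = fit(r, cond_t), fit(r, cond_gt)

print("train log-lik:  T-only =", round(loglik(r, cond_t, model_t), 1),
      " G,T =", round(loglik(r, cond_gt, model_gt), 1))    # G,T is never worse here
print("test  log-lik:  T-only =", round(loglik(r_test, t_test[:, None], model_t), 1),
      " G,T =", round(loglik(r_test, np.column_stack([g_test, t_test]), model_gt), 1))
      # G,T is often worse here unless alpha > 0
```

Setting alpha > 0 in fit (Laplace smoothing) is one very simple instance of the kind of method that uses the extra variable without overfitting; hierarchical or Bayesian shrinkage estimators are better-behaved versions of the same idea.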
Partitioning may reverse the correlation or it may not; either way, it provides a more accurate model.
Usually. But partitioning reduces the number of samples within each partition, and can thus amplify the effects of chance. This is even worse if you have a lot of variables floating around that you can partition on. At some point it becomes easy to choose a partition that, purely by coincidence, looks very predictive on this data set but actually has no causal role.
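A toy simulation of the "lots of variables to partition on" worry, with made-up numbers:

```python
# Sketch (made-up numbers): with many unrelated binary variables available to
# partition on, the best-looking partition appears quite predictive purely by chance.
import numpy as np

rng = np.random.default_rng(1)
n, n_candidates = 50, 200
outcome = rng.integers(0, 2, n)                      # outcome with no real structure at all
candidates = rng.integers(0, 2, (n_candidates, n))   # 200 pure-noise partition variables

# "Predictiveness" of each candidate: gap in outcome rate between its two groups.
gaps = [abs(outcome[c == 1].mean() - outcome[c == 0].mean()) for c in candidates]
print(f"best of {n_candidates} noise partitions: gap = {max(gaps):.2f}")  # typically 0.3 or more
```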
The issue raised by RobinZ is that P(R|G,T) might overfit the data: the accuracy improvement achieved by including G might not justify the increase in model complexity.
My assertion is that a model of the form P(R|G,T) is always going to be more accurate than a model of the form P(R|T) alone—you can’t gain anything by throwing away the G variable.
That’s all true (modulo the objection about overfitting). However, there is the case where T affects G, which in turn affects R. (Presumably this doesn’t apply when T = treatment and G = gender.) If what we’re interested in is the effect of T on R (irrespective of which other variables ‘transmit’ the causal influence), then conditioning on G may obscure the pattern we’re trying to detect.
(Apologies for not writing the above paragraph using rigorous language, but hopefully the point is obvious enough.)
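To make that concrete with made-up numbers, using a hypothetical mediator M in place of G (since a treatment obviously can’t cause gender): here T’s entire effect on R flows through M, and conditioning on M makes it disappear.

```python
# Made-up chain T -> M -> R (M is a hypothetical mediator, not one of the OP's variables).
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
t = rng.integers(0, 2, n)
m = (rng.random(n) < np.where(t == 1, 0.9, 0.1)).astype(int)   # T strongly raises M
r = (rng.random(n) < np.where(m == 1, 0.8, 0.2)).astype(int)   # R depends only on M

total  = r[t == 1].mean() - r[t == 0].mean()
within = r[(t == 1) & (m == 1)].mean() - r[(t == 0) & (m == 1)].mean()
print("total effect of T on R:        ", round(total, 3))    # about 0.48
print("effect within the M = 1 stratum:", round(within, 3))  # about 0
```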
Usually. But partitioning reduces the number of samples within each partition, and can thus amplify the effects of chance. This is even worse if you have a lot of variables floating around that you can partition on. At some point it becomes easy to choose a partition that, purely by coincidence, looks very predictive on this data set but actually has no causal role.
Exactly.