The problem is that the data is biased. The ML algorithm doesn’t know whether the bias is a natural part of the data or artificially induced. Garbage In—Garbage Out.
You’re making up excuses. The data is not ‘biased’; it just is. Nor is it garbage: it’s not made up, and no one is lying or falsifying data or anything like that. If your theory cannot handle clean data from a real-world problem, that’s a big problem (especially if there are more sophisticated alternatives which can handle it).
Biased data is a real thing and this is a great example. No method can solve the problem you’ve given without additional information.
This is not biased data. No one tampered with it. No one preferentially left out some data. There is no Cartesian daemon tampering with you. It’s a perfectly ordinary causal problem for which one has all the available data. If you run a regression on the data, you will get accurate predictions of future similar data—just not what happens when you intervene and realize the counterfactual. You can’t throw your hands up and disdainfully refuse to solve the problem, proclaiming, ‘oh, that’s biased’. It may be hard, and the best available solution may be weak or require strong assumptions, but if that is the case, the correct method should say as much and specify what additional data or interventions would allow stronger conclusions.
I’m not certain why I used the word “bias”. I think what I was getting at is that the data isn’t representative of the population of interest.
Regardless, no other method can solve the problem as specified without additional information (which you claimed). And with additional information, it’s straightforward prediction again.
That is, condition on their prior health status, not just the fact that they’ve been given the drug, and on the prior probabilities.
What do you call “solving the problem”?
Any method will output some estimates. Some methods will output better estimates, some worse. As people have pointed out, this was an example of a real problem, and yes, real-life data is usually pretty messy. We need methods which can handle messy data, not ones that work only on spherical cows in a vacuum.
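As a concrete illustration of the disagreement above, here is a minimal sketch (mine, not from the exchange) of the drug scenario as I read it: patients with worse prior health are more likely to receive the drug, so a regression on treatment alone describes future observational data of the same kind but misstates what an intervention would do, while conditioning on prior health status recovers the assumed effect. The variable names, the logistic treatment assignment, and the linear outcome model are all illustrative assumptions, not anything stated in the thread.

```python
# A hedged sketch (not from the original thread): simulate the drug example as
# I read it, with prior health status confounding both treatment and outcome.
# All numbers, names, and functional forms below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_drug_effect = 1.0            # assumed benefit of the drug on the outcome scale

# Prior health status (higher = sicker); sicker patients are likelier to get the drug.
severity = rng.normal(0.0, 1.0, n)
treated = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * severity))
outcome = true_drug_effect * treated - 2.0 * severity + rng.normal(0.0, 1.0, n)

def ols(features, y):
    """Least-squares fit with an intercept column prepended; returns coefficients."""
    X = np.column_stack([np.ones(len(y)), features])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Naive regression: outcome ~ treatment. It describes this observational
# distribution, but the treatment coefficient mixes the drug effect with severity.
naive = ols(treated.astype(float), outcome)

# Adjusted regression: outcome ~ treatment + prior health status, i.e. the
# "condition on their prior health status" step suggested above.
adjusted = ols(np.column_stack([treated.astype(float), severity]), outcome)

print(f"naive treatment coefficient:    {naive[1]:+.2f}  (the drug looks harmful)")
print(f"adjusted treatment coefficient: {adjusted[1]:+.2f}  (close to the assumed {true_drug_effect:+.1f})")
```

With these particular assumptions, the naive coefficient comes out negative even though the drug helps, while the adjusted coefficient lands near the assumed +1. Whether such an adjustment is valid on a real dataset depends on no-unmeasured-confounding and model-specification assumptions, which is the “strong assumptions” caveat raised above.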