I don’t get it. You gave some people the drug and some people you didn’t. It seems pretty straightforward to estimate how likely someone is to die if you give them medicine.
Certainly it’s straightforward. Here’s how one can apply your logic. You gave some people [the ones whose disease has progressed the most] the drug and some people you didn’t [because their disease isn’t so bad you’re willing to risk it]; the percentage of deaths in the first, drugged group is much higher than in the second, non-drugged group; therefore, this drug is poison and you’re a mass murderer.
See the problem?
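A toy numeric sketch of the sleight of hand, with all counts invented purely for illustration: within each severity stratum the drug is helpful or harmless, yet the pooled comparison makes it look lethal.

```python
# Invented counts illustrating confounding by indication.
# Format: (treated deaths, treated total, untreated deaths, untreated total)
strata = {
    "severe": (40, 100, 16, 20),   # the drug goes mostly to severe patients
    "mild":   (2,  20,  10, 100),  # and is mostly withheld from mild ones
}

for name, (td, tn, ud, un) in strata.items():
    print(f"{name}: treated {td/tn:.0%} die vs. untreated {ud/un:.0%} die")
# severe: treated 40% die vs. untreated 80% die  -> the drug helps
# mild:   treated 10% die vs. untreated 10% die  -> the drug does no harm

# The pooled comparison, i.e. what the naive analysis sees:
td, tn = (sum(s[i] for s in strata.values()) for i in (0, 1))
ud, un = (sum(s[i] for s in strata.values()) for i in (2, 3))
print(f"pooled: treated {td/tn:.0%} die vs. untreated {ud/un:.0%} die")
# pooled: treated 35% die vs. untreated 22% die -> "poison"
```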
Of course people say “but this is silly, obviously we need to condition on health status.”
The point is: what if we can’t? Or what if there are other causally relevant factors here? In fact, what is “causally relevant” anyways… We need a system! ML people don’t think about these questions very hard, generally, because culturally they are more interested in “algorithmic approaches” to prediction problems.
(This is a clarification of gwern’s response to the grandparent, not a reply to gwern.)
The problem is the data is biased. The ML algorithm doesn’t know whether the bias is a natural part of the data or artificially induced. Garbage In—Garbage Out.
However, it can still be done if the algorithm has more information. Maybe some healthy patients ended up getting the medicine anyways and were far more likely to live, or some unhealthy ones didn’t and were even more likely to die. Now it’s straightforward prediction again: how likely is a patient to live based on their current health and whether or not they take the drug?
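A minimal sketch of that claim, assuming a synthetic setup where each patient has a recorded health score (the numbers and model are invented for illustration): once health status is observed, an ordinary classifier answers the conditional question directly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
health = rng.normal(size=n)  # higher = healthier
# Sicker patients are more likely to receive the drug:
treated = (rng.random(n) < 1 / (1 + np.exp(2 * health))).astype(int)
# Assumed ground truth: sickness raises the odds of death, the drug lowers them.
logit_death = -1.0 - 1.5 * health - 0.7 * treated
died = (rng.random(n) < 1 / (1 + np.exp(-logit_death))).astype(int)

model = LogisticRegression().fit(np.column_stack([health, treated]), died)
# Survival probability for one fairly sick patient, with and without the drug:
patient = np.array([[-1.0, 1], [-1.0, 0]])
print(1 - model.predict_proba(patient)[:, 1])  # the drug raises P(survive)
```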
You’re making up excuses. The data is not ‘biased’, it just is, nor is it garbage—it’s not made up, no one is lying or falsifying data or anything like that. If your theory cannot handle clean data from a real-world problem, that’s a big problem (especially if there are more sophisticated alternatives which can handle it).
Biased data is a real thing and this is a great example. No method can solve the problem you’ve given without additional information.
This is not biased data. No one tampered with it. No one preferentially left out some data. There is no Cartesian daemon tampering with you. It’s a perfectly ordinary causal problem for which one has all the available data. If you run a regression on the data, you will get accurate predictions of future similar data—just not what happens when you intervene and realize the counterfactual. You can’t throw your hands up and disdainfully refuse to solve the problem, proclaiming, ‘oh, that’s biased’. It may be hard, and the best available solution may be weak or require strong assumptions, but if that is the case, the correct method should say as much and specify what additional data or interventions would allow stronger conclusions.
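To make that concrete, here is a sketch (using the same invented data-generating process as the example above) in which a regression on treatment alone is an accurate description of the observational data, and simultaneously the opposite of what happens under intervention:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def simulate(n, rng, intervene=False):
    health = rng.normal(size=n)
    if intervene:
        treated = rng.integers(0, 2, size=n)  # do(): assign the drug by coin flip
    else:  # observational: sicker patients get the drug
        treated = (rng.random(n) < 1 / (1 + np.exp(2 * health))).astype(int)
    p_death = 1 / (1 + np.exp(1.0 + 1.5 * health + 0.7 * treated))
    return treated, (rng.random(n) < p_death).astype(int)

rng = np.random.default_rng(1)
treated, died = simulate(100_000, rng)
m = LogisticRegression().fit(treated.reshape(-1, 1), died)
print(m.coef_[0, 0])  # positive: treatment really does predict death in this data

treated_i, died_i = simulate(100_000, rng, intervene=True)
print(died_i[treated_i == 1].mean() - died_i[treated_i == 0].mean())
# negative: under intervention, the drug reduces deaths
```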
I’m not certain why I used the word “bias”. I think what I was getting at is that the data isn’t representative of the population of interest.
Regardless, no other method can solve the problem as specified without additional information, contrary to what you claimed. And with additional information, it’s straightforward prediction again.
That is, condition on their prior health status, not just on the fact that they’ve been given the drug, and weight by the prior probabilities of each health status.
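A sketch of that recipe on the toy counts from the first example, under the strong assumed condition that severity is the only confounder: average the severity-stratified death rates, weighted by each stratum’s share of the population, instead of taking the raw pooled rates.

```python
# Toy counts from above: (treated deaths, treated total, untreated deaths, untreated total)
strata = {"severe": (40, 100, 16, 20), "mild": (2, 20, 10, 100)}
total = sum(tn + un for (_, tn, _, un) in strata.values())  # 240 patients

# P(death | do(drug))    = sum_h P(death | drug, h)    * P(h)
# P(death | do(no drug)) = sum_h P(death | no drug, h) * P(h)
p_do_drug = sum((td / tn) * ((tn + un) / total) for (td, tn, _, un) in strata.values())
p_do_none = sum((ud / un) * ((tn + un) / total) for (_, tn, ud, un) in strata.values())
print(p_do_drug, p_do_none)  # 0.25 vs. 0.45: the drug helps after all
```

This is one version of what the causal-inference literature calls backdoor adjustment, and it is only as good as the assumption that no other confounder is lurking.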
What do you call “solving the problem”?
Any method will output some estimates. Some methods will output better estimates, some worse. As people have pointed out, this was an example of a real problem, and yes, real-life data is usually pretty messy. We need methods which can handle messy data, not ones that work only on spherical cows in a vacuum.