I think it’s way higher. Some examples off the top of my head (with a little reading to confirm details):
Child delivery by doctors in a hospital correlated with puerperal fever. This was refined to a correlation between delivery by someone who had recently performed an autopsy and puerperal fever. Experimentally testing handwashing (though not blinded) confirmed the effect; doctors now wash their hands, and dying in childbirth is much less common.
A student with anemia symptoms turns out to have some strangely shaped blood cells. This initial association is expanded as people look at the blood of other anemia patients and find that several of them also have these elongated red cells. Eventually we get enough of a correlation that we’ve discovered sickle cell anemia.
In fact I would go as far as to say that most of our medical knowledge comes from correlations, often relatively obvious ones like “getting run over by a car increases your chance of death”.
There may still be something here, though: the kinds of studies where correlations turn out to be misleading generally involve much smaller effects, relative to the time spans involved, than the successful examples you give. Can we characterize better this area where correlations are especially suspect?
In fact I would go as far as to say that most of our medical knowledge comes from correlations, often relatively obvious ones like “getting run over by a car increases your chance of death”.
Well, we have to be careful about definitions here. People generally don’t talk about correlations when there is a known underlying mechanism.
I guess technically the phrase should look like this: Correlation by itself without known connecting mechanisms or relationships does not imply causation.
Correlation by itself without known connecting mechanisms or relationships does not imply causation.
The Bayesian approach would suggest that we assign a causation-credence to every correlation we observe. Detecting confounders is of course very important, since it provides you with updates. However, a correlation without known connecting mechanisms does imply causation in a probabilistic sense: it shifts the credence. A Bayesian updater would prefer to talk about credences in causation, which can be shifted up and down. It would be a (sometimes dangerous) simplification to deal, in our map, with discrete values like “just correlation” and “real causation”. Such a simplification may still be useful as an everyday heuristic, but I’d suggest not overgeneralizing it.
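For concreteness, here is a minimal sketch of that kind of update. Every prior and likelihood number in it is a made-up assumption chosen to show the mechanics, not anything derived from the cases above:

```python
# Illustrative Bayesian update of a credence in causation after seeing
# a correlation. Every number below is a made-up assumption.

prior_causal = 0.05         # prior credence that X causes Y
p_corr_if_causal = 0.90     # P(observe this correlation | X causes Y)
p_corr_if_not = 0.20        # P(observe it anyway | confounding, chance, ...)

# Bayes' theorem: P(causal | correlation observed)
evidence = (p_corr_if_causal * prior_causal
            + p_corr_if_not * (1 - prior_causal))
posterior_causal = p_corr_if_causal * prior_causal / evidence

# Credence rises from 0.05 to about 0.19: real evidence, far from proof.
print(f"credence in causation: {prior_causal:.2f} -> {posterior_causal:.2f}")
```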
Correlation by itself without known connecting mechanisms or relationships does not imply causation
This does separate out the “getting run over by a car” case, but it doesn’t handle the handwashing one. Germ theory hadn’t been developed yet, and Semmelweis’s proposed mechanism was both medically unlikely and wrong. With sickle cell anemia it kind of handles it, in that you can think of all sorts of ways weirdly shaped blood cells might be a problem, but I think it’s a stretch to say that the first people looking at the blood and saying “that’s weird, it’s probably the problem” understood the “connecting mechanisms or relationships”.
More generally, correlation is some evidence, and if it’s unexpected someone should probably look more closely to try to understand why we’re seeing it, which generally means some kind of controlled experiment.
Well, to start with, correlation is data. This data might be used to generate hypotheses. Once you have some hypotheses you can start talking about evidence, and yes, correlation can be promoted to the rank of evidence supporting some hypothesis.
I don’t think any of that is controversial. The only point is that pure correlation, without anything else, is pretty weak evidence; that’s all. If you want to use it to generate hypotheses, though, sure, no problems with it whatsoever.
Are you using Semmelweis as an example of the medical community properly assessing and synthesizing data?
I’m using it as an example of a valuable fact about disease being established by correlation.
Your paragraph speaks about correlation providing a hypothesis, while the “fact about disease” was established by an experimental intervention study.
I think we’re getting into a discussion about what it means for something to be established as a fact, which doesn’t sound very useful.
Can we characterize better this area where correlations are especially suspect?
Epidemiological studies of diets (that is, the health consequences of particular patterns of food intake) are all based on correlations, and the great majority of them are junk.
These days epi people mostly use g methods, which are not junk (or rather, they give correct answers given the assumptions they make, and are quite a bit more sophisticated than just using conditional probabilities). How much epi do you know?
edit: Correction: not everyone uses g methods. There is obviously the “changing of the guard” issue. But g methods are very influential now. I also agree there is a lot of junk in data analysis. But I think the “junk” issue is not always (or even usually) due to the fact that the study was “based on correlations” (you are not being precise about what you mean here, but I interpreted you to mean that “people are not using correct methods for getting causal conclusions from observational data”).
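For readers who haven’t seen them: the simplest g method is the g-formula, which standardizes over the confounder distribution instead of merely conditioning on treatment. A minimal sketch, where every probability is a made-up number chosen only to show the mechanics:

```python
# Toy illustration of the g-formula (standardization) versus a naive
# conditional comparison. Every number here is a made-up assumption.

p_l = {"young": 0.7, "old": 0.3}      # P(L = l), the confounder
p_treat = {"young": 0.2, "old": 0.8}  # P(A = treated | L = l)

# P(Y = 1 | A = a, L = l): in this toy world treatment does nothing,
# risk depends only on the confounder L.
risk = {
    ("treated", "young"): 0.1, ("treated", "old"): 0.4,
    ("untreated", "young"): 0.1, ("untreated", "old"): 0.4,
}

def p_a_given_l(a, l):
    return p_treat[l] if a == "treated" else 1 - p_treat[l]

def crude_risk(a):
    """P(Y=1 | A=a): just conditioning, so confounded by L."""
    p_a = sum(p_a_given_l(a, l) * p_l[l] for l in p_l)
    return sum(risk[a, l] * p_a_given_l(a, l) * p_l[l] / p_a for l in p_l)

def g_formula_risk(a):
    """P(Y=1 | do(A=a)) = sum over l of P(Y=1 | A=a, L=l) * P(L=l)."""
    return sum(risk[a, l] * p_l[l] for l in p_l)

print("crude:    ", crude_risk("treated") - crude_risk("untreated"))          # ~ +0.16
print("g-formula:", g_formula_risk("treated") - g_formula_risk("untreated"))  # 0.0
```

With these made-up numbers the treatment does nothing at all, yet the naive conditional comparison shows about a 16-point risk difference; standardizing over L recovers the null.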
Not much. I’ve read a bunch of papers and some critiques… And I’m talking not so much about the methods as about the published claims and conclusions. Sophisticated methods are fine; the issue is their fragility. And, of course, you can’t correct for what you don’t know.
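That last point can be made concrete with a quick simulation sketch: an unmeasured confounder U drives both treatment and outcome, and stratifying on the covariate we did measure does nothing to remove the bias. All effect sizes here are arbitrary choices for illustration:

```python
# Sketch of "you can't correct for what you don't know": U confounds
# treatment A and outcome Y, but only X (independent of U) is measured.
import random

random.seed(0)
rows = []
for _ in range(100_000):
    u = random.random()                               # unmeasured confounder
    x = random.random()                               # measured, but carries no info about U
    a = 1 if random.random() < u else 0               # sicker people get treated more
    y = 1 if random.random() < 0.2 + 0.5 * u else 0   # U drives outcome; A truly does nothing
    rows.append((x, a, y))

def risk_diff(subset):
    treated = [y for _, a, y in subset if a == 1]
    control = [y for _, a, y in subset if a == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

# "Adjusting" for X by stratifying on it changes nothing, because the
# real confounder U was never measured: the spurious effect survives.
lo = [r for r in rows if r[0] < 0.5]
hi = [r for r in rows if r[0] >= 0.5]
print(f"stratum-averaged risk difference: {(risk_diff(lo) + risk_diff(hi)) / 2:.3f}")
# prints roughly +0.17 despite a true causal effect of exactly zero
```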