I’m trying to get at least a vague handle on what I can legitimately infer from what, using data that might, and probably does, contain circular causation. I’m looking for statistical tools that might help me do that. Should I try Bayesian causal inference anyway, just to see what I get? Support vector machines? Markov random fields? Does the Spurious Correlations book have ideas on that? (No, it just seems to be an awesome set of correlations. Thanks, BTW.)
(Also notice that these are not just any correlations. These are the strongest correlations that pertain among a large number of variables relative to each other. I mean, I computed all possible correlations among every combination of 2 variables in hopes that the strongest I find for each variable might show something interesting.)
I’m trying to get at least a vague handle on what I can legitimately infer
That’s not a very well-defined goal. You are engaging in what’s known as a spaghetti factory analysis: make a lot of spaghetti, throw it on the wall, pick the most interesting shapes. This doesn’t tell you anything about the world.
Sure, you can start with correlations. But that’s only a start. Let’s say you’ve got a high correlation between A and B. The next questions should be: Does it make sense? Is there a plausible mechanism underlying this correlation? Is it stable in time? Is it meaningful? And that’s before diving into causality, which correlations won’t help you much with.
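To make the “Is it stable in time?” question concrete, here is a minimal Python sketch, assuming the two series are time-ordered and of equal length. The helper names (`pearson`, `split_half_correlations`) and the synthetic data are illustrative assumptions, not anything taken from the dataset under discussion. It simply compares the correlation in the earlier half of the sample with the correlation in the later half.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def split_half_correlations(xs, ys):
    """Correlation in the earlier and in the later half of a time-ordered sample."""
    mid = len(xs) // 2
    return pearson(xs[:mid], ys[:mid]), pearson(xs[mid:], ys[mid:])

# Made-up data: B tracks A during the first 100 periods and is pure noise afterwards.
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(200)]
b = [a[t] + 0.5 * rng.gauss(0, 1) if t < 100 else rng.gauss(0, 1)
     for t in range(200)]

early, late = split_half_correlations(a, b)
print(f"early r = {early:.2f}, late r = {late:.2f}")
```

A correlation that is strong over the whole sample but collapses in one half is a hint that it may not reflect a stable relationship. A two-way split is only the crudest version of this check; rolling windows would be the natural next step.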
You still need a better goal of the analysis.
Should I try Bayesian causal inference anyway, just to see what I get? Support vector machines? Markov random fields?
Nooooo! You don’t understand basic stats; trying to (mis)use complicated tools will just let you confuse yourself more thoroughly.
Sure, I can always offer my own interpretations, but the whole idea was to minimize that as much as possible. I can rationalize anything. Watch: Milk consumption is negatively correlated with income inequality. Drinking less milk leads to stunted intelligence, resulting in a rise in income inequality. Or income inequality leads to a drop in milk consumption among poor families. Or the alien warlord Thon-Gul hates milk and equal incomes.
What conditions must my goal satisfy in order to qualify as a “well-defined goal”? Have I made any actual (meaning technical) mistakes so far? (Anyway, thanks for reminding me to check for temporal stability. I should write a script to scrape the data off PDFs. (Never mind, I found a library.))
the whole idea was to minimize that as much as possible
I believe this idea to be misguided. The point of the process is to understand. You can’t understand without “interpretation”—looking for just the biggest numbers inevitably leads you astray.
The issue isn’t what you can rationalize—“don’t be stupid” is still the baseline, level zero criterion.
What conditions must my goal satisfy in order to qualify as a “well-defined goal”?
A specification of what kind of answers will be acceptable and what kind will not.
Have I made any actual (meaning technical) mistakes so far?
Are you asking whether your spaghetti factory mixes flour and water in the right ratio?
Not being stupid is an admirable goal, but it’s not well-defined. I tried Googling “spaghetti factory analysis” and “spaghetti factory analysis statistics” for more information, but it’s not turning up anything. Is there a standard term for the error you are referring to?
Can’t I have my common sense, but make all possible comparisons anyway just to inform my common sense as to the general directions in which the winds of evidence are blowing?
I don’t see how informing myself of correlations harms my common sense in any way. The only alternative I can think of is to stick to my prejudices, except that whenever some doubt arises as to which of my prejudices has the stronger claim, I should thoroughly investigate real-world data to settle the dispute between the two. As soon as that process is over, I should stop immediately because nothing else matters. Is that the course of action you recommend?
Not being stupid is an admirable goal, but it’s not well-defined.
It’s not a goal. It is a criterion you should apply to the steps which you intend to take. I admit to it not being well-defined :-)
Is there a standard term for the error you are referring to?
In statistics that used to be called “data mining” and was a bad thing. Data science repurposed the term and it’s now a good thing :-/ Andrew Gelman calls a similar phenomenon the “garden of forking paths” (see e.g. here).
Basically the problem is paying attention to noise.
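A small simulation may make “paying attention to noise” tangible. The sketch below is an illustration under assumed numbers (50 variables, 30 observations each, all independent random draws), not a reconstruction of the analysis being discussed. It computes every pairwise correlation among variables that are unrelated by construction and then picks the strongest one, which is the same selection step as keeping the top correlation per variable.

```python
import itertools
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Assumed sizes for illustration: 50 unrelated variables, 30 observations each.
rng = random.Random(1)
n_vars, n_obs = 50, 30
noise = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

# Correlate every pair of variables: 50 * 49 / 2 = 1225 comparisons.
correlations = {
    (i, j): pearson(noise[i], noise[j])
    for i, j in itertools.combinations(range(n_vars), 2)
}

# Keep only the most extreme result, as a "strongest correlations" search would.
best_pair = max(correlations, key=lambda pair: abs(correlations[pair]))
strongest = correlations[best_pair]
print(f"{len(correlations)} pairs, strongest |r| = {abs(strongest):.2f} "
      f"between variables {best_pair}")
```

With this many comparisons and this few observations, the winning pair will typically show a correlation that looks substantial even though no relationship exists. That is the sense in which the top of a ranked list of correlations says little on its own.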
Can’t I have my common sense, but make all possible comparisons anyway
You can. It’s just that you shouldn’t attach undue importance to which comparison came first and which second. You’re generating estimates and at the very minimum you should also be generating what you think are the errors of your estimates—these should be helpful in establishing how meaningful your ranking of all the pairs is.
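One common way to attach an error estimate to a correlation is the Fisher z-transformation. The sketch below is a rough illustration, assuming approximately bivariate-normal data and independent observations; the function names and the example values (r = 0.55 and r = 0.45 from 30 observations each) are made up for the demonstration.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson correlation.

    r is the sample correlation and n the number of paired observations (n > 3).
    Uses the Fisher transformation z = atanh(r) with standard error 1/sqrt(n - 3).
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical first- and second-ranked pairs, each based on 30 observations.
first = fisher_ci(0.55, 30)
second = fisher_ci(0.45, 30)
print(first, second, intervals_overlap(first, second))
```

For 30 observations, a sample correlation of 0.5 comes with an interval of roughly 0.17 to 0.73, so two pairs whose correlations differ by 0.1 cannot really be ranked against each other. Note also that an interval computed for a pair chosen because it was the strongest of many is optimistic, since the selection itself is not accounted for.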
And you still need to define a goal. For example, a goal of explanation/understanding is different from the goal of forecasting.
I’m not telling you to ignore the data. I’m telling you to be sceptical of what the data is telling you.
Thank you! Those data mining algorithms are exactly what I was looking for.
(Personally, I would describe the situation you are warning me against as reducing it “more than is possible” rather than “as much as possible”. I am definitely in favor of using common sense.)