By “real correlation” I mean a correlation that is not simply an artifact of your statistical analysis, but is actually “present in the data”, so to speak.
I still don’t understand in which sense you use the word “real” in ‘correlation is “real”’.
Let’s say you have two time series, each 100 data points long. You calculate their correlation, say, Pearson’s correlation. It’s a number. In which sense can that number be “real” or “not real”?
Do you implicitly have in mind the sampling theory where what you observe is a sample estimate and what’s “real” is the true parameter of the unobserved underlying process? In this case there is a very large body of research, mostly going by the name of “frequentist statistics”, about figuring out what your sample estimate tells you about the unobserved true value (to call which “real” is a bit of a stretch, since normally it is not real).
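To make the sampling-theory reading concrete, here is a minimal sketch of one standard frequentist tool, the Fisher z-transform confidence interval for a Pearson correlation. The function name `fisher_ci` and the sample value r = 0.30 are made up for illustration, and the interval assumes independent, roughly bivariate-normal observations, which time series often violate.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for the true correlation,
    given a sample Pearson r from n observations (Fisher z-transform)."""
    z = math.atanh(r)                 # map r onto an approximately normal scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Hypothetical example: a sample correlation of 0.30 from 100 data points.
low, high = fisher_ci(0.30, 100)
print(f"sample r = 0.30, interval for the true value: ({low:.2f}, {high:.2f})")
```

The interval is a statement about the unobserved parameter, not about the sample number itself, which is the distinction being drawn above.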
It seems as though my attempts to define my term intensionally aren’t working, so I’ll try and give an extensional definition instead:
An example would be that site you linked earlier. Those quantities appear to be correlated, but the correlations are not “real”.
So you are using “real” in the sense of “matching my current ideas of what’s likely”. I think this approach is likely to… lead to problems.
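Er… no. Okay, look, here’s the definition I provided in an earlier comment:
By “real correlation” I mean a correlation that is not simply an artifact of your statistical analysis, but is actually “present in the data”, so to speak.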
You seemed to understand this well enough to engage with it, even going so far as to ask me how I would distinguish between the two (answer: redundancy), but now you’re saying that I’m using “real” to mean “matching my current ideas of what’s likely”? If there’s something in the quote that you don’t understand, please feel free to ask, but right now I’m feeling a bit bewildered by the fact that you seem to have entirely forgotten that definition.
See also: spurious correlation.
Sigh.
All measured correlations are “actually present in the data”. If you take two data series and calculate their correlation, you get a number. This measured (or sample) correlation is certainly real and not fake. The question is what it represents.
You claim the ability to decide, on a basis completely unclear to me, that sometimes this measured correlation represents something (and then you call it “real”) and sometimes it represents nothing (and then you call it “not real”). “Redundancy” is not an adequate answer, because all it means is that you will re-measure your sample and, not surprisingly, get similar results, since it’s still the same data. As an example of a “not real” correlation you offered the graphs from the linked page, but I see no reason for you to declare them “not real” other than that they do not look likely to you.
Depending on which statistical method you use, the number you calculate may not be the number you’re looking for, or the number you’d have gotten had you used some other method. If you don’t like my use of the word “real” to denote this, feel free to substitute some other word: “representative”, maybe. By “redundancy” I’m not referring to the act of analyzing the data multiple times; I’m referring to using multiple methods to do so and seeing if you get the same result each time (possibly checking with a friend or two in the process).
No, I am declaring the graphs from the linked page “not real” because they were calculated using a statistical method widely regarded as suspect. This suspect method is known to produce correlations that are called “spurious”, and my link in the grandparent comment was to this method’s Wikipedia page. I’m not sure whether you thought the link I provided led to the original page you linked, but since you made no mention of “spurious correlations” (the method, not the page), I thought I’d mention it again.
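One way to picture the “multiple methods” check is the sketch below. It is an illustration under assumptions, not the method behind the linked graphs: it generates pairs of independent random walks, 100 points each, and compares Pearson’s correlation computed on the raw levels with the same statistic computed on first differences. All function names are hypothetical.

```python
import random

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def random_walk(n, rng):
    """Cumulative sum of n independent standard-normal steps."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        path.append(level)
    return path

def diff(series):
    """First differences: the step-to-step changes of a series."""
    return [b - a for a, b in zip(series, series[1:])]

def mean_abs_corr(trials=500, n=100, seed=0):
    """Average |r| over many pairs of independent walks, on levels and on differences."""
    rng = random.Random(seed)
    levels = diffs = 0.0
    for _ in range(trials):
        x, y = random_walk(n, rng), random_walk(n, rng)
        levels += abs(pearson(x, y))
        diffs += abs(pearson(diff(x), diff(y)))
    return levels / trials, diffs / trials

levels_r, diffs_r = mean_abs_corr()
print(f"mean |r| on levels:      {levels_r:.2f}")
print(f"mean |r| on differences: {diffs_r:.2f}")
```

The two series in each pair are unrelated by construction, yet the correlation on levels is typically far larger in magnitude than the one on differences. When two reasonable methods disagree this sharply, that is a hint that the first number reflects the method (here, shared trending behaviour) rather than any relationship between the series.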