Intuitively, the existence of categories at 2% and 3% makes the conclusion clear. If the 1995 data isn’t made up, then it is very rare that a project falls into one of these categories at all—respectively 1⁄50 and 1⁄30 chances. So the chance that our small sample of 9 projects happens to contain one each of these kinds of projects is very small to start with, about 9⁄150.
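As a rough check of that back-of-the-envelope figure, here is a short Python sketch that treats the 2% and 3% funding shares as if they were per-project probabilities—an assumption, and exactly the one questioned in the reply below:

```python
# Sketch: probability that a sample of 9 projects contains at least one
# project from a 1/50 category and at least one from a 1/30 category,
# treating the funding shares as per-project probabilities (an assumption).
p_a, p_b, n = 0.02, 0.03, 9

# Inclusion-exclusion: 1 - P(no A) - P(no B) + P(neither A nor B)
p_both = 1 - (1 - p_a) ** n - (1 - p_b) ** n + (1 - p_a - p_b) ** n
print(p_both)  # ~0.036, same order of magnitude as the ~9/150 estimate above
```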
Given that we know nothing about how the projects themselves were distributed between the categories, we can’t actually say this with any confidence. It’s possible, for example, that the 2% category actually receives many projects on average, but they’re all cheap.
If you assume that the project costs are normally distributed, then that assumption makes the 1979 data inherently unlikely, no matter how close the percentages are to 1995: the existence of a category receiving 2% of the funding means that at best you have a data point which is only 18% of the mean (and another point at 27%). That just doesn’t happen for normal distributions (unless the variance is so large that the model becomes ridiculous anyway, due to the huge probability of it giving you negative numbers).
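To put rough numbers on that, here is a small sketch (standard library only, with illustrative coefficients of variation—neither report gives them) showing how a normal model for project cost treats a point at 18% of the mean, and how much mass it puts on negative costs:

```python
from statistics import NormalDist

# Illustration of the normality argument: if project cost X ~ N(mu, (c*mu)^2)
# for a few assumed coefficients of variation c, how likely is a project at
# 18% of the mean, and how much probability falls below zero?
mu = 1.0  # units don't matter; everything scales with the mean
for c in (0.2, 0.3, 0.5, 0.8):
    d = NormalDist(mu, c * mu)
    print(f"c={c}: P(X <= 0.18*mu) = {d.cdf(0.18 * mu):.4f}, "
          f"P(X < 0) = {d.cdf(0.0):.4f}")
# A small c makes a point at 18% of the mean very unlikely; a c large enough
# to make it unremarkable also puts non-trivial mass on negative costs.
```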
It’s actually quite plausible that cheaper projects have a greater chance of falling into the rare category of successful projects, as the original 1979 document defined success—“used without extensive rework”. It’s also quite possible that project size isn’t normally distributed.
What I seem to have trouble conveying is my intuition that the fit is too close to be true—that in general if you have a multinomial distribution with five categories, and you draw a small sample from that distribution, it is quite unlikely that your sample frequencies will come within 1% of the true probabilities.
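That intuition can be checked by simulation (numpy sketch below). The category probabilities are hypothetical, not the actual 1979/1995 figures, and the sample size is illustrative: with only 9 projects the sample frequencies are multiples of 1/9, so a within-1% match to values like 2% is essentially impossible, and even at n = 100 it is rare.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical category probabilities and an illustrative sample size.
p = np.array([0.02, 0.03, 0.20, 0.30, 0.45])
n, trials = 100, 200_000

counts = rng.multinomial(n, p, size=trials)
freqs = counts / n
all_within_1pct = np.all(np.abs(freqs - p) <= 0.01, axis=1)
print(all_within_1pct.mean())  # small: a close fit on every category is the exception
```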
The chi-squared test, if I’ve understood it correctly, computes the converse probability—the probability that your sample contains frequencies that are this far removed or more from the true probabilities, given the assumption that it’s drawn from a distribution with those probabilities. In the case that concerns me the chi-square is obviously very small, so that the p-value approaches unity.
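For what it’s worth, the mechanics look like this on made-up numbers (hypothetical shares treated as counts per 100 units, purely to show the shape of the test with scipy; these are not the real 1979 or 1995 figures):

```python
from scipy.stats import chisquare

# Made-up "observed" category shares that sit very close to the expected
# shares, treated as counts per 100 units.  Not the real 1979/1995 figures.
expected = [2, 3, 20, 30, 45]
observed = [2.5, 3.5, 19, 30.5, 44.5]   # deliberately an extremely close fit

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(stat)  # ~0.27 on 4 degrees of freedom
print(p)     # ~0.99: "a deviation at least this large" is almost certain,
             # which is exactly why a p-value this high looks too good
```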
What I’m saying—and it may be a crazy thing to say—is that it’s precisely this small distance from the true probabilities that makes me suspicious.
I realize what you’re getting at, and it is suspicious, I’m just saying that the probabilities you’re trying to calculate for it aren’t correct.
I’m also not sure what your alternate hypotheses are. There’s no way that the 1979 data was fabricated to fit the 1995 percentages, is there? So any argument that casts doubt on the 1979 data being possible to begin with is going to penalize all possible alternate hypotheses. That’s the problem with the normality assumption: assuming a normal distribution with any true mean makes the 1979 data unlikely, whether or not the percentages are suspiciously close.
I’ve just come across a more technical explanation than usual of “The Mendel-Fisher Controversy” which frames it as having been about formalizing an intuition of data “too good to be true” using chi-squared.
It is less well known, however, that in 1936, the great British statistician and biologist R. A. Fisher analyzed Mendel’s data and found that the fit to Mendel’s theoretical expectations was too good (Fisher 1936). Using χ2 analysis, Fisher found that the probability of obtaining a fit as good as Mendel’s was only 7 in 100,000. (source)
Incidentally a very high P (>0.9) is suspicious, as it means that the results are just too good to be true! This suggests that there is some bias in the experiment, whether deliberate or accidental.
And this PDF or this page says pretty much the same.

So, ISTM, gwern’s analysis here leads to the “too good to be true” conclusion.
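Fisher’s “too good” probability is just the left tail of the same chi-squared distribution: the chance, under honest sampling, of agreement at least as close as observed. With the made-up statistic from the earlier sketch it would look like this:

```python
from scipy.stats import chi2

# The "fit too good to be true" number is the *left* tail: the probability
# of a chi-squared statistic at least this small under the null.  The 0.27
# on 4 degrees of freedom is the made-up figure from the previous sketch;
# Fisher's 7-in-100,000 is this same quantity for Mendel's aggregated data.
stat, df = 0.27, 4
print(chi2.cdf(stat, df))   # ~0.008: under the null, a fit this close is rare
```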
There’s no way that the 1979 data was fabricated to fit the 1995 percentages, is there?
No, I’m quite confident the 1979 document is genuine (call it 100% minus a hair). Just what the data represents is something else again—by the authors’ own admission they worked with a biased sample.
The 1995 sample, assuming it is genuine, is quite unbiased—since it is (claimed to be) the entire population.
I’m also not sure what your alternate hypotheses are.
To me it seems quite likely that the 1995 “results” are artifactual: my main theory is that someone heard an oral presentation from the person cited as the author, conflated that presentation in their mind with the 1979 data, and a few years later presented a chimera of the two, attributing it to the speaker. Later authors just copied and pasted the claim and reference, neglecting to fact-check it.
the probabilities you’re trying to calculate for it aren’t correct
I’m willing to accept that. But if we agree that the close fit is suspicious, then I would hazard that we have some mathematical background for that intuition, and if so there must be at least some way of formalizing that intuition which is better than saying “I just don’t know”.
Conversely, if that intuition is in fact ungrounded (perhaps for the same reason we call a winning lottery draw “too improbable to be a coincidence” when it pattern-matches something significant to us, like a birth date), there should be a way of formalizing that.