I don’t know how or when to use a chi-squared test. What I did was assume—for the sake of checking my intuition—that the two sets of frequencies were indeed not made up.
It’s the usual go-to frequentist test for comparing two sets of categorical data. You say you have 4 categories with 10/4/9/3 members and you have your null hypothesis, and you’re interested in how often, assuming the null, results as extreme or more extreme than your new data of 200/80/150/20 would appear. Like rolling a biased 4-sided die.
(If you’re curious, that specific made up example would be chisq.test(matrix(c(10,4,9,3,200,80,150,20), ncol = 2)) with a p-value of 0.4.)
The 1995 “study” has a sample size of $37Bn—this in fact turns out to match estimates of the entire DoD spend on IT projects in that year. So if these numbers are correct, then the frequencies must be precisely the probabilities for any given project to fall into the buckets A, B, C, D or E. What I did next was work out some reasonable assumptions for the 1979 set of frequencies. It is drawn from a sample of 9 projects totaling $6.8M, so the mean project cost in the sample is about $755K, and knowing a few other facts we can compute a lower bound for the standard deviation of the sample.
This seems like a really weird procedure. You should be looking at the frequencies of each of the categories, not messing around with means and standard deviations. (I mean heck, just what about 2 decades of inflation or military growth or cutbacks?) What, you think that the 1995 data implies that the Pentagon had $37bn/$755K=49006 different projects?
I don’t know Python or NumPy and your formatting is messed up, so I’m not sure what exactly you’re doing. (One nice thing about using precanned routines like R’s chisq.test: at least it’s relatively clear what you’re doing.)
But we also find an earlier (1979) study, with a more credible primary source. Its five categories are labeled exactly the same; its sample size is much smaller: 9 projects for $7 million total. The allocation is nearly the same: A: 47%, B: 29%, C: 19%, D: 3%, E: 2%.
Looking closer, I’m not sure this data makes sense. 0.02 * 9 is… 0.18. Not a whole number. 47% * 9 is 4.23. Also not a positive integer or zero. 0.29 * 9 is 2.61.
Sure, the percentages do sum to 100%, but D and E aren’t even possible: 1⁄9 = 11%!
Looking closer, I’m not sure this data makes sense. 0.02 * 9 is… 0.18. Not a whole number.
Basically, that’s you saying exactly what is making me say “the coincidence is implausible”. A sample of 9 will generally not contain an instance of something that comes up 2% of the time. Even more seldom will it contain that and an instance of something that comes up 3% of the time.
So, in spite of appearances, it seems as if our respective intuitions agree on something. Which makes me even more curious as to which of us is having a clack and where.
No, my point there was that in a discrete sample of 9 items, 2% simply isn’t possible. You jump from 1⁄9 (11%) straight to 0⁄9 (0%). But you then explained this impossibility as being the percentage of the total budget of all sampled projects that could be classified that way, which doesn’t make the percentage mean much to me.
The proportions are by cost, not by counts. The 2% is one $118K project, which works out to 1.7% of the $6.8M total, rounded up to 2%.
So you don’t even know how many projects are in each category for the original study?
Nope, aggregates is all we get to work with, no raw data.
Yeah, I don’t think you can do anything with this sort of data. And even if you had more data, I’m not sure whether you could conclude much of anything—almost identical percentages are always going to be highly likely, even if you go from a sample of 9 to a sample of 47000 or whatever. I’ll illustrate. Suppose that instead of being something useless like fraction of expenditure, your 1970s datapoint was exactly 100 projects, 47 of which were classified A, 29 of which were classified B, etc. (we interpret the percentages as frequencies and don’t get any awkward issues of “the average person has 1.9 arms”); and suppose we took the mean project cost and assumed the $37bn 1995 datapoint had the same mean per project, so we could estimate it as a sample of roughly 49,000 projects, making the second sample 490 times bigger (49k / 100). So when we look at A being 47% in the first sample we have n=47 projects, but when we look at A being 46% in the second sample, we this time have an n of 46*490=22540 projects. Straightforward enough, albeit an exercise in making stuff up.
So, with a sample 490 times larger, does differing by a percent or two offer any reason to reject the null that they have the same underlying distributions? No, because they’re still so similar:
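(The numbers actually run aren’t reproduced in this transcript; here is a sketch in R of the comparison described, with the 1979 shares read as counts out of 100 projects, and with the second sample’s non-A shares assumed for illustration, since only A = 46% is quoted above.)
s1979 <- c(A = 47, B = 29, C = 19, D = 3, E = 2)   # "first sample": the 1979 shares read as 100 projects
s1995 <- round(49000 * c(A = .46, B = .30, C = .19, D = .03, E = .02))   # assumed shares for ~49,000 projects
chisq.test(cbind(s1979, s1995))   # p-value comes out very close to 1: no reason at all to reject the null
# (R also warns that some expected counts are below 5, the same fine print noted in the ETA2 further down.)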
Yeah, I don’t think you can do anything with this sort of data.
I don’t see why I should give up just because what I’ve got isn’t convenient to work with. The data is what it is, I want to use it in a Bayesian update of my prior probabilities that the 1995 data is kosher or made up.
Intuitively, the existence of categories at 2% and 3% makes the conclusion clear. If the 1995 data isn’t made up, then it is very rare that a project falls into one of these categories at all—respectively 1⁄50 and 1⁄30 chances. So the chance that our small sample of 9 projects happens to contain one each of these kinds of projects is very small to start with, about 9⁄150. Immediately this is strong Bayesian evidence against the null hypothesis.
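(For concreteness, that back-of-the-envelope number can be computed exactly, under the same contested reading of the 2% and 3% cost shares as per-project probabilities:)
pD <- 0.03; pE <- 0.02; n <- 9
# inclusion-exclusion over "no project of the 3% kind" and "no project of the 2% kind":
1 - (1 - pD)^n - (1 - pE)^n + (1 - pD - pE)^n   # about 0.036, the same ballpark as the rough 9/150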
Do you disagree?
My more elaborate procedure is only trying to refine this judgment by taking into account the entire joint probability distribution and trying to “hug the query” as much as possible. With the simulation I can not only pinpoint how astronomically unlikely the coincidence is, but also tell you how much “slop” in categories would be plausible. (If you look for a match within 5% rather than within 1%, then the probability of a coincidence rises to less-than-significant.)
I don’t have to assume anything at all about the 1995 data (such as how many projects it represents), because as I’ve stated earlier $37B is the entire DoD spend in that year—if the data isn’t made up then it amounts to an exhaustive survey rather than a sampling, and thus the observed frequencies are population frequencies. I treat the 1995 data as “truth”, and only need to view the 1979 as a sampling procedure.
Here is a corrected version of the code. I’ve also fixed the SD of the sample, which I miscalculated the first time around.
(My reasoning is as follows: assume the costs of the projects are drawn from a normal distribution. Then we already know the mean ($6.8M / 9 = $755K), and we know that one project cost $119K and another $198K (accounting for the 2% and 3% categories respectively), so the “generous” assumption is that the other 7 projects were all the same size ($926K), giving us the tightest normal possible.)
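(The corrected code itself isn’t reproduced in this transcript. What follows is only a sketch in R of the bound and of the kind of simulation being described, under the stated assumptions; the category probabilities and target shares are stand-ins, since the excerpt gives the 1979 allocation but not the exact 1995 figures.)
# the "tightest normal": $6.8M over 9 projects, two pinned at $119K and $198K, the other seven equal (figures in $K)
costs0 <- c(119, 198, rep((6800 - 119 - 198) / 7, 7))
mean(costs0)                           # ~756
sqrt(mean((costs0 - mean(costs0))^2))  # ~320, the lower bound on the spread (sd(costs0) gives ~339)

# one way to set up the simulation: draw 9 costs, assign categories, compare cost shares to the target
set.seed(1)
target <- c(A = .47, B = .29, C = .19, D = .03, E = .02)   # 1979 cost shares
probs  <- target                                           # stand-in for the 1995 "population" probabilities
hits <- replicate(1e5, {
  costs <- rnorm(9, mean = 6800 / 9, sd = 320)
  costs[costs < 0] <- 0                                    # negative draws are rare with these parameters; clamp them
  cats  <- sample(names(target), 9, replace = TRUE, prob = probs)
  share <- tapply(costs, factor(cats, levels = names(target)), sum) / sum(costs)
  share[is.na(share)] <- 0
  all(abs(share - target) <= 0.01)                         # every cost share within 1 percentage point?
})
mean(hits)   # very small with these settings; a 5-point tolerance is far easier to satisfy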
I don’t see why I should give up just because what I’ve got isn’t convenient to work with. The data is what it is, I want to use it in a Bayesian update of my prior probabilities that the 1995 data is kosher or made up.
Well heck, no one can stop you from intellectual masturbation. Just because it emits nothing anyone else wants to touch is not a reason to avoid doing it.
But you’re working with made up data, the only real data is a high level summary which doesn’t tell you what you want to know, you have no reasonably defined probability distribution, no defensible priors, and you’re working towards justifying a conclusion you reached days ago (this exercise is a perfect example of motivated reasoning: “I dislike this data, and it turns out I am right since some of it was completely made up, and now I’m going to prove I’m extra-right by exhibiting some fancy statistical calculations involving a whole bunch of buried assumptions and choices which justify the already written bottom line”).
My more elaborate procedure is only trying to refine this judgment by taking into account the entire joint probability distribution and trying to “hug the query” as much as possible. With the simulation I can not only pinpoint how astronomically unlikely the coincidence is, but also tell you how much “slop” in categories would be plausible. (If you look for a match within 5% rather than within 1%, then the probability of a coincidence rises to less-than-significant.)
I’ve already pointed out that under a reasonable interpretation of the imaginary data, the observed frequencies are literally the most likely outcome. Would your procedure make any sense if run on, say, lottery tickets?
I don’t have to assume anything at all about the 1995 data (such as how many projects it represents), because as I’ve stated earlier $37B is the entire DoD spend in that year—if the data isn’t made up then it amounts to an exhaustive survey rather than a sampling, and thus the observed frequencies are population frequencies...My reasoning is as follows: assume the costs of the projects are drawn from a normal distribution.
As I said. Assumptions.
Here is a corrected version of the code. I’ve also fixed the SD of the sample, which I miscalculated the first time around.
Although it’s true that even if you make stuff up and choose to interpret things weirdly in order to justify the conclusion, the code should at least do what you wanted it to.
Do you disagree that the presence in a small sample of two instances of very rare species constitutes strong prima facie evidence against the “coincidence” hypothesis?
I’ve already pointed out that under a reasonable interpretation of the imaginary data, the observed frequencies are literally the most likely outcome. Would your procedure make any sense if run on, say, lottery tickets?
I don’t know what you mean by the above, despite doing my best to understand. My intuition is that “the most likely outcome” is one in which our 9-project sample will contain no project in either of the “very rare” categories, or at best will have a project in one of them. (If you deal me nine poker hands, I do not expect to see three-of-a-kind in two of them.)
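(The poker aside checks out, for what it’s worth, treating the nine hands as independent deals and using the standard frequency of exactly three-of-a-kind in a 5-card hand:)
p3 <- choose(13, 1) * choose(4, 3) * choose(12, 2) * 4^2 / choose(52, 5)   # ~0.021 per hand
1 - (1 - p3)^9 - 9 * p3 * (1 - p3)^8   # ~0.015: three-of-a-kind in two or more of nine hands is indeed rare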
I didn’t understand your earlier example using chi-squared, which is what I take you to mean by “already pointed out”. You made up some data, and “proved” that chi-squared failed to reject the null when you asked it about the made-up data. You assumed a sample size of 100, when the implausibility of the coincidence hypothesis comes precisely from the much smaller sample size (plus the existence of “rare” categories and the overall number of categories).
a perfect example of motivated reasoning
I’m experiencing it as the opposite—I already have plenty of reasons to conclude that the 1995 data set doesn’t exist, I’m trying to give it the maximum benefit of doubt by assuming that it does exist and evaluating its fit with the 1979 data purely on probabilistic merits.
(ETA: what I’m saying is, forget the simulation, on which I’m willing to cop to charges of “intellectual masturbation”. Instead, focus on the basic intuition. If I’m wrong about that, then I’m wrong enough that I’m looking forward to having learned something important.)
(ETA2: the fine print on the chi-square test reads “for the chi-square approximation to be valid, the expected frequency should be at least 5”—so in this case the test may not apply.)
Do you disagree that the presence in a small sample of two instances of very rare species constitutes strong prima facie evidence against the “coincidence” hypothesis?
Why is coincidence a live hypothesis here? Surely we might expect there to be some connection—the numbers are ostensibly about the same government in the same country in different time periods. This is another example of what I mean when I say you are making a ton of assumptions: you have not defined what parameters or distributions or sets of models you are working with. This is simply not a well-defined problem so far.
I didn’t understand your earlier example using chi-squared, which is what I take you to mean by “already pointed out”. You made up some data, and “proved” that chi-squared failed to reject the null when you asked it about the made-up data. You assumed a sample size of 100, when the implausibility of the coincidence hypothesis comes precisely from the much smaller sample size (plus the existence of “rare” categories and the overall number of categories).
And as I mentioned, I could do no other because the percentages simply cannot work as frequencies appropriate for any discrete tests with a specific sample of 9. I had to inflate to a sample size of 100 so I could interpret something like 2% as meaning anything at all.
What I mean by “coincidence” is “the 1979 data was obtained by picking at random from the same kind of population as the 1995 data, and the close fit of numbers results from nothing more sinister than an honest sampling procedure”.
You still haven’t answered a direct question I’ve asked three times—I wish you would shit or get off the pot.
(ETA: the 1979 document actually says that the selection wasn’t random: “We identified and analyzed nine cases where software development was contracted for with Federal funds. Some were brought to our attention because they were problem cases.”—so that sample would have been biased toward projects that turned out “bad”. But this is one of the complications I’m choosing to ignore, because it weighs on the side where my priors already lie—that the 1995 frequencies can’t possibly match the 1979 ones that closely unless the 1995 figures are a textual copy of the 1979 ones. I’m trying to be careful that all the assumptions I make, when I find I have to make them, work against the conclusion I suspect is true.)
What I mean by “coincidence” is “the 1979 data was obtained by picking at random from the same kind of population as the 1995 data,
What population is that?
You still haven’t answered a direct question I’ve asked three times—I wish you would shit or get off the pot.
You are not asking meaningful questions, you are not setting up your assumptions clearly. You are asking me, directly, “Is bleen more furfle than blaz, if we assume that quux>baz with a standard deviation of approximately quark and also I haven’t mentioned other assumptions I have made?” Well, I can answer that quite easily: I have no fucking idea, but good luck finding an answer.
While we are complaining about not answering, you have not answered my questions about coin flipping or about lotteries.
you have not answered my questions about coin flipping or about lotteries.
(You didn’t ask a question about coin flipping. The one about lotteries I answered: “I don’t know what you mean”. Just tying up any loose ends that might be interpreted as logical rudeness.)
Answered already—if the 1995 data set exists, then it pretty much has to be a survey of the entire spend of the US Department of Defense on software projects; a census, if you will. (Whether that is plausible or not is a separate question.)
You are not asking meaningful questions
Okay, let me try another one then. Suppose we entered this one into PredictionBook: “At some point before 2020, someone will turn up evidence such as a full-text paper, indicating that the 1995 Jarzombek data set exists, was collected independently of the 1979 GAO data set, and independently found the same frequencies.”
What probability would you assign to that statement?
I’m not trying to set up any assumptions, I’m just trying to assess how plausible the claim is that the 1995 data set genuinely exists, as opposed to its being a memetic copy of the 1979 study. (Independently even of whether this was fraud, plagiarism, an honest mistake, or whatever.)
What probability would you assign to that statement?
Very low. You’re the only one that cares, and government archives are vast. I’ve failed to find versions of many papers and citations I’d like to have in the past.
Intuitively, the existence of categories at 2% and 3% makes the conclusion clear. If the 1995 data isn’t made up, then it is very rare that a project falls into one of these categories at all—respectively 1⁄50 and 1⁄30 chances. So the chance that our small sample of 9 projects happens to contain one each of these kinds of projects is very small to start with, about 9⁄150.
Given that we know nothing about how the projects themselves were distributed between the categories, we can’t actually say this with any confidence. It’s possible, for example, that the 2% category actually receives many projects on average, but they’re all cheap.
If you assume that the project costs are normally distributed, then that assumption makes the 1979 data inherently unlikely, no matter how close the percentages are to 1995: the existence of a category receiving 2% of the funding means that at best you have a data point which is only 18% of the mean (and another point at 27%). That just doesn’t happen for normal distributions (unless the variance is so large that the model becomes ridiculous anyway, due to the huge probability of it giving you negative numbers).
It’s actually quite plausible that cheaper projects have a greater chance of falling into the rare category of successful projects, as the original 1979 study defined success—“used without extensive rework”. It’s also quite possible that project size isn’t normally distributed.
What I seem to have trouble conveying is my intuition that the fit is too close to be true—that in general if you have a multinomial distribution with five categories, and you draw a small sample from that distribution, it is quite unlikely that your sample frequencies will come within 1% of the true probabilities.
The chi-squared test, if I’ve understood it correctly, computes the converse probability—the probability that your sample contains frequencies that are this far removed or more from the true probabilities, given the assumption that it’s drawn from a distribution with those probabilities. In the case that concerns me the chi-square is obviously very small, so that the p-value approaches unity.
What I’m saying—and it may be a crazy thing to say—is that it’s precisely this small distance from the true probabilities that makes me suspicious.
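(One way to put a number on that intuition: with n = 9 the count-based sample frequencies can only be multiples of 1/9, so “within 1%” of a share like 47% is impossible outright; the sketch below therefore uses n = 100 as an illustrative stand-in, and even then a uniformly close fit is rare.)
set.seed(1)
p <- c(A = .47, B = .29, C = .19, D = .03, E = .02)    # treated as the "true" probabilities
draws <- rmultinom(1e5, size = 100, prob = p)          # 100,000 simulated samples of 100 projects each
close <- apply(abs(draws / 100 - p) <= 0.01, 2, all)   # all five sample frequencies within 1 point of p?
mean(close)                                            # on the order of 1%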
I realize what you’re getting at, and it is suspicious, I’m just saying that the probabilities you’re trying to calculate for it aren’t correct.
I’m also not sure what your alternate hypotheses are. There’s no way that the 1979 data was fabricated to fit the 1995 percentages, is there? So any argument that casts doubt on the 1979 data being possible to begin with is going to penalize all possible alternate hypotheses. That’s the problem with the normality assumption: assuming a normal distribution with any true mean makes the 1979 data unlikely, whether or not the percentages are suspiciously close.
I’ve just come across a more technical explanation than usual of “The Mendel-Fisher Controversy” which frames it as having been about formalizing an intuition of data “too good to be true” using chi-squared.
It is less well known, however, that in 1936, the great British statistician and biologist R. A. Fisher analyzed Mendel’s data and found that the fit to Mendel’s theoretical expectations was too good (Fisher 1936). Using χ2 analysis, Fisher found that the probability of obtaining a fit as good as Mendel’s was only 7 in 100,000. (source)
And this PDF or this page say pretty much the same:
Incidentally a very high P (>0.9) is suspicious, as it means that the results are just too good to be true! This suggests that there is some bias in the experiment, whether deliberate or accidental.
So, ISTM, gwern’s analysis here leads to the “too good to be true” conclusion.
There’s no way that the 1979 data was fabricated to fit the 1995 percentages, is there?
No, I’m quite confident the 1979 document is genuine (call it 100% minus a hair). Just what the data represents is something else again—by the authors’ own admission they worked with a biased sample.
The 1995 sample, assuming it is genuine, is quite unbiased—since it is (claimed to be) the entire population.
I’m also not sure what your alternate hypotheses are.
To me it seems quite likely that the 1995 “results” are artifactual: my main theory is that someone heard an oral presentation from the person cited as the author, conflated that presentation in their mind with the 1979 data, and a few years later presented a chimera of the two, attributing it to the speaker. Later authors just copied and pasted the claim and reference, neglecting to fact-check it.
the probabilities you’re trying to calculate for it aren’t correct
I’m willing to accept that. But if we agree that the close fit is suspicious, then I would hazard that we have some mathematical background for that intuition, and if so there must be at least some way of formalizing that intuition which is better than saying “I just don’t know”.
Conversely, if that intuition is in fact ungrounded (perhaps for the same reason we call “too improbable to be a coincidence” a winning lottery draw which pattern-matches something significant to us, like a birth date), there should be a way of formalizing that.
So you wouldn’t be surprised by my hypothetical scenario, where a family of 9 is claimed to poll exactly the same as the results in a national election?
No, I would be surprised, but that is due to my background knowledge that a family unit implies all sorts of mutual correlations, ranging from growing up (if one’s parents are Republicans, one is almost surely a Republican as well) to location (most states are not equally split ideologically), and worries about biases and manipulations and selection effects (“This Iowa district voted for the winning candidate in the last 7 elections!”).
On the other hand, if you simply told me that 9 random people split 5-4 for Obama, I would simply shrug and say, “Well, yeah. Obama had the majority, and in a sample of 9 people, a 5-4 split for him is literally the single most likely outcome possible—every other split like 9-0 is further removed from the true underlying probability that ~52% of people voted for him. It’s not all that likely, but you could say that about every lottery winner or every single sequence you get when flipping a fair coin n times: each possible winner had just a one in millions chance of winning, or each sequence had a 0.5^n chance of happening. But, something had to happen, someone had to win the lottery, some sequence had to be produced by the final coin flip.”
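(Taking the ~52% figure at face value, the “single most likely outcome” claim is easy to check:)
round(dbinom(0:9, size = 9, prob = 0.52), 3)   # probability of each possible Obama count out of nine voters
which.max(dbinom(0:9, 9, 0.52)) - 1            # 5: a 5-4 split is the modal outcome, at roughly 25%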
I think I’ve just spotted at least one serious mistake, so give me some time to clean this up. Probably I can do the same thing in R.