The Universal Medical Journal Article Error
TL;DR: When people read a journal article that concludes, “We have proved that it is not the case that for every X, P(X)”, they generally credit the article with having provided at least weak evidence in favor of the proposition ∀x !P(x). This is not necessarily so.
Authors using statistical tests are making precise claims, which must be quantified correctly. Pretending that all quantifiers are universal because we are speaking English is one error. It is not, as many commenters are claiming, a small error. ∀x !P(x) is very different from !∀x P(x).
A more-subtle problem is that when an article uses an F-test on a hypothesis, it is possible (and common) to fail the F-test for P(x) with data that supports the hypothesis P(x). The 95% confidence level was chosen for the F-test in order to count false positives as much more expensive than false negatives. Applying it therefore removes us from the world of Bayesian logic. You cannot interpret the failure of an F-test for P(x) as being even weak evidence for not P(x).
I used to teach logic to undergraduates, and they regularly made the same simple mistake with logical quantifiers. Take the statement “For every X there is some Y such that P(X,Y)” and represent it symbolically:
∀x∃y P(x,y)
Now negate it:
!∀x∃y P(x,y)
You often don’t want a negation to be outside quantifiers. My undergraduates would often just push it inside, like this:
∀x∃y !P(x,y)
If you could just move the negation inward like that, then these claims would mean the same thing:
A) Not everything is a raven: !∀x raven(x)
B) Everything is not a raven: ∀x !raven(x)
To move a negation inside quantifiers, flip each quantifier that you move it past.
!∀x∃y P(x,y) = ∃x!∃y P(x,y) = ∃x∀y !P(x,y)
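If you want to convince yourself of that rule mechanically, here is a throwaway brute-force check (mine, not part of the original argument) over a small finite domain: the properly flipped form always agrees with the original negation, while the naively “pushed inside” form does not.

```python
from itertools import product

X = Y = range(3)
domain = list(product(X, Y))

mismatch_flipped = mismatch_naive = 0
# Enumerate every possible predicate P on the 3x3 domain (2^9 of them).
for bits in product([False, True], repeat=len(domain)):
    P = dict(zip(domain, bits))
    not_forall_exists = not all(any(P[(x, y)] for y in Y) for x in X)  # !∀x∃y P(x,y)
    exists_forall_not = any(all(not P[(x, y)] for y in Y) for x in X)  # ∃x∀y !P(x,y)
    forall_exists_not = all(any(not P[(x, y)] for y in Y) for x in X)  # ∀x∃y !P(x,y)
    mismatch_flipped += (not_forall_exists != exists_forall_not)
    mismatch_naive += (not_forall_exists != forall_exists_not)

print(mismatch_flipped, mismatch_naive)  # 0 mismatches vs. many mismatches
```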
Here are the findings of a 1981 article [1] from JAMA Psychiatry (formerly Archives of General Psychiatry), back in the days when the medical establishment was busy denouncing the Feingold diet:
Previous studies have not conclusively demonstrated behavioral effects of artificial food colorings … This study, which was designed to maximize the likelihood of detecting a dietary effect, found none.
Now pay attention; this is the part everyone gets wrong, including most of the commenters below.
The methodology used in this study, and in most studies, is as follows:
Divide subjects into a test group and a control group.
Administer the intervention to the test group, and a placebo to the control group.
Take some measurement that is supposed to reveal the effect they are looking for.
Compute the mean and standard deviation of that measure for the test and control groups.
Do either a t-test or an F-test of the hypothesis that the intervention causes a statistically-significant effect on all subjects.
If the test succeeds, conclude that the intervention causes a statistically-significant effect (CORRECT).
If the test does not succeed, conclude that the intervention does not cause any effect to any subjects (ERROR).
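For readers who want the recipe above in concrete form, here is a minimal sketch (my own illustration with invented numbers and group sizes, not the authors’ actual analysis) of the test-and-conclude steps using a two-sample t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical behavioral measure (events per hour) for each group.
treatment = rng.normal(loc=20, scale=6, size=30)   # received the intervention
control = rng.normal(loc=17, scale=6, size=30)     # received the placebo

# Test the null hypothesis that both samples come from the same distribution,
# i.e. that every subject has the same (zero) response to the intervention.
t_stat, p_value = stats.ttest_ind(treatment, control)

if p_value < 0.05:
    print("Statistically significant difference between groups.")        # valid conclusion
else:
    print("Failed to reject the null -- note this is NOT a demonstration "
          "that the intervention has no effect on any subject.")         # the error
```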
People make the error because they forget to explicitly state what quantifiers they’re using. Both the t-test and the F-test work by assuming that every subject has the same response function to the intervention:
response = effect + normally distributed error
where the effect is the same for every subject. If you don’t understand why that is so, read the articles about the t-test and the F-test. The null hypothesis is that the responses of all subjects in both groups were drawn from the same distribution. The one-tailed versions of the tests take a confidence level C and compute a cutoff Z such that, if the null hypothesis is false,
P(average effect(test) - average effect(control) < Z) = C
ADDED: People are making comments proving they don’t understand how the F-test works. This is how it works: You are testing the hypothesis that two groups respond differently to food dye.
Suppose you measured the number of times a kid shouted or jumped, and you found that kids fed food dye shouted or jumped an average of 20 times per hour, and kids not fed food dye shouted or jumped an average of 17 times per hour. When you run your F-test, you compute that, assuming all kids respond to food dye the same way, you need a difference of 4 to conclude with 95% confidence that the two distributions (test and control) are different.
If the food dye kids had shouted/jumped 21 times per hour, the study would conclude that food dye causes hyperactivity. Because they shouted/jumped only 20 times per hour, it failed to prove that food dye affects hyperactivity. You can only conclude that food dye affects behavior with 84% confidence, rather than the 95% you desired.
Finding that food dye affects behavior with 84% confidence should not be presented as proof that food dye does not affect behavior!
If half your subjects have a genetic background that makes them resistant to the effect, the threshold for the t-test or F-test will be much too high to detect that. If 10% of kids become more hyperactive and 10% become less hyperactive after eating food coloring, such a methodology will never, ever detect it. A test done in this way can only accept or reject the hypothesis that for every subject x, the effect of the intervention is different than the effect of the placebo.
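Here is a small simulation (my own invented numbers, not data from any study) of exactly that 10%-up/10%-down scenario; the group-means test rejects at essentially the false-positive rate, no matter how strongly the responders respond.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def one_study(n=100):
    # 10% of kids respond +5 events/hr to dye, 10% respond -5, 80% not at all.
    effect = rng.choice([5.0, -5.0, 0.0], size=n, p=[0.1, 0.1, 0.8])
    dye_group = 17 + effect + rng.normal(0, 4, size=n)
    placebo_group = 17 + rng.normal(0, 4, size=n)
    return stats.ttest_ind(dye_group, placebo_group).pvalue

rejections = np.mean([one_study() < 0.05 for _ in range(2000)])
print(f"Fraction of simulated studies detecting 'an effect': {rejections:.2f}")
# Hovers near 0.05, even though 20% of the children respond strongly.
```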
So. Rephrased to say precisely what the study found:
This study tested and rejected the hypothesis that artificial food coloring affects behavior in all children.
Converted to logic (ignoring time):
!( ∀child ( eats(child, coloring) ⇨ behaviorChange(child) ) )
Move the negation inside the quantifier:
∃child !( eats(child, coloring) ⇨ behaviorChange(child) )
Translated back into English, this study proved:
There exist children for whom artificial food coloring does not affect behavior.
However, this is the actual final sentence of that paper:
The results of this study indicate that artificial food colorings do not affect the behavior of school-age children who are claimed to be sensitive to these agents.
Translated into logic:
!∃child ( eats(child, coloring) ⇨ behaviorChange(child) )
or, equivalently,
∀child !( eats(child, coloring) ⇨ behaviorChange(child) )
This refereed medical journal article, like many others, made the same mistake as my undergraduate logic students, moving the negation across the quantifier without changing the quantifier. I cannot recall ever seeing a medical journal article prove a negation and not make this mistake when stating its conclusions.
A lot of people are complaining that I should just interpret their statement as meaning “Food colorings do not affect the behavior of MOST school-age children.”
But they didn’t prove that food colorings do not affect the behavior of most school-age children. They proved that there exists at least one child whose behavior food coloring does not affect. That isn’t remotely close to what they have claimed.
For the record, the conclusion is wrong. Studies that did not assume that all children were identical, such as studies that used each child as his or her own control by randomly giving them cookies containing or not containing food dye [2], or a recent study that partitioned the children according to single-nucleotide polymorphisms (SNPs) in genes related to food metabolism [3], found large, significant effects in some children or some genetically-defined groups of children. Unfortunately, reviews failed to distinguish the logically sound from the logically unsound articles, and the medical community insisted that food dyes had no influence on behavior until thirty years after their influence had been repeatedly proven.
[1] Jeffrey A. Mattes & Rachel Gittelman (1981). Effects of Artificial Food Colorings in Children With Hyperactive Symptoms: A Critical Review and Results of a Controlled Study. Archives of General Psychiatry 38(6):714-718. doi:10.1001/archpsyc.1981.01780310114012.
[2] K.S. Rowe & K.J. Rowe (1994). Synthetic food coloring and behavior: a dose response effect in a double-blind, placebo-controlled, repeated-measures study. The Journal of Pediatrics Nov;125(5 Pt 1):691-8.
[3] Stevenson, Sonuga-Barke, McCann et al. (2010). The Role of Histamine Degradation Gene Polymorphisms in Moderating the Effects of Food Additives on Children’s ADHD Symptoms. Am J Psychiatry 167:1108-1115.
You claim that medical researchers are doing logical inference incorrectly. But they are in fact doing statistical inference and arguing inductively.
Statistical inference and inductive arguments belong in a Bayesian framework. You are making a straw man by translating them into a deductive framework.
No. Mattes and Gittelman’s finding is stronger than your rephrasing—your rephrasing omits evidence useful for Bayesian reasoners. For instance, they repeatedly pointed out that they “[studied] only children who were already on the Feingold diet and who were reported by their parents to respond markedly to artificial food colorings.” They claim that this is important because “the Feingold diet hypothesis did not originate from observations of carefully diagnosed children but from anecdotal reports on children similar to the ones we studied.” In other words, they are making an inductive argument:
Most evidence for the Feingold diet hypothesis comes from anecdotal reports.
Most of these anecdotal reports are mistaken.
Thus, there is little evidence for the Feingold diet hypothesis.
Therefore, the Feingold diet hypothesis is wrong.
If you translate this into a deductive framework, of course it will not work. Their paper should be seen in a Bayesian framework, and in this context, their final sentence
translates into a correct statement about the evidence resulting from their study.
They are not making this mistake. You are looking at a straw man.
Full-texts:
Mattes and Gittelman (1981)
Rowe and Rowe (1994)
Stevenson et al. (2010)
The number of upvotes for the OP is depressing.
It’s a good example for forcing your toolset into every situation you encounter. If all you have is a hammer …
Don’t worry, we’ll have Metamed to save us!
Well, dammit, I wanted to delete this and rewrite above, but you can’t delete comments anymore. This is not retracted, but I can’t un-retract it.
You are wrong, and you have not learned to reconsider your logic when many smart people disagree with you.
You can delete retracted comments by reloading the page and clicking on a new delete icon that replaces the retract icon.
Only if no-one’s replied to them.
I’m not sure that’s true. See here
People can reply to any comments that they still see in their browser page, even though they’ve been “deleted”, if the replier has not refreshed said browser page.
EDIT TO ADD: As I see that wedrifid also mentions below.
Possibly there is also a similar effect if the deleter hasn’t refreshed his browser page.
Possibly. Specifically it would be if you (as the example) had retracted the page then refreshed it (to get the ‘delete’ button available to you) and then there is an arbitrary period of time after which you click the delete button without first refreshing again. (Untested, but those are the circumstances under which it would be at all possible if the code is not specifically designed to prevent it.)
Why are you not sure of facts that are subject to easy experiments? (update: arundelo is correct)
Experiment clutters the venue, and being less blunt avoids the appearance of a status conflict.
If deletion is possible, there is very little clutter. If deletion is not possible, and the comment says “I can’t figure out how to delete this,” at least it discourages other people’s experiments. But this thread is itself clutter, so I don’t think that is your true rejection. As to bluntness, I conclude that my being less blunt caused you to confabulate bullshit.
PS—I experiment on the open thread.
On reflection, it is probably more accurate for me to say, “I wasn’t interested in experimenting, including for concern that the experimenting would look low status, and I have higher preferred ways of acting low status.”
As for my own choice not to be blunt, you are not correctly modelling my thought process.
In short, I gave two reasons for my action, and you might be right that one was confabulation, but not the one you identify as confabulation.
I have performed the experiment in question and it seems to support arundelo’s claim. I am not able to remove this comment. At the very least it demonstrates that the experiment required to prove arundelo’s fully general claim is false is not the ‘easy’ one.
Well, now I’m totally confused. Checking Eugine_Nier’s account on ibiblio shows that the comment is missing. (Searching for the word “sarcasm” will get you to about when the comment took place, at least as of the date of this comment)
See my investigation. Short answer: race condition.
Thanks for actually experimenting. My beliefs were two months out of date. I stand by my objection to Tim’s comment.
It is possible that the comment was banned by a moderator rather than deleted by the author. (If so, it will still appear if you look at the user’s comment page.)
EDIT after retraction: TimS, I can’t seem to delete this comment even after refreshing.
As it happens, I remember what Eugine_Nier wrote, and I am certain it did not meet the local criteria for mod-blocking.
(Anonymous downvoter: What is it in wedrifid’s post you’d like to see less of? Helpful commentary about the mechanics of this site is not on my list of things to downvote).
Interesting. This suggests that a feature has changed at some point since the retraction-then-delete feature was first implemented. (I have memories of needing to be careful to edit the text to blank then retract so as to best emulate the missing ‘delete’ feature.)
I notice that I am confused. Investigates.
Testing deletion feature. Deletion of (grandparent) comment that you have already replied to: Fail. It is still not (usually) possible to delete comments with replies.
Check for moderator deletion. (ie. Moderator use of the ban feature, actual delete per se is extremely rare). Confirm absence of a reply on Eugine_Nier’s page that fits that part of history. The comment is, indeed, deleted not banned.
Check timestamps for plausibility of race condition. Ahh. Yes. Tim, you replied to Eugine within 3 minutes of him writing the comment. This means that most likely Eugine deleted his message while you were writing your reply. Your comment was still permitted to be made despite the deleted parent. The reverse order may also be possible, depending on the details of implementation. Either way, the principle is the same.
ArisKatsaris suggests browser refresh, not timestamps, is the issue.
He is describing the same phenomenon. The timestamps give an indication as to how likely the race condition is to occur based on the delays between GETs and POSTs. If the comments were a day apart I would have tentatively suggested “Perhaps one of you deleted or replied to a comments page that was old?”. Whereas given that the timestamps were within 3 minutes I could more or less definitively declare the question solved.
Thanks. I’m not technologically fluent enough to tell the difference between what you said and what he said without the explanation.
For the record, I did in fact delete the comment.
Jaynes argued that probability theory was an extension of logic, so this seems like quite a quibbling point.
They do, but did the paper he dealt with write within a Bayesian framework? I didn’t read it, but it sounded like standard “let’s test a null hypothesis” fare.
Which is not a valid objection to Phil’s analysis if Mattes and Gittelman weren’t doing a Bayesian analysis in the first place. Were they? I’ll apologize for not checking myself if I’m wrong, but right now my priors are extremely low so I don’t see value in expending the effort to verify.
If they did their calculations in a Bayesian framework. Did they?
You don’t just ignore evidence because someone used a hypothesis test instead of your favorite Bayesian method. P(null | p value) != P(null)
I ignore evidence when the evidence doesn’t relate to the point of contention.
Phil criticized a bit of paper, noting that the statistical analysis involved did not justify the conclusion made. The conclusion did not follow the analysis. Phil was correct in that criticism.
It’s just not an argument against Phil that someone might take some of the data in the paper and do a Bayesian analysis that the authors did not do.
That’s not what I’m saying. I’m saying that what the authors did do IS evidence against the hypothesis in question. Evidence against a homogenous response is evidence against any response (it makes some response less likely)
What they did do?
Are you saying the measurements they took make their final claim more likely, or that their analysis of the data is correct and justifies their claim?
Yes, if you arrange things moderately rationally, evidence against a homogenous response is evidence against any response, but much less so. I think Phil agrees with that too, and is objecting to a conclusion based on much less so evidence pretending to have much more justification than it does.
Ok, yeah, translating what the researchers did into a Bayesian framework isn’t quite right either. Phil should have translated what they did into a frequentist framework—i.e. he still straw manned them. See my comment here.
I know that. That’s not the point. They claimed to have proven something they did not prove. They did not present this claim in a Bayesian framework.
No. I am not attacking the inductive argument in your points 1-4 above, which is not made in the paper, is not the basis for their claims, and is not what I am talking about.
You speak of the evidence from their study, but apparently you have not looked at the evidence from their study, presented in table 3. If you looked at the evidence you would see that they have a large number of measures of “hyperactivity”, and that they differed between test and control groups. They did not find that there was no difference between the groups. There is always a difference between the groups.
What they did, then, was do an F-test to determine whether the difference was statistically significant, using the assumption that all subjects respond the same way to the intervention. They make that assumption, come up with an F-value, and say, “We did not reach this particular F-value, therefore we did not prove the hypothesis that food dye causes hyperactivity.”
THEY DID NOT ASK WHETHER FOOD DYE INCREASED OR DECREASED HYPERACTIVITY BETWEEN THE GROUPS. That is not how an F-test works. They were, strictly speaking, testing the hypothesis whether the two groups differed, not in which direction they differed.
THERE WAS NO EVIDENCE THAT FOOD DYE DOES NOT CAUSE HYPERACTIVITY IN THEIR DATA. Not even interpreted in a Bayesian framework. They found a difference in behavior, they computed an F-value for 95% confidence assuming population homogeneity, and they did not reach that F-value.
Go back and read the part I added, with the bulleted list. You are trying to get all subtle. No; these people did an F-test, which gave a result of the form “It is not the case that for all x, P(x)”, and they interpreted that as meaning “For all x, it is not the case that P(x).”
I don’t think you responded to my criticisms and I have nothing further to add. However, there are a few critical mistakes in what you have added that you need to correct:
No, Mattes and Gittelman ran an order-randomized crossover study. In crossover studies, subjects serve as their own controls and they are not partitioned into test and control groups.
No, the correct form is:
The tests compute a difference in magnitude of response such that if the null hypothesis is true, then 95% of the time the measured effect is not that large.
The form you quoted is a deadly undergraduate mistake.
This is wrong. There are reasonable prior distributions for which the observation of a small positive sample difference is evidence for a non-positive population difference. For example, this happens when the prior distribution for the population difference can be roughly factored into a null hypothesis and an alternative hypothesis that predicts a very large positive difference.
In particular, contrary to your claim, the small increase of 3 can be evidence that food dye does not cause hyperactivity if the prior distribution can be factored into a null hypothesis and an alternative hypothesis that predicts a positive response much greater than 3. This is analogous to one of Mattes and Gittelman’s central claims (they claim to have studied children for which the alternative hypothesis predicted a very large response).
I read through most of the comments and was surprised that so little was made of this. Thanks, VincentYu. For anyone who could use a more general wording, it’s the difference between:
P(E≥S|H) the probability P of the evidence E being at least as extreme as test statistic S assuming the hypothesis H is true, and
P(H|E) the probability P of the hypothesis H being true given the evidence E.
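A toy simulation (entirely made-up prior and effect size, just to make the distinction vivid): set the p-value cutoff at 0.05 and then ask how often the null hypothesis is actually true among the “significant” results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_experiments, n_per_group = 5000, 20
prior_h_true = 0.1      # assume only 10% of tested hypotheses are real
effect_size = 0.5       # assumed effect (in SD units) when the hypothesis is real

null_true = rng.random(n_experiments) > prior_h_true
pvals = np.array([
    stats.ttest_ind(
        rng.normal(0.0 if null else effect_size, 1, n_per_group),
        rng.normal(0.0, 1, n_per_group),
    ).pvalue
    for null in null_true
])

significant = pvals < 0.05
print("P(E >= S | H0) used as cutoff:", 0.05)
print("P(H0 | significant) in this toy world:", round(null_true[significant].mean(), 2))
# The second number comes out far above 0.05 -- the two probabilities are not the same thing.
```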
This is going to be yet another horrible post. I just go meta and personal. Sorry.
I don’t understand how this thread (and a few others like it) on stats can happen; in particular, your second point (re: the basic mistake). It is the single solitary thing any person who knows any stats at all knows. Am I wrong? Maybe ‘knows’ meaning ‘understands’. I seem to recall the same error made by Gwern (and pointed out). I mean the system works in the sense that these comments get upvoted, but it is like. . . people having strong technical opinions with very high confidence about Shakespeare without being able to write out a sentence. It is not inconceivable the opinions are good (stroke, language, etc), but it says something very odd about the community that it happens regularly and is not extremely noticed. My impression is that Less Wrong is insane on statistics, particularly, and some areas of physics (and social aspects of science and philosophy).
I didn’t read the original post, paper, or anything other than some comment by Goetz which seemed to show he didn’t know what a p-value was and had a gigantic mouth. It’s possible I’ve missed something basic. Normally, before concluding there’s madness in the world, I’d be careful. For me to be right here means madness is very very likely (e.g., if I correctly guess it’s −70 outside without checking any data, I know something unusual about where I live).
Many people with statistics degrees or statisticians or statistics professors make the p-value fallacy; so perhaps your standards are too high if LWers merely being as good as statistics professors comes as a disappointment to you.
I’ve pointed out the mis-interpretation of p-values many times (most recently, by Yvain), and wrote a post with the commonness of the misinterpretation as a major point (http://lesswrong.com/lw/g13/against_nhst/), so I would be a little surprised if I have made that error.
Sorry, Gwern, I may be slandering you, but I thought I noticed it long before that (I’ve been reading, despite my silence). Another thing I have accused you of, in my head, is a failure to appropriately apply a multiple test correction when doing some data exploration for trends in the less wrong survey. Again, I may have you misidentified. Such behavior is striking, if true, since it seems to me one of the most basic complaints Less Wrong has about science (somewhat incorrectly).
Edited: Gwern is right (on my misremembering). Either I was skimming and didn’t notice Gwern was quoting or I just mixed corrector with corrected. Sorry about that. In possible recompense: What I would recommend you do for data exploration is decide ahead of time if you have some particularly interesting hypothesis or not. If not and you’re just going to check lots of stuff, then commit to that and the appropriate multiple test correction at the end. That level of correction then also saves your ‘noticing’ something interesting and checking it specifically being circular (because you were already checking ‘everything’ and correcting appropriately).
It’s true I didn’t do any multiple correction for the 2012 survey, but I think you’re simply not understanding the point of multiple correction.
First, ‘Data exploration’ is precisely when you don’t want to do multiple correction, because when data exploration is being done properly, it’s being done as exploration, to guide future work, to discern what signals may be there for followup. But multiple correction controls the false positive rate at the expense of then producing tons of false negatives; this is not a trade-off we want to make in exploration. If you look at the comments, dozens of different scenarios and ideas are being looked at, and so we know in advance that any multiple correction is going to trash pretty much every single result, and so we won’t wind up with any interesting hypotheses at all! Predictably defeating the entire purpose of looking. Why would you do this wittingly? It’s one thing to explore data and find no interesting relationships at all (shit happens), but it’s another thing entirely to set up procedures which nearly guarantee that you’ll ignore any relationships you do find. And which multiple correction, anyway? I didn’t come up with a list of hypotheses and then methodically go through them, I tested things as people suggested them or I thought of them; should I have done a single multiple correction of them all yesterday? (But what if I think of a new hypothesis tomorrow...?)
Second, thresholds for alpha and beta are supposed to be set by decision-theoretic considerations of cost-benefit. A false positive in medicine can be very expensive in lives and money, and hence any exploratory attitude, or undeclared data mining/dredging, is a serious issue (and one I fully agree with Ioannidis on). In those scenarios, we certainly do want to reduce the false positives even if we’re forced to increase the false negatives. But this is just an online survey. It’s done for personal interest, kicks, and maybe a bit of planning or coordination by LWers. It’s also a little useful for rebutting outside stereotypes about intellectual monoculture or homogeneity. In this context, a false positive is not a big deal, and no worse than a false negative. (In fact, rather than sacrifice a disproportionate amount of beta in order to decrease alpha more, we might want to actually increase our alpha!)
This cost-benefit is a major reason why if you look through my own statistical analyses and experiments, I tend to only do multiple correction in cases where I’ve pre-specified my metrics (self-experiments are not data exploration!) and where a false positive is expensive (literally, in the case of supplements, since they cost a non-trivial amount of $ over a lifetime). So in my Zeo experiments, you will see me use multiple correction for melatonin, standing, & 2 Vitamin D experiments (and also in a recent non-public self-experiment); but you won’t see any multiple correction in my exploratory weather analysis.
See above on why this is pointless and inappropriate.
If you were doing it at the end, then this sort of ‘double-testing’ would be a concern as it might lead your “actual” number of tests to differ from your “corrected against” number of tests. But it’s not circular, because you’re not doing multiple correction. The positives you get after running a bunch of tests will not have a very high level of confidence, but that’s why you then take them as your new fixed set of specific hypotheses to run against the next dataset and, if the results are important, then perhaps do multiple correction.
So for example, if I cared that much about the LW survey results from the data exploration, what I should ideally do is collect the n positive results I care about, announce in advance the exact analysis I plan to do with the 2013 dataset, and decide in advance whether and what kind of multiple correction I want to do. The 2012 results using 2012 data suggest n hypotheses, and I would then actually test them with the 2013 data. (As it happens, I don’t care enough, so I haven’t.)
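To make the trade-off described above concrete, here is a rough sketch (invented effect sizes and counts, nothing to do with the actual survey data): forty exploratory tests, four of which have a real but modest effect, compared with and without a Benjamini-Hochberg correction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

n_tests, n = 40, 50
has_real_effect = np.zeros(n_tests, dtype=bool)
has_real_effect[:4] = True

pvals = np.array([
    stats.ttest_ind(
        rng.normal(0.4 if real else 0.0, 1, n),  # modest real effect in 4 of the 40 tests
        rng.normal(0.0, 1, n),
    ).pvalue
    for real in has_real_effect
])

def benjamini_hochberg(p, alpha=0.05):
    """Boolean mask of discoveries under the BH step-up procedure."""
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    keep = np.zeros(m, dtype=bool)
    keep[order[:k]] = True
    return keep

print("uncorrected p < 0.05:", int((pvals < 0.05).sum()),
      "of which real:", int((has_real_effect & (pvals < 0.05)).sum()))
print("after BH correction:", int(benjamini_hochberg(pvals).sum()),
      "of which real:", int((has_real_effect & benjamini_hochberg(pvals)).sum()))
```

Whether losing the marginal true positives is an acceptable price for fewer false positives is exactly the cost-benefit question raised above.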
Gwern, I should be able to say that I appreciate the time you took to respond (which is snarky enough), but I am not able to do so. You can’t trust that your response to me is inappropriate and I can’t find any reason to invest myself in proving your response is inappropriate. I’ll agree my comment to you was somewhat inappropriate and while turnabout is fair play (and first provocation warrants an added response), it is not helpful here (whether deliberate or not). Separate from that, I disagree with you (your response is, historically, how people have managed to be wrong a lot). I’ll retire once more.
I believe it was suggested to me when I first asked the potential value of this place that they could help me with my math.
Nope, I don’t think you have. Not everyone is crazy, but scholarship is pretty atrocious.
I think you’re interpreting the F test a little more strictly than you should. Isn’t it fairer to say a null result on a F test is “It is not the case that for most x, P(x)”, with “most” defined in a particular way?
You’re correct that a F-test is miserable at separating out different classes of responders. (In fact, it should be easy to develop a test that does separate out different classes of responders; I’ll have to think about that. Maybe just fit a GMM with three modes in a way that tries to maximize the distance between the modes?)
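(A minimal sketch of that mixture idea, with invented per-child response scores and scikit-learn’s GaussianMixture standing in for whatever test one would actually design; comparing BIC for one versus three components is one way to ask whether the subgroups are real.)

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)

# Hypothetical per-child "response to dye" scores: most children near zero,
# small groups of positive and negative responders.
responses = np.concatenate([
    rng.normal(0.0, 1.0, 160),
    rng.normal(4.0, 1.0, 20),
    rng.normal(-4.0, 1.0, 20),
]).reshape(-1, 1)

one = GaussianMixture(n_components=1, random_state=0).fit(responses)
three = GaussianMixture(n_components=3, random_state=0).fit(responses)

print("BIC, 1 component:", round(one.bic(responses)))
print("BIC, 3 components:", round(three.bic(responses)))  # lower = preferred
for mean, weight in sorted(zip(three.means_.ravel(), three.weights_)):
    print(f"  component mean {mean:+.1f}, weight {weight:.2f}")
```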
But I think the detail that you suppressed for brevity also makes a significant difference in how the results are interpreted. This paper doesn’t make the mistake of saying “artificial food coloring does not cause hyperactivity in every child, therefore artificial food coloring affects no children.” The paper says “artificial food coloring does not cause hyperactivity in every child whose parents confidently expect them to respond negatively to artificial food coloring, therefore their parents’ expectation is mistaken at the 95% confidence level.”
Now, it could be the case that there are children who do respond negatively to artificial food coloring, but the Feingold association is terrible at finding them / rejecting those children where it doesn’t have an effect. (This is unsurprising from a Hawthorne Effect or confirmation bias perspective.) As well, for small sample sizes, it seems better to use F and t tests than to try to separate out the various classes of responders, because the class sizes will be tiny; if one child responds poorly after being administered artificial food dye, that’s not much to go on, compared to a distinct subpopulation of 20 children in a sample of 1000.
The section of the paper where they describe their reference class:
(I should add that the first sentence is especially worth contemplating, here.)
I think I disagree with both of you here. The failure to reject a null hypothesis is a failure. It doesn’t allow or even encourage you to conclude anything.
Can you conclude that you failed to reject the null hypothesis? And if you expected to reject the null hypothesis, isn’t that failure meaningful? (Note that my language carefully included the confidence value.)
As a general comment, this is why the Bayesian approach is much more amenable to knowledge-generation than the frequentist approach. The statement “the hyperactivity increase in the experimental group was 0.36+/-2.00, and that range solidly includes 0” (with the variance of that estimate pulled out of thin air) is much more meaningful than “we can’t be sure it’s not zero.”
I agree with the second sentence, and the first might be true, but the second isn’t evidence for the first; interval estimation vs. hypothesis testing is an independent issue to Bayesianism vs. frequentism. There are Bayesian hypothesis tests and frequentist interval estimates.
Agreed that both have those tools, and rereading my comment I think “approach” may have been a more precise word than “statistics.” If you think in terms of “my results are certain, reality is uncertain” then the first tool you reach for is “let’s make an interval estimate / put a distribution on reality,” whereas if you think in terms of “reality is certain, my results are uncertain” then the first tool you reach for is hypothesis testing. Such defaults have very important effects on what actually gets used in studies.
To me, but not to the theoretical foundations of the method employed.
Hypothesis testing generally works sensibly because people smuggle in intuitions that aren’t part of the foundations of the method. But since they’re only smuggling things in under a deficient theoretical framework, they’re given to mistakes, particularly when they’re applying their intuitions to the theoretical framework and not the base data.
I agree with the later comment on Bayesian statistics, and I’d go further. Scatterplot the labeled data, or show the distribution if you have tons of data. That’s generally much more productive than any particular particular confidence interval you might construct.
It would be an interesting study to compare the various statistical tests on the same hypothesis versus the human eyeball. I think the eyeball will hold its own.
[1] Jeffrey A. Mattes & Rachel Gittelman (1981). Effects of Artificial Food Colorings in Children With Hyperactive Symptoms: A Critical Review and Results of a Controlled Study. Archives of General Psychiatry 38(6):714-718. doi:10.1001/archpsyc.1981.01780310114012. ungated
[2] K.S. Rowe & K.J. Rowe (1994). Synthetic food coloring and behavior: a dose response effect in a double-blind, placebo-controlled, repeated-measures study. The Journal of Pediatrics Nov;125(5 Pt 1):691-8. ungated
[3 open access] Stevenson, Sonuga-Barke, McCann et al. (2010). The Role of Histamine Degradation Gene Polymorphisms in Moderating the Effects of Food Additives on Children’s ADHD Symptoms. Am J Psychiatry 167:1108-1115.
I wouldn’t have posted this if I’d noticed earlier links, but independent links are still useful.
The F test / t test doesn’t quite say that. It makes statements about population averages. More specifically, if you’re comparing the mean of two groups, the t or F test says whether the average response of one group is the same as the other group. Heterogeneity just gets captured by the error term. In fact, econometricians define the error term as the difference between the true response and what their model says the mean response is (usually conditional on covariates).
The fact that the authors ignored potential heterogeneity in responses IS a problem for their analysis, but their result is still evidence against heterogeneous responses. If there really are heterogeneous responses we should see that show up in the population average unless:
The positive and negative effects cancel each other out exactly once you average across the population. (this seems very unlikely)
The population average effect size is nonzero but very small, possibly because the effect only occurs in a small subset of the population (even if it’s large when it does occur) or something similar but more complicated. In this case, a large enough sample size would still detect the effect.
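To put a rough number on that second possibility (a back-of-the-envelope normal-approximation sketch with invented figures, not a claim about this particular study): diluting an effect into a small responder subgroup inflates the required sample size roughly with the square of the dilution.

```python
from scipy import stats

def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate per-group n for a two-sided two-sample test (normal approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

d = 0.8  # assumed effect size among responders, in SD units
for responder_fraction in (1.0, 0.5, 0.1):
    diluted = d * responder_fraction   # population-average effect
    print(f"{responder_fraction:4.0%} responders -> n per group ≈ {n_per_group(diluted):.0f}")
```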
Now it might not be very strong evidence—this depends on sample size and the likely nature of the heterogeneity (or confounders, as Cyan mentions). And in general there is merit in your criticism of their conclusions. But I think you’ve unfairly characterized the methods they used.
Why do you say that? Did you look at the data?
They found F values of 0.77, 2.161, and 1.103. That means they found different behavior in the two groups. But those F-values were lower than the thresholds they had computed assuming homogeneity. They therefore said “We have rejected the hypothesis”, and claimed that the evidence, which, interpreted in a Bayesian framework, might support that hypothesis, refuted it.
I didn’t look at the data. I was commenting on your assessment of what they did, which showed that you didn’t know how the F test works. Your post made it seem as if all they did was run an F test that compared the average response of the control and treatment groups and found no difference.
That’s an uncharitable interpretation of that sentence. It would mean that if there was a word such as “any” before the phrase “school-age children”, but there isn’t. The zero article before plural nouns in English doesn’t generally denote a universal quantifier; “men are taller than women” doesn’t mean ∀x ∈ {men} ∀y ∈ {women} x.height > y.height. The actual meaning of the zero article before plural nouns in English is context-dependent and non-trivial to formalize.
Are you a non-native English speaker by any chance? (So am I FWIW, but the definite article in my native language has a very similar meaning to the zero article in English in contexts like these.)
Suppose there is one school-age child, somewhere in the world, whose behavior is affected by artificial food colorings, and who is claimed to be sensitive to food coloring. Then the statement, “artificial food colorings do not affect the behavior of school-age children who are claimed to be sensitive to these agents,” is false.
You shouldn’t think of this paper as being in English. You should think of it as being written in Science.
It isn’t uncharitable. Even if they had said, “Artificial food colorings do not affect the behavior of MOST school-age children who are claimed to be sensitive to these agents,” it would still be a false claim, unsupported by their data and math. They proved that THERE EXIST children who are not sensitive to these agents. 5% may be enough.
Science != Pure Mathematics.
Yes, you can “prove” very little outside pure mathematics. But “doesn’t prove” doesn’t imply “doesn’t support”. Chapter 1 of Probability Theory by E. T. Jaynes makes that clear.
(And BTW, how comes you’re taking “school-age children” to mean “all school-age children” but you’re not taking “artificial food colorings” to mean ‘all artificial food colorings’?)
No it fucking isn’t. Read the article I’ve linked to again. “Humans have opposable thumbs” doesn’t stop being true as soon as someone somewhere gets both thumbs amputated.
Dude. Let me break it down for you.
Re-read: Suppose there is one school-age child, somewhere in the world, whose behavior is affected by artificial food colorings, and who is claimed to be sensitive to food coloring. Then the statement, “artificial food colorings do not affect the behavior of school-age children who are claimed to be sensitive to these agents,” is false.
Claim A: Artificial food colorings do not affect the behavior of school-age children who are claimed to be sensitive to these agents.
Hypothesized fact B: There is one school-age child, somewhere in the world, whose behavior is affected by artificial food colorings, and who is claimed to be sensitive to food coloring.
You just said that claim A is true given fact B. I don’t need to read the article you linked to to know that’s wrong. We aren’t talking about how people use English colloquially. We are talking about how they use it making logical claims. And I know, from many years of experience, that when a doctor reads a study that concludes “X does not cause Y”, they interpret it as meaning “X never ever ever causes Y”.
Doesn’t matter anyway, since all they proved was “There exists AT LEAST ONE child not affected by food coloring.” That’s not even close to what they claimed.
If whether this particular paper exemplifies this error is disputed (as it appears to be!) and the author’s claim that he “cannot recall ever seeing a medical journal article prove a negation and not make this mistake” is correct, then it should be easy for the author to give several more examples which more clearly display the argument given here. I would encourage PhilGoetz or someone else to do so.
I will not. I have already spent a great deal of time providing everything necessary to understand my point. People incapable of understanding what I’ve already posted, and incapable of admitting uncertainty about it, will not understand it with more examples.
Interesting. Those two statements seem quite different; more than just a rephrasing.
Probabilistically, it sounds like the study found P(hyper|dye) = P(hyper|~dye), that is, they rejected P(hyper|dye) > P(hyper|~dye), and concluded P(hyper|dye) = P(hyper|~dye) (no connection) correctly.
I think your logical interpretation of their result throws out most of the information. Yes, they concluded that it is not true that all children that ate dye were hyperactive, but they also found that the proportion of dye-eaters who were hyperactive was not different from the base rate, which is a much stronger statement, which does imply their conclusion, but can’t be captured by the logical formulation you gave.
You are making the same mistake by ignoring the quantification. The test used to reject P(hyper|dye) > P(hyper|~dye) uses a cutoff that is set from the sample size using the assumption that all the children have the identical response. They didn’t find P(hyper|dye) = P(hyper|~dye), they rejected the hypothesis that for all children, P(hyper|dye) > P(hyper|~dye), and then inappropriately concluded that for all children, !(P(hyper|dye) > P(hyper|~dye)).
The whole point of inductive reasoning is that this is evidence for artificial food coloring not affecting the behavior of any children (given a statistically significant sample size). You cannot do purely deductive reasoning about the real world and expect to get anything meaningful. This should be obvious.
They measured a difference between the behavior of the test and the control group. They chose an F-value that this difference would have to surpass in order to prove the proposition that food color affects the behavior of all children. The specific number they chose requires the word “all” there. The differences they found were smaller than the F-value. We don’t know whether the differences were or were not large enough to pass an F-value computed for the proposition that food color affects all but one child, or most children, or one-fifth of all children.
Where, exactly, is the evidence that artificial food color doesn’t affect the behavior of any children?
Then this is what you should have critiqued in your post. Ranting about their inductive reasoning being deductively wrong gets you nowhere.
Since your post is the first time I’ve heard of this: I have no idea, but I assume google has the answer.
Why would I critique them for finding values smaller than the F-value? The values were smaller than the F-value. That means the test failed. What I then critiqued was their logical error in interpreting the test’s failure.
I mean where in the paper. There is no evidence in the paper that artificial food color doesn’t affect the behavior of any children.
Your claim that they are using inductive logic shows that you didn’t understand the paper. Your response that I should have critiqued their not finding a high enough F-value shows you really don’t have the first clue about what an F-test is. Please learn a little about what you’re critiquing before you critique it so confidently in the future.
No, what you (originally) critiqued was the lack of rigorous deductive reasoning in their statistical analysis, as shown by both your introduction and conclusion solely focusing on that. Even whatever point you tried to make about the F-values was lost in a rant about deduction.
In your own words you stated the following:
And that sentence (if true as you claim) is inductive evidence for their conclusion. How many times do I have to tell you this?
All statistical reasoning is inductive reasoning. You claiming the opposite shows that you don’t understand statistics.
Since you completely missed my point, I’ll try again: Focus on critiquing what’s statistically wrong in this paper, not what’s deductively wrong. I simply chose that sentence as it seemed to be the most coherent one in your response.
Now, you seem to be under the assumption that I am defending this paper and its conclusion, so let me make it clear that I do not. I have neither read it, nor plan to. I merely found you attacking a statistical analysis for being deductively wrong, and chose to try and help you clear up whatever misunderstanding you had about statistics being a part of deductive reasoning.
I’m guessing you’ve recently started learning about discrete mathematics, and seek to apply your new knowledge on quantifiers to everything you come across. Don’t worry about being wrong, almost everyone goes through such a phase.
So you chose a sentence without understanding it.
There is nothing statistically wrong with the paper.
The error is not in the statistical analysis. The error is in the deductions they made when interpreting it. You are claiming that logic and statistics are like oil and water, and can never co-occur in the same paper. This is incorrect.
As I mentioned in my post, I used to teach logic at a university. So now you have also proved you didn’t read my post. And you have proved you don’t know the difference between logic and discrete math. So now we know that you
don’t know what an F-test is
didn’t read the whole post
don’t know the difference between logic and discrete math
And I also kinda doubt you read the paper. Did you?
I’m sorry that LessWrong has made you stupider, by giving you false confidence to speak with authority on matters you are ignorant of.
As to being wrong, identify an error in anything I’ve said, instead of spouting platitudes about induction without any connection to specific facts and arguments.
And the whole point of science is that it is built on inductive (and not deductive) reasoning.
Well, I’ll give you points for creativity in your straw man, at least.
So, you were a TA then?
No, it merely “proves” that I skimmed the personal biography part of your post in favor of focusing on your actual content.
Please tell me how you “proved” this.
Well, every course in discrete mathematics usually has at least a lesson or two on logic and quantifiers. I just assumed you learned it in such a setting since then you would have had an excuse to not understand it properly (as opposed to having spent an entire course focused solely on logic).
Funny how you use the fact that I skimmed the non-essential parts of your original post as proof that I didn’t read any of it, and then go on to completely ignore what I wrote here:
I also find your usage of the word “proof” very interesting for someone who claims to have taught logic.
Do you always insult people who try to help you? It might help your future discussions if you don’t take criticism so personally.
I have already done so numerous times. Maybe you should try to read my arguments instead of just skimming and assuming?
Identifying an error means taking a specific claim and showing a mistake. Saying “You cannot do purely deductive reasoning about the real world and expect to get anything meaningful” is not identifying an error. Saying that “There exists a child for whom X” is inductive proof of “For all children, X” is ridiculous. It gives a tiny tiny bit of support, but not anything anyone would call a proof, any more than “2+2 = 4” is proof for “For all X, X+2 = 4.” The paper is making errors, but not that one. If you find one child and prove that he’s not affected by food dye, and you write a paper saying “This child’s behavior is not affected by dye, therefore no children’s behavior are affected by dye”, it will not be published. That was not their intent. I doubt anyone has ever published a paper using the “inductive proof” you think is standard.
In light of the fact that you didn’t read the post closely, didn’t read the paper, and don’t understand how an F-test works, you really should stop being so confident. The claims you’re making require you to have done all three. You are claiming that I interpreted their reasoning incorrectly, and you didn’t read it!
It seems you are confused about how statistics work. When you wish to study if group X has property Y, you take a statistically significant sample from group X and see if this sample has property Y. You then use the results from this sample to conclude whether the group as a whole has property Y (with a high or low probability). And this conclusion is always an inductive conclusion, never deductive.
As reported by you in your original post, their sample did not have the property they were looking for and they therefore concluded that the group as a whole does not have this property. You even reported that their statistics was sound. So, where is the error?
Edit to add: In other words, every statistical study ever done has always had a conclusion of the following form:
There exists a statistically significant sample where this property does (not) hold, therefore the property does (not) hold for the whole group.
Which is just the general form of what you critiqued here:
So, by critiquing this study for being deductively wrong, you are in fact critiquing every statistical study ever done for being deductively wrong. Do you now see the problem with this?
Look, what you’ve written above is based on misunderstanding how an F-test works. I’ve already explained repeatedly why what you’re saying here, which is the same thing you’ve said each time before, is not correct.
This study contains a failure of an F-test. Because of how the F-test is structured, failure of an F-test to prove ∀x P(x) is not inductive evidence, nor evidence of any kind at all, that P(x) is false for most x.
I will try to be more polite, but you need to a) read the study, and b) learn how an F-test works, before you can talk about this. But I just don’t understand why you keep making confident assertions about a study you haven’t read, using a test you don’t understand.
The F-test is especially tricky, because you know you’re going to find some difference between the groups. What difference D would you expect to find if there is in fact no effect? That’s a really hard question, and the F-test dodges it by using the arbitrary but standard 95% confidence interval to pick a higher threshold, F. Results between D and F would still support the hypothesis that there is an effect, while results below D would be evidence against that hypothesis. Not knowing what D is, we can’t say whether failure of an F-test is evidence for or against a hypothesis.
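One way to see the D-versus-F gap in numbers (a sketch only; the degrees of freedom below are invented, since this is not a reproduction of the paper’s actual analysis): compare the F values quoted earlier in the thread with both the 95% cutoff and the median of the null distribution.

```python
from scipy import stats

df1, df2 = 1, 40                      # assumed degrees of freedom, purely for illustration
null = stats.f(df1, df2)

cutoff_95 = null.ppf(0.95)            # the threshold the study required ("F")
null_median = null.median()           # a rough stand-in for the no-effect expectation ("D")

for observed in (0.77, 2.161, 1.103): # F values quoted from the paper earlier in the thread
    print(f"F = {observed:5.3f}: p = {null.sf(observed):.2f}; "
          f"95% cutoff = {cutoff_95:.2f}; null median = {null_median:.2f}")
```

With these made-up degrees of freedom, the observed values land above the null’s median yet below the cutoff, which is the region described above: a failed test, but not evidence for “no effect”.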
And I’ve repeatedly told you that you should’ve focused your critique on this instead of ranting about deduction. The last time I said it, you claimed the following:
Now to answer your question:
I haven’t been discussing this study, I’ve been trying to help you understand why your critique of it has been misguided.
As for this claim you undoubtedly have an interesting “proof” for, I’ve simply avoided confusing you further with a discussion of statistics until you realized the following:
All statistical conclusions are deductively wrong.
A statistical study must be critiqued for its misuse of statistics (and obviously, then you must first claim that there is something statistically wrong with the paper).
Moved to Discussion. (Again.)
why is it back in main?
I assume Phil reposted it there. Now banning.
Phil says it wasn’t him above. I’d be somewhat surprised if that was a barefaced lie.
Problem already solved. I had noticed this subthread didn’t have acknowledgement of the resolution and considered whether it was necessary for me to post a note saying so. I decided that would be more spammy than helpful so I didn’t. Error!
Thank you!
Wasn’t me.
The problem is that you don’t understand the purpose of the studies at all and you’re violating several important principles which need to be kept in mind when applying logic to the real world.
Our primary goal is to determine net harm or benefit. If I do a study as to whether or not something causes harm or benefit, and see no change in underlying rates, then it is non-harmful. If it is making some people slightly more likely to get cancer, and others slightly less likely to get cancer, then there’s no net harm—there are just as many cancers as there were before. I may have changed the distribution of cancers in the population, but I have certainly not caused any net harm to the population.
This study’s purpose is to look at the net effect of the treatment. If we see the same amount of hyperactivity in the population prior to and after the study, then we cannot say that the dye causes hyperactivity in the general population.
“But,” you complain, “Clearly some people are being harmed!” Well yes, some people are worse off after the treatment in such a theoretical case. But here’s the key: for the effect NOT to show up in the general population, then you have only three major possibilities:
1) The people who are harmed are such a small portion of the population as to be statistically irrelevant.
2) There are just as many people who are benefitting from the treatment and as such NOT suffering from the metric in question, who would be otherwise, as there are people who would not be suffering from the metric without the treatment but are as a result of it. (this is extremely unlikely, as the magnitude of the effects would have to be extremely close to cancel out in this manner)
3) There is no effect.
If our purpose is to make *the best possible decision with the least possible amount of money spent* (as it should always be), then a study on the net effect is the most efficient way of doing so. Testing every single possible SNP substitution is not possible, ergo, it is an irrational way to perform a study on the effects of anything. The only reason you would do such a study is if you had good reason to believe that a specific substitution had an effect either way.
Another major problem you run into when you try to run studies “your way” (more commonly known as “the wrong way”) is the blue M&M problem. You see, if you take even 10 things, and test them for an effect, you have a 40% chance of finding at least one false correlation. This means that in order to have a high degree of confidence in the results of your study, you must increase the threshold for detection—massively. Not only do you have to account for the fact that you’re testing more things, you also have to account for all the studies that don’t get published which would contradict your findings (publication bias—people are far more likely to report positive effects than non-effects).
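(The arithmetic behind that figure, assuming ten independent tests each run at the 5% level:)

```python
for k in (1, 5, 10, 20):
    print(f"{k:2d} independent tests at alpha=0.05: "
          f"P(at least one false positive) = {1 - 0.95 ** k:.2f}")
# 10 tests -> about 0.40, the "40% chance" mentioned above.
```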
In other words, you are not actually making a rational criticism of these studies. In fact, you can see exactly where you go wrong:
“If 10% of kids become more hyperactive and 10% become less hyperactive after eating food coloring, such a methodology will never, ever detect it.”
While possible, how *likely* is this? The answer is “Not very.” And given Occam’s Razor, we can mostly discard this barring evidence to the contrary. And no, moronic parents are not evidence to the contrary; you will find all sorts of idiots who claim that all sorts of things that don’t do anything do something. Anecdotes are not evidence.
This is a good example of someone trying to apply logic without actually trying to understand what the underlying problem is. Without understanding what is going on in the first place, you’re in real trouble.
I will note that your specific example is flawed in any case; the idea that these people are in fact being affected is deeply controversial, and unfortunately a lot of it seems to involve the eternal crazy train (choo choo!) that somehow, magically, artificially produced things are more harmful than “naturally” produced things. Unfortunately this is largely based on the (obviously false and irrational) premise that things which are natural are somehow good for you, or things which are “artificial” are bad for you—something which has utterly failed to have been substantiated by and large. You should always automatically be deeply suspect of any such people, especially when you see “parents claim”.
The reason that the FDA says that food dyes are okay is because there is no evidence to the contrary. Food dye does not cause hyperactivity according to numerous studies, and in fact the studies that fail to show the effect are massively more convincing than those which do due to publication bias and the weakness of the studies which claim positive effects.
Correct. But neither can we say that the dye does not cause hyperactivity in anyone.
Like that. That’s what we can’t say from the result of this study, and some other similar studies. For the reasons I explained in detail above.
Your making the claim “no evidence to the contrary” shows that you have not read the literature, have not done a PubMed search on “ADHD, food dye”, and have no familiarity with toxicity studies in general. There is always evidence to the contrary. An evaluation weighs the evidence on both sides. You can take any case where the FDA has said “There is no evidence that X”, and look up the notes from the panel they held where they considered the evidence for X and decided that the evidence against X outweighed it.
If you believe that there is no evidence that food dyes cause hyperactivity, fine. That is not the point of this post. This post analyzes the use of a statistical test in one study, and shows that it was used incorrectly to justify a conclusion which the data does not justify.
(A) I analyzed their use of math and logic in an attempt to prove a conclusion, and showed that they used them incorrectly and their conclusions are therefore not logically correct. They have not proven what they claim to have proven.
(B) The answer is, “This is very likely.” This is how studies turn out all the time, partly due to genetics. Different people have different genetics, different bacteria in their gut, different lifestyles, etc. This makes them metabolize food differently. It makes their brain chemistry different. Different people are different.
That’s one of the problems I was pointing out! The F-test did not pass the threshold for detection. The threshold is set so that things that pass it are considered to be proven, NOT so that things that don’t pass it are considered disproven. Because of the peculiar nature of an F-test, not passing the threshold is not even weak evidence that the hypothesis being tested is false.
People aren’t that different. I really doubt that, for example, there are people whose driving skills improve after drinking the amount of alcohol contained in six cans of beer.
You haven’t searched hard:
Consider the negative effects of high nervousness on driving skills, the nervousness-reducing effects of alcohol, the side effects of alcohol withdrawal on alcoholics, and the mediating effects of high body mass on the effects of alcohol:
A severely obese alcoholic who is nervous enough about driving and suffering from the shakes might perform worse stone-cold sober than he does with the moderate BAC that he has after drinking a six-pack.
What are the odds that there exists at least one sufficiently obese alcoholic who is nervous about driving?
That data point would not provide notable evidence that alcohol improves driving in the general population.
The phrase “There is no evidence that X” is the single best indicator of someone statistically deluded or dishonest.
I’d normally take “evidence that [clause]” or “evidence for [noun phrase]” to mean ‘(non-negligible) positive net evidence’. (But of course that can still be a lie, or the result of motivated cognition.) If I’m talking about evidence of either sign, I’d say “evidence whether [clause]” or “evidence about [noun phrase]”.
I think your usage is idiosyncratic. People routinely talk about evidence for and against, and evidence for is not the net, but the evidence in favor.
It’s quite standard to talk about evidence for and against a proposition in exactly this way, as he reports the FDA did. Having talked about “the evidence for” and weighing against the “evidence against”, you don’t then deny the existence of the “evidence for” just because, in balance, you find the evidence against more convincing.
You’re slicing the language so thinly, and in such a nonstandard way, that it seems like rationalization and motivated reasoning. No evidence means no evidence. No means no. It can mean “very very little” too, fine. But it doesn’t mean “an appreciable amount that has a greater countervailing amount”.
But here the FDA has taken “The balance of the evidence is not enough for us to be sure enough” and said “There is no evidence for”. The evidence cited as “no evidence” should move the estimate toward 84% certainty that there is an effect in the general population.
Very good point.
In this case, honest eyeballing of the data would lead one to conclude that there is an effect.
There actually isn’t any evidence against an effect hypothesis, because they’re not testing an effect hypothesis for falsification at all. There just isn’t enough evidence against the null by their arbitrarily too high standard.
And this is the standard statistical test in medicine, whereby people think they’re being rigorously scientific. Still just 2 chromosomes away from chimpanzees.
This is why you never eyeball data. Humans are terrible at understanding randomness. This is why statistical analysis is so important.
Something that is at 84% is not at 95%, which is a low level of confidence to begin with - it is a nice rule of thumb, but really if you’re doing studies like this you want to crank it up even further to deal with problems like publication bias. Ideally, publish regardless of whether you find an effect or not, and encourage others to do the same.
Publication bias (positive results are much more likely to be reported than negative results) further hurts your ability to draw conclusions.
The reason that the FDA said what they did is that there isn’t evidence to suggest that it does anything. If you don’t have statistical significance, then you don’t really have anything, even if your eyes tell you otherwise.
Some are more terrible than others. A little bit of learning is a dangerous thing. Grown ups eyeball their data and know the limits of standard hypothesis testing.
Yeah, evidence that the FDA doesn’t accept doesn’t exist.
The people who believe that they are grown-ups who can eyeball their data and claim results which fly in the face of statistical rigor are almost invariably the people who are unable to do so. I have seen this time and again, and Dunning-Kruger suggests the same—the least able are very likely to do this based on the idea that they are better able to do it than most, whereas the most able people will look at it and then try to figure out why they’re wrong, and consider redoing the study if they feel that there might be a hidden effect which their present data pool is insufficient to note. However, repeating your experiment is always dangerous if you are looking for an outcome (repeating your experiment until you get the result you want is bad practice, especially if you don’t adjust things so that you are looking for a level of statistical rigor that is sufficient to compensate for the fact that you’re doing it over again), so you have to keep it very carefully in mind and control your experiment and set your expectations accordingly.
The problem we started with was that “statistical rigor” is generally not rigorous. Those employing it don’t know what it would mean under the assumptions of the test, and fewer still know that the assumptions make little sense.
[quote]Correct. But neither can we say that the dye does not cause hyperactivity in anyone.[/quote]
No, but that is not our goal in the first place. Doing a test on every single possible trait is economically infeasible and unreasonable; ergo, net impact is our best metric.
The benefit is “we get a new food additive to use”.
The net cost is zero in terms of health impact (no more hyperactivity in the general population).
Ergo, the net benefit is a new food additive. This is very simple math here. Net benefit is what we care about in this case, as it is what we are studying. If it redistributes ailments amongst the population, then there may be even more optimal uses, but we’re still looking at a benefit.
If you want to delve deeper, that’s going to be a separate experiment.
[quote]Your making the claim “no evidence to the contrary” shows that you have not read the literature, have not done a PubMed search on “ADHD, food dye”, and have no familiarity with toxicity studies in general. There is always evidence to the contrary. An evaluation weighs the evidence on both sides. You can take any case where the FDA has said “There is no evidence that X”, and look up the notes from the panel they held where they considered the evidence for X and decided that the evidence against X outweighed it.[/quote]
Your making the claim “evidence to the contrary” suggests that any of this is worth anything. The problem is that, unfortunately, it isn’t.
If someone does a study on 20 different colors of M&Ms, then they will, on average, find that one of the M&Ms will change someone’s cancer risk. The fact that their study showed that, with 95% confidence, blue M&Ms increased your odds of getting cancer, [b]is not evidence for the idea that blue M&M’s cause cancer[/b].
Worse, the odds of the negative-finding studies being published are considerably lower than the odds of the positive-finding study being published. This is known as “publication bias”. Additionally, people are more likely to be biased against artificial additives than towards them, particularly “independent researchers”, who very likely are researching it precisely because they harbor the belief that it does in fact have an effect.
This is very basic and absolutely essential to understanding data of this sort. When I say that there is no evidence for it, I am saying precisely that—the fact that someone studied 20 colors of M&M’s and found, at the 95% confidence level, that one causes more cancer tells me nothing. It isn’t evidence for anything. It is entirely possible that it DOES cause cancer, but the study has failed to provide me with evidence of that fact.
You are thinking in terms of formal logic, but that is not how science works. If you lack sufficient evidence to invalidate the null hypothesis, then you don’t have evidence. And the problem is that a mere study is often insufficient to actually demonstrate it unless the effects are extremely blatant.
[quote]The answer is, “This is very likely.” This is how studies turn out all the time, partly due to genetics. Different people have different genetics, different bacteria in their gut, different lifestyles, etc. This makes them metabolize food differently. It makes their brain chemistry different. Different people are different.[/quote]
For this to happen, you would require the helped and harmed groups to be very similar in size and effect magnitude on both ends.
Is it possible for things to help one person and harm another? Absolutely.
Is it probable that something will help almost exactly as many people as it harms? No. Especially not some random genetic trait (there are genetic traits, such as sex, where this IS likely because it is an even split in the population, so you do have to be careful for that, but sex-dependence of results is pretty obvious).
The probability of equal distribution of the traits is vastly outweighed by the probability of it not being equally distributed. Ergo the result you are espousing is in fact extremely unlikely.
[quote]This is very basic and absolutely essential to understanding data of this sort. When I say that there is no evidence for it, I am saying precisely that—the fact that someone studied 20 colors of M&M’s and found, at the 95% confidence level, that one causes more cancer tells me nothing. It isn’t evidence for anything. It is entirely possible that it DOES cause cancer, but the study has failed to provide me with evidence of that fact.[/quote]
When I said that “making the claim “no evidence to the contrary” shows that you have not read the literature, have not done a PubMed search on “ADHD, food dye”, and have no familiarity with toxicity studies in general,” I meant that literally. I’m well-aware of what 95% means and what publication bias means. If you had read the literature on ADHD and food dye, you would see that it is closer to a 50-50 split between studies concluding that there is or is not an effect on hyperactivity. You would know that some particular food dyes, e.g., tartrazine, are more controversial than others. You would also find that over the past 40 years, the list of food dyes claimed not to be toxic by the FDA and their European counterparts has been shrinking.
If you were familiar with toxicity studies in general, you would know that this is usually the case for any controversial substance. For instance, the FDA says there is “no evidence” that aspartame is toxic, and yet something like 75% of independent studies of aspartame concluded that it was toxic. The phrase “no evidence of toxicity”, when used by the FDA, is shorthand for something like “meta-analysis does not provide us with a single consistent toxicity narrative that conforms to our prior expectations”. You would also know that toxicity studies are frequently funded by the companies trying to sell the product being tested, and so publication bias works strongly against findings of toxicity.
Suppose there exists a medication that kills 10% of the rationalists who take it (but kills nobody of other thought patterns), and saves the lives of 10% of the people who take it, but only by preventing a specific type of heart disease that is equally prevalent in rationalists as in the general population.
A study on the general population would show benefits, while a study on rationalists would show no effects, and a study on people at high risk for a specific type of heart disease would show greater benefits.
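A quick back-of-the-envelope version of that scenario, using the hypothetical 10% figures above plus assumed prevalences (the 2% share of rationalists and the 50% risk in the high-risk group are made up purely for illustration):

```python
# Hypothetical numbers from the example above, plus assumed prevalences used
# purely for illustration: the drug kills 10% of rationalist takers and saves
# takers who would otherwise die of one specific heart disease.

def net_lives_saved(frac_rationalist, frac_would_die_of_disease):
    """Net fraction of takers whose lives are saved (negative = net harm)."""
    saved = frac_would_die_of_disease        # all such deaths are prevented
    killed = 0.10 * frac_rationalist         # 10% of rationalist takers die
    return saved - killed

print("general population study:", net_lives_saved(0.02, 0.10))  # net benefit
print("rationalists-only study: ", net_lives_saved(1.00, 0.10))  # no net effect
print("high-risk-patient study: ", net_lives_saved(0.02, 0.50))  # larger benefit
```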
Food dye is allegedly less than 95% likely to cause hyperactivity in the general population. It has been alleged to be shown that it is more than 95% likely to cause hyperactivity in specific subgroups. It is possible for both allegations to be true.
Yes, but it is not a probable outcome, as for it to be true, it would require a counterbalancing group of people who benefit from it or for the subgroups to be extremely small; however, the allegations are that the subgroups are NOT small enough that the effect could have been hidden in this manner, suggesting that there is no effect on said subgroups as the other possibility is unlikely.
Strictly speaking, the subgroup in question only has to be one person smaller than everybody for those two statements to be compatible.
Suppose that there is no effect on 10% of the population, and a consistent effect in 90% of the population that just barely meets the p<.05 standard when measured using that subgroup. If that measurement is made using the whole population, p>.05.
95% is an arbitrarily chosen number which is a rule of thumb. Very frequently you will see people doing further investigation into things where p>0.10, or if they simply feel like there was something interesting worth monitoring. This is, of course, a major cause of publication bias, but it is not unreasonable or irrational behavior.
If the effect is really so minor it is going to be extremely difficult to measure in the first place, especially if there is background noise.
It’s not a rule of thumb; it’s used as the primary factor in making policy decisions incorrectly. In this specific example, the regulatory agency made the statement “There is no evidence that artificial colorings are linked to hyperactivity” based on the data that artificial colorings are linked to hyperactivity with p~.13
There are many other cases in medicine where 0.05 < p < 0.5 is used as evidence against the hypothesis P.
I’ve similarly griped in the past about the mistaken ways medical tests are analyzed, here and elsewhere, but I think you overcomplicated things.
The fundamental error is misinterpreting a failure to reject a null hypothesis for a particular statistical test, a particular population, and a particular treatment regime as a generalized demonstration of the null hypothesis that the medication “doesn’t work”. And yes, you see it very often, and almost universally in press accounts.
You make a good point about how modeling response = effect + error leads to confusion. I think the mistake is clearer written as “response = effect + noise”, where noise is taken as a random process injecting ontologically inscrutable perturbations of the response. If you start with the assumption that all differences from the mean effect are due to ontologically inscrutable magic, you’ve ruled out any analysis of that variation by construction.
OK, I may be dense today, but you lost me there. I tried to puzzle out how the raven sentences could be put symbolically so that they each corresponded to one of the negations of your original logic sentence, and found that fruitless. Please clarify?
The rest of the post made sense. I’ll read through the comments and figure out why people seem to be disagreeing first, which will give me time to think whether to upvote.
First, we start with the symbolic statement:
Next, we replace the variables with English names:
Next, we replace the symbols with English phrases:
Then we clean up the English:
We can repeat the process with the other sentence, being careful to use the same words when we replace the variables:
becomes
becomes
and finally:
(I should note that my English interpretation of ∃y P(x,y) is probably a bit different and more compact than PhilGoetz’s, but I think that’s a linguistic rather than logical difference.)
You certainly gave me the most-favorable interpretation. But I just goofed. I fixed it above. This is what I was thinking, but my mind wanted to put “black(x)” in there because that’s what you do with ravens in symbolic logic.
A) Not everything is a raven: !∀x raven(x)
B) Everything is not a raven: ∀x !raven(x)
The new version is much clearer. My interpretation of the old version was that y was something like “attribute,” so you could say “Not every black thing has being a raven as one of its attributes” or “for every black thing, it does not have an attribute which is being a raven.” Both of those are fairly torturous sentences in English but the logic looks the same.
That’s where I don’t follow. I read the original sentence as “for every x there is an y such that the relationship P obtains between x and y”. I’m OK with your assigning “black things” to x but “raven-nature” needs explanation; I don’t see how to parse it as a relationship between two things previously introduced.
The edited version makes more sense to me now.
You’re right! I goofed on that example. I will change it to a correct example.
If 11 out of 11 children studied have a property (no food coloring hyperactivity response), that’s a bit stronger than “there exist 11 children with this property”, though perhaps not quite “all children have this property”.
That’s not how it works. You measure the magnitude of an effect, then do a statistical test of the hypothesis that all of the children have a response, which gives a cutoff that the effect magnitude must reach to accept that hypothesis with 95% confidence. If only 10% of the children have such a response, you won’t reach that cutoff. If 10% have a positive response and 10% have a negative response, you will detect nothing, no matter how big your sample is.
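A minimal simulation of that last case, with made-up effect sizes and noise, shows why sample size doesn’t rescue you when the responses cancel:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100_000                         # even a huge sample per group doesn't help

def hyperactivity_scores(gets_dye):
    baseline = rng.normal(0, 1, n)
    if not gets_dye:
        return baseline
    # 10% of children react positively, 10% negatively, 80% not at all.
    responder = rng.choice([1, 0, -1], size=n, p=[0.10, 0.80, 0.10])
    return baseline + 0.5 * responder

control = hyperactivity_scores(False)
treated = hyperactivity_scores(True)
t, p = stats.ttest_ind(treated, control)
print(f"difference in means = {treated.mean() - control.mean():+.4f}, p = {p:.2f}")
# The opposite-signed responses cancel, so the group means are equal in
# expectation; the test rejects only at its nominal 5% false-positive rate,
# no matter how large n gets.
```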
Or rather, you can conclude that, if there were no effect of food dye on hyperactivity and we did this test a whole lotta times, then we’d get data like this 16% of the time, rather than beneath the 5%-of-the-time maximum cutoff you were hoping for.
It’s not so easy to jump from frequentist confidence intervals to confidence for or against a hypothesis. We’d need a bunch of assumptions. I don’t have access to the original article so I’ll just make shit up. Specifically, if I assume that we got the 84% confidence interval from a normal distribution in which it was centrally located and two-tailed, then the corresponding minimum Bayes Factor is 0.37 for the model {mean hyperactivity = baseline} versus the model {mean hyperactivity = baseline + food dye effect}. Getting to an actual confidence level in the hypothesis requires having a prior. Since I’m too ignorant of the subject material to have an intuitive sense of the appropriate prior, I’ll go with my usual here which is to charge 1 nat per parameter as a complexity penalty. And that weak complexity prior wipes out the evidence from this study.
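For anyone who wants to check the arithmetic, here is a sketch of the minimum-Bayes-factor calculation under those same assumptions (two-tailed p-value, normal approximation):

```python
import math
from scipy import stats

def min_bayes_factor(p_two_tailed):
    """Minimum Bayes factor for the null against the best-supported point
    alternative, under a two-tailed normal approximation: exp(-z**2 / 2)."""
    z = stats.norm.ppf(1 - p_two_tailed / 2)
    return math.exp(-z * z / 2)

for p in (0.16, 0.05):
    bf = min_bayes_factor(p)
    print(f"p = {p:.2f}: min BF = {bf:.2f}, "
          f"evidence against the null = {-math.log(bf):.2f} nats")
# p = 0.16 gives a minimum BF of about 0.37 (about 1 nat of evidence);
# p = 0.05 gives about 0.15 (about 1.9 nats).
```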
So given these assumptions, the original article’s claim...
...would be correct.
All you’re saying is that studies should use Bayesian statistics. No medical journal articles use Bayesian statistics.
Given that the frequentist approach behind these tests is “correct”, the article’s claim is incorrect. The authors intended to use frequentist statistics, and so they made an error.
If a weak default complexity prior of 1 nat for 1 extra variable wipes out 84% confidence, that implies that many articles have incorrect conclusions, because 95% confidence might not be enough to account for a one-variable complexity penalty.
In any case, you are still incorrect, because your penalty cannot prove that the null hypothesis is correct. It can only make it harder to prove it’s incorrect. Failure to prove that it is incorrect is not proof that it is correct. Which is a key point of this post.
Nah, they’re welcome to use whichever statistics they like. We might point out interpretation errors, though, if they make any.
Under the assumptions I described, a p-value of 0.16 is about 0.99 nats of evidence which is essentially canceled by the 1 nat prior. A p-value of 0.05 under the same assumptions would be about 1.92 nats of evidence, so if there’s a lot of published science that matches those assumptions (which is dubious), then they’re merely weak evidence, not necessarily wrong.
It’s not the job of the complexity penalty to “prove the null hypothesis is correct”. Proving what’s right and what’s wrong is a job for evidence. The penalty was merely a cheap substitute for an informed prior.
I think part of the problem is that there is a single confidence threshold, usually 90%. The problem is that setting the threshold high enough to compensate for random flukes and file-drawer effects causes problems when people start interpreting a result that falls just epsilon short of the threshold as proof of the null hypothesis. Maybe it would be better to have two thresholds, with results between them interpreted as inconclusive.
That is part of the problem. If it weren’t for using a cutoff, then it would be the case that “proving” !∀x P(x) with high confidence would be evidence for “for many x, !P(x)”, as several of the comments below are claiming.
But even if they’d used some kind of Bayesian approach, assuming that all children are identical would still mean they were measuring evidence about the claim “X affects all Y”, and that evidence could not be used to conclusively refute the claim that X affects some fraction of Y.
Using a cutoff, though, isn’t an error. It’s a non-Bayesian statistical approach that loses a lot of information, but it can give useful answers. It would be difficult to use a Bayesian approach in any food toxicity study, because setting the priors would be a political problem. They did their statistical analysis correctly.
This post makes a point that is both correct and important. It should be in Main.
This post makes a point that is both correct and important. A post that makes this point should be in Main.
The reception of this post indicates that the desired point is not coming through to the target audience. That matters.
No it doesn’t. It takes the word “all” as used in everyday language and pretends it is intended to be precisely the same as the logical “all” operator, which of course it is not. It’s the worst kind of nitpicking, the kind of “all people who have a heart attack should go to a licensed hospital”—“nuh-uh, not if the hospital is on fire / not if they are billionaires with a fully equipped medical team in their attic”.
What on Earth is “important” about such a point?
Not even that. It takes the zero-article plural as used in everyday language and pretends it is intended to be precisely the same as the logical “all” operator, which of course it is not.
But … but … Science?
They tend to be used either for keeping crows from eating your crops or making rivals look bad by misrepresenting them.
I am looking at a claim in a scientific paper. The word “all” in such a claim is universally interpreted by doctors and scientists as being universally quantified. That is how other scientists interpret it when they cite a paper. That is how the FDA treats it when they deny you a drug or a medical procedure.
This is not everyday language. This is a claim to have rigorously proven something.
Even if you don’t focus on the word “all” (which you should, but I accept that you are ignorant of how scientific discourse works), the fact remains that the paper did not provide ANY evidence that food dye does not affect behavior. You can fail an F-test for a hypothesis even with data that supports the hypothesis.
Not universally interpreted by doctors and scientists. I’m gonna go ahead and say that you have no idea what you’re talking about and go off of what you think “all” should mean in ‘all’ the sciences, not what it defaults to in actual medical papers. Context!
No medical publications whatsoever can use the “all” quantifier without restricting the scope, implicitly or explicitly. Whenever you find an “all” quantifier without a restriction specified, that’s at best a lazy omission or at worst an automatic error. What, a parasympathomimetic drug will slow down a subject’s heart rate for all humans? Have you checked them all?
“Scientists” publishing in medicine don’t get all excited (oooh an “all” quantifier) like you whenever they come across a claim that’s unwisely worded using “all” without explicitly restricting the scope.
Bowing out, I’ll leave you the last word if you want it.
This post makes a point that is both correct and important, but Phil has clearly lost much of the audience and is ticked off besides, and I don’t blame him.
I think we’ve got two issues. The general issue of how one tests a null hypothesis and what it does and does not mean to reject the null, and the particular issue of food dyes. The general issue seems important, while the particular could provide a helpful illustration of the general.
But I would think that someone else, and probably multiple someones, have already plowed this ground. Jaynes must have an article on this somewhere.
Anyone got a good article?
Depends on what you want. You could probably get something useful out of my http://lesswrong.com/lw/g13/against_nhst/ collection.
Thanks. Interesting, but it doesn’t really get at the heart of the problem here, of mistaken interpretation of a “failure to reject” result as confirmation of the null hypothesis, thereby privileging the null. That just shouldn’t happen, but often does.
I saw the Gigerenzer 2004 paper (you’re talking about the Null Ritual paper, right?) earlier today, and it rang a few bells. Definitely liked the chart about the delusions surrounding p=0.01. Appalling that even the profs did so poorly.
Gigerenzer has another 2004 paper with a similar theme: “Mindless statistics”, The Journal of Socio-Economics 33 (2004), 587–606. http://people.umass.edu/~bioep740/yr2009/topics/Gigerenzer-jSoc-Econ-1994.pdf
Isn’t that a major criticism of NHST, that almost all users and interpreters of it reverse the conditionality—a fallacy/confusion pointed out by Cohen, Gigerenzer, and almost every paper I cited there?
I think that’s a separate mistake. This paper shows Pr[data|H0] > 0.05. The standard mistake you refer to switches this to falsely conclude Pr[H0|data] > 0.05. However, neither of these is remotely indicative of H0 being true.
Thanks; I was trying to write a comment that said the same thing, but failed to do so.
disagree because not correct.
Phil’s logical interpretation procedure would call shenanigans whether or not the statistical reasoning was correct.
The whole point of statistics is that it can tell us things logic cannot. If there is an important point to be made here, it needs to be made with a statistical analysis, not a logical one.
Logical analysis is a limiting case of statistical analysis, thus problems with logical reasoning have corresponding problems with statistical reasoning. I agree that Phil should have spelled out this distinction explicitly.
Their statistical analysis was correct, modulo their assumptions. They made their logical error in how they interpreted its conclusion.
People. Explain your downvotes of this comment. Do you think their statistical analysis was incorrect? Do you think they made no logical error?
Does what you’re saying here boil down to “failing to reject the null (H0) does not entail rejecting the alternative (H1)”? I have read this before elsewhere, but not framed in quantifier language.
No, it’s more subtle than that. I think it’s more clearly stated in terms of effect sizes. (Down with null hypothesis significance testing!) The study measured the average effect of food dye on hyperactivity in the population and showed it was not distinguishable from zero. The quoted conclusion makes the unfounded assumption that all children can be characterized by that small average effect. This ignores unmeasured confounders, which is another way of phrasing PhilGoetz’s correct (CORRECT, PEOPLE, CORRECT!) point.
The document I linked mentions doing a “sensitivity analysis for the possibility of unmeasured confounding, to see the sorts of changes one could expect if there were such a confounder.” In the above study (assuming PhilGoetz described it correctly; I haven’t read the original paper), the data permitted such a sensitivity analysis. It would have given an upper bound for the effect of the unmeasured confounder as a function of an assumed prevalence of the confound in the population. (A smaller assumed prevalence gives a larger upper bound.) But if you don’t even notice that it’s possible for children to have heterogeneous responses to the treatment, you’ll never even think of doing such a sensitivity analysis.
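A toy version of such a sensitivity analysis, under the simplest assumptions (only a subgroup of prevalence q responds at all, and the study caps the population-average effect at some bound X; the X here is a made-up number):

```python
# If only a fraction q of children respond at all and everyone else has zero
# effect, the population-average effect is q times the subgroup effect, so an
# upper bound X on the average effect implies a bound X / q on the subgroup
# effect. X here is a made-up illustrative number, not taken from the paper.
X = 0.2   # hypothetical 95% upper bound on the average effect, in SD units

for q in (1.0, 0.5, 0.15, 0.05):
    print(f"assumed prevalence {q:4.0%}: subgroup effect could be up to {X / q:.2f} SD")
# The smaller the assumed prevalence, the larger the subgroup effect that the
# data fail to rule out, which is the point of the sensitivity analysis.
```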
Yes, but I also spelled out why I think they’re making that mistake. They’re trying to claim the authority of logic, but not being rigorous and forgetting that logical statements shouldn’t contain unquantified variables.
I think the picture is not actually so grim: the study does reject an entire class of (distributions of) effects on the population.
Specifically, it cannot be the case (with 95% certainty or whatever) that a significant proportion of children are made hyperactive, while the remainder are unaffected. This does leave a few possibilities:
Only a small fraction of the children were affected by the intervention.
Although a significant fraction of the children were affected by the intervention in one direction, the remainder were affected in the opposite direction.
A mix of the two (e.g. a strong positive effect in a few children, and a weak negative effect in many others).
The first possibility would be eliminated by a study with more participants (the smaller the fraction of children affected, the more total children you need to notice).
The second possibility is likely to be missed by the test entirely, since the net effect is much weaker than the net absolute effect. However, careful researchers should notice that the response distribution is bimodal (again, given sufficiently many children). Of course, if the researchers aren’t careful...
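For the first possibility, a rough power sketch (standard normal-approximation formula; every number fed into it is an assumption) shows how fast the required sample size grows as the affected fraction shrinks:

```python
import math
from scipy import stats

def n_per_group(frac_affected, effect_sd, alpha=0.05, power=0.8):
    """Rough two-sample sample size when only a fraction of subjects respond:
    the population-average effect is diluted to frac_affected * effect_sd, so
    the required n grows roughly like 1 / frac_affected**2."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    diluted_effect = frac_affected * effect_sd
    return math.ceil(2 * ((z_a + z_b) / diluted_effect) ** 2)

for frac in (1.0, 0.5, 0.1):
    print(f"{frac:4.0%} affected (0.5 SD each): about {n_per_group(frac, 0.5)} per group")
# Roughly 63 per group if everyone responds, but thousands per group if only
# 10% do, which is why typical study sizes don't rule this possibility out.
```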
Specifically, it cannot be the case, with 95% certainty, that all children are made hyperactive. That is exactly what they proved with their F-tests (though if you look at the raw data, the measures they used of hyperactivity conflicted with each other so often that it’s hard to believe they measured anything at all). They did not prove, for instance, that it cannot be the case, with 95% certainty, that all but one of the children are made hyperactive.
Yet they claimed, as I quoted, that they proved that no children are made hyperactive. It’s a logic error with large consequences in healthcare and in other domains.
You’re correct that the study data is useful and rules out some possibilities. But the claim they made in their summary is much stronger than what they showed.
They did not say this but I am confident that if this bizarre hypothesis (all but one of what group of children, exactly?) were tested, the test would reject it as well. (Ignoring the conflicting-measures-of-hyperactivity point, which I am not competent to judge.)
In general, the F-test does not reject all alternate hypotheses equally, which is a problem but a different one. However, it provides evidence against all hypotheses that imply an aggregate difference between the test group and control group: equivalently, we’re testing if the means are different.
If all children are made hyperactive, the means will be different, and the study rejects this hypothesis. But if 99% of children are made hyperactive to the same degree, the means will be different by almost the same amount, and the test would also reject this hypothesis, though not as strongly. I don’t care how you wish to do the calculations, but any hypothesis that suggests the means are different is in fact made less likely by the study.
And hence my list of alternate hypotheses that may be worth considering, and are not penalized as much as others. Let’s recap:
If the effect is present but weak, we expect the means to be close to equal, so the statistical test results don’t falsify this hypothesis. However, we also don’t care about effects that are uniformly weak.
If the effect is strong but present in a small fraction of the population, the means will also be close to equal, and we do care about such an effect. Quantifying “strong” lets us quantify “small”.
We can allow the effect to be strong and present in a larger fraction of the population, if we suppose that some or all of the remaining children are actually affected negatively.
This is math. You can’t say “If 2+2 = 4, then 2+1.9 = 4.” There is no “as strongly” being reported here. There is only accept or reject.
The study rejects a hypothesis using a specific number that was computed using the assumption that the effect is the same in all children. That specific number is not the correct number to reject the hypothesis that the effect is the same in all but one.
It might so happen that the data used in the study would reject that hypothesis, if the correct threshold for it were computed. But the study did not do that, so it cannot claim to have proven that.
The reality in this case is that food dye promotes hyperactivity in around 15% of children. The correct F-value threshold to reject that hypothesis would be much, much lower!
I don’t think we actually disagree.
Edit: Nor does reality disagree with either of us.
You’re correct in a broader sense that passing the F-test under one set of assumptions is strong evidence that you’ll pass it with a similar set of assumptions. But papers such as this use logic and math in order to say things precisely, and while what they claimed is supported, and similar to, what they proved, it isn’t the same thing, so it’s still an error, just as 3.9 is similar to 4 for most purposes, but it is an error to say that 2 + 1.9 = 4.
The thing is, some such reasoning has to be done in any case to interpret the paper. Even if no logical mistake was made, the F-test can’t possibly disprove a hypothesis such as “the means of these two distributions are different”. There is always room for an epsilon difference in the means to be compatible with the data. A similar objection was stated elsewhere on this thread already:
And of course it’s legitimate to give up at this step and say “the null hypothesis has not been rejected, so we have nothing to say”. But if we don’t do this, then our only recourse is to say something like: “with 95% certainty, the difference in means is less than X”. In other words, we may be fairly certain that 2 + 1.9 is less than 5, and we’re a bit less certain that 2 + 1.9 is less than 4, as well.
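That kind of statement can be read off a confidence bound on the difference in means; here is a minimal sketch with made-up data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(0.0, 1.0, 40)      # made-up hyperactivity scores
treated = rng.normal(0.1, 1.0, 40)

diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated) + control.var(ddof=1) / len(control))
upper = diff + stats.t.ppf(0.95, df=len(treated) + len(control) - 2) * se

print(f"observed difference = {diff:+.2f}")
print(f"with 95% confidence, the true difference in means is less than {upper:.2f}")
# Instead of "we failed to reject, so there is no effect", this reports the
# largest effect the data still leave plausible, which may be far from zero.
```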
Incidentally, is there some standard statistical test that produces this kind of output?
When people do studies of the effects of food coloring on children, are the children blindfolded?
That is, can the studies discern the neurochemical effects of coloring molecules from the psychological effects of eating brightly-colored food?
I expect that beige cookies are not as exciting as vividly orange cookies.
My read of the Mattes & Gittelman paper is that they’re comparing natural and artificial food coloring.
I think that should be: the tests compute the difference in magnitude of response such that, if the null hypothesis is true, then 95% of the time the measured difference will not be that large.
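In code, that cutoff is just a critical value times the standard error; a sketch with assumed numbers:

```python
from scipy import stats

# Assumed numbers: the standard error of the measured group difference and
# the degrees of freedom are illustrative, not taken from any study.
se, df = 0.3, 38

cutoff = stats.t.ppf(0.95, df) * se     # one-tailed 95% critical difference
print(f"if the null is true, the measured difference stays below "
      f"{cutoff:.2f} about 95% of the time")
```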
Frequentist statistics cannot make the claim that with some probabilty the null hypothesis is true or false. Ever. You must have a prior and invoke Bayes theorem to do that.
I’m not as interested in proving my point, as in figuring out why people resist it so strongly. It seems people are eager to disagree with me and reluctant to agree with me.
How did the post make you feel, and why?
It’s not just their feelings, it’s their priors.
I’ve found previously that many people here are extremely hostile to criticisms of the statistical methods of the medical establishment. It’s extremely odd at a site that puts Jaynes on a pedestal, as no one rants more loudly and makes the case clearer than Jaynes did, but there it is.
But consider if you’re not a statistician. You’re not into the foundations of statistical inference. You haven’t read Jaynes. Maybe you’ve had one semester of statistics in your life. When you’re taught hypothesis testing, you’re taught a method. That method is statistics. There’s no discussion about foundations. And you look at medical journals. This is how they do it. This is how science is done. And if you’re a rationalist, you’re on Team Science.
Elsewhere in the thread, there are links to a Gigerenzer paper showing how statistics students and their professors are making fundamental errors in their interpretations of the results of confidence interval testing. If stat professors can’t get it right, the number of people who have any notion that there is a possibility of an issue is vanishingly small. Higher here than usual, but still a minority.
Meanwhile, you’ve shown up and attacked Team Science in general in medicine. To add the cherry on top, you did it in the context of exactly the kind of issue that Team Science most rants about—some “anecdotal hysteria” where parents are going ballistic about some substance that is supposedly harming their precious little lumpkins. But everyone knows there is nothing wrong with food dyes. They’ve been around for decades. The authorities have tested them and declared them safe. There is no evidence for the claim, and it’s harmful to be drumming up these scares.
My PhD was in EE, doing statistical inference in a machine learning context. Rather unsophisticated stuff, but that’s something of the point. Science is not an integrated whole where the best thoughts, best practices, and best methods are used everywhere instantaneously. It takes decades, and sometimes wrong turns are taken. I’m talking to professional data analysts in big money companies, and they’ve never even heard of guys taken as canonical authorities in machine learning circles. I was reading Jaynes when his book was a series of postscript files floating around the web. The intro graduate classes I took from the stat department didn’t discuss any of these issues at all. I don’t think any class did. Anything I got in that regard I got from reading journal articles, academic mailing lists, and thinking about it myself. How many people have done that?
You’re saying that medical science as it is done sucks. You’re right, but what do you think the prior is on Team Science that Phil Goetz is right, and Team Science is wrong?
(Eliezer does, anyway. I can’t say I see very many quotes or invocations from others.)
I am hostile to some criticisms, because in some cases when I see them being done online, it’s not in the spirit of ‘let us understand how these methods make this research fundamentally flawed, what this implies, and how much we can actually extract from this research’*, but in the spirit of ‘the earth is actually not spherical but an oblate spheroid thus you have been educated stupid and Time has Four Corners!’ Because the standard work has flaws, they feel free to jump to whatever random bullshit they like best. ‘Everything is true, nothing is forbidden.’
* eg. although extreme and much more work than I realistically expect anyone to do, I regard my dual n-back meta-analysis as a model of how to react to potentially valid criticisms. Instead of learning that passive control groups are a serious methodological issue which may be inflating the effect and then going around making that point at every discussion, and saying ‘if you really want to increase intelligence, you ought to try this new krill oil stuff!’, I compiled studies while noting whether they were active or passive, and eventually produced a meta-analytic regression confirming the prediction.
(And of course, sometimes I just think the criticism is being cargo-culted and doesn’t apply; an example on this very page is tenlier’s criticism of the lack of multiple testing correction, where he is parroting a ‘sophisticated’ criticism that does indeed undermine countless papers and lead to serious problems but the criticism simply isn’t right when he uses it, and when I explained in detail why he was misapplying it, he had the gall to sign off expressing disappointment in LW’s competency!)
Such criticism for most people seems to only enable confirmation bias or one-sided skepticism, along the lines of knowing about biases can hurt you.
A great example here is Seth Roberts. You can read his blog and see him posting a list of very acute questions for Posit Science on the actual validity of a study they ran on brain training, probing aspects like the endpoints, possibility of data mining, publication bias or selection effects and so on, and indeed, their response is pretty lame because it’s become increasingly clear that brain training does not cause far transfer and the methodological weaknesses are indeed driving some of the effects. And then you can flip the page and see him turn off his brain and repeat uncritically whatever random anecdote someone has emailed him that day, and make incredibly stupid claims like ‘evidence that butter is good for your brain comes from Indian religion’s veneration for ghee’ (and never mind that this is pulled out of countless random claims from all sorts of folk medicines; or that Ayurvedic medicine is completely groundless, random crap, and has problems with heavy metal poisoning!), and if you try sending him a null result on one of his pet effects, he won’t post it. It’s like there’s two different Roberts.
Put Jaynes on a pedestal, you mean?
Hmmmn. I had that problem before with Korzybski. I saw The Map is Not the Territory, and assumed people were familiar with Korzybski and his book Science and Sanity. Turns out even Eliezer hadn’t read it, and he got his general semantics influence from secondary writers such as Hayakawa. I’ve only found 1 guy here with confirmed knowledge of a breadth of general semantics literature.
How many people do you think have read substantial portions of Jaynes book? Have you?
Yes. To put it one way, a site search for ‘Jaynes’ (which will hit other people sometimes discussed on LW, like the Bicameral Mind Jaynes) turns up 718 hits; in contrast, to name some other statisticians or like-minded folks - ‘Ioannidis’ (which is hard to spell) turns up 89 results, ‘Cochrane’ 57, ‘Fisher’ 127, ‘Cohen’ 193, ‘Gelman’ 128, ‘Shalizi’ 109… So apparently in the LW-pantheon-of-statisticians-off-the-top-of-my-head, Jaynes can barely muster a majority (718 vs 703). For someone on a pedestal, he just isn’t discussed much.
Most of those in the book reading clubs fail, he is rarely quoted or cited… <5%.
I bought a copy (sitting on my table now, actually), read up to chapter 4, some sections of other chapters that were interesting, and concluded that a number of reviews were correct in claiming it was not the best introduction for a naive non-statistician.
So I’ve been working through other courses/papers/books and running experiments and doing analyses of my own to learn statistics. I do plan to go back to Jaynes, but only once I have some more learning under my belt—the Probabilistic Graphical Models Coursera is starting today, and I’m going to see if I can handle it, and after that I’m going to look through and pick one of Kruschke’s Doing Bayesian Data Analysis, Sivia’s Data Analysis: A Bayesian Tutorial, Bolstad’s Introduction to Bayesian Statistics, and Albert’s Bayesian Computation with R. But we’ll see how things actually go.
The problem is that it’s very hard to change your world view or even to coherently understand the worldview of someone else. Understanding that you might be wrong about things you take for granted is hard.
Among new atheists even the notion that the nature of truth is up for discussion is a very threatening question.
Even if they would read Jaynes from cover to cover, they take the notion of truth they learned as children for granted and don’t think deeply about where Jaynes’s notion of truth differs from their own.
The discussion about Bayesianism with David Chapman illustrates how he and senior LW people didn’t even get clear about the points on which they disagree.
I don’t know if it’s threatening, and I doubt that it applies to Dennett, but the other guys can’t seem to even conceive of truth beyond correspondence.
But if it’s a matter of people being open to changing their world view, to even understanding that they have one, and other people have other world views, it’s Korzybski they need to read, not Jaynes.
The guy with the blog is Chapman?
I don’t see a discussion. I see a pretty good video, and blog comments that I don’t see any value at all in. I had characterized them more colorfully, but seeing that Chapman is on the list, I decided to remove the color.
I’m not trying to be rude here, but his comments are just very wrong about probability, and thereby entirely clueless about the people he is criticizing.
As an example
No! Probability as inference most decidedly is not “just arithmetic”. Math tells you nothing axiomatically about the world. All our various mathematics are conceptual structures that may or may not be useful in the world.
That’s where Jaynes, and I guess Cox before him, adds in the magic. Jaynes doesn’t proceed axiomatically. He starts with problem of representing confidence in a computer, and proceeds to show how the solution to that problem entails certain mathematics. He doesn’t proceed by “proof by axiomatic definitions”, he shows that the conceptual structures work for the problem attacked.
Also, in Jaynes’s presentation of probability theory as an extension of logic, P(A|B) isn’t axiomatically defined as P(AB)/P(B); it is the mathematical value assigned to the plausibility of a proposition A given that proposition B is taken to be true. It’s not about counting, it’s about reasoning about the truth of propositions given our knowledge.
I guess if he’s failing utterly to understand what people are talking about, what they’re saying might look like ritual incantation to him. I’m sure it is for some people.
Is there some reason I should take David Chapman as particularly authoritative? Why do you find his disagreement with senior LW people of particular note?
Because senior LW people spent effort in replying to him. The post led to LW posts such as “What Bayesianism taught me”. Scott Alexander wrote in response: “On first looking into Chapman’s pop-Bayesianism”. Kaj Sotala had a lively exchange in the comments of that article.
I think in total that exchange provides a foundation for clearing the question of what Bayesianism is. I do consider that an important question.
As far as authority goes David Chapman did publish academic papers about artificial intelligence. He did develop solutions for previously unsolved AI problems. When he says that there’s no sign of Bayes axiom in the code that he used to solve an AI problem he just might be right.
Dennett is pretty interesting. Instead of asking what various people mean when they say consciousness he just assumes he knows and declares it nonexistent. The idea that maybe he doesn’t understand what other people mean with the term doesn’t come up in his thought.
Dennett writes about how detailed visual hallucinations are impossible. I have had experiences where what I visually perceived didn’t change much whether or not I closed my eyes. It was after I spent 5 days in an artificial coma. I know two additional people, whom I meet face to face, who have had similar experiences.
I also have access to various accounts of people hallucinating things in other contexts via hypnosis. My own ability to let myself go is unfortunately not good, so I still lack first-hand accounts of some other hallucinations.
A week ago I spoke at our local LW meetup with someone who said that while “IQ” obviously exists, “free will” obviously doesn’t. At that point in time I didn’t know exactly how to resolve the issue, but it seems to me that those are both concepts that exist on much the same level. You won’t find any IQ atoms and you won’t find any free-will atoms, but they are still mental concepts that can be used to model things about the real world.
That’s a problem that arises from not having a well-defined idea of what it means for concepts to exist. In practice that leads to terms like depression getting defined by committee and written down in the DSM-V, with people simply assuming that depression exists without asking themselves in what way it exists. If people would ask themselves in what way it exists, that might provide ground for a new way to think about depression.
The problem with Korzybski is that he’s hard to read. Reading and understanding him is going to be hard work for most people who are not exposed to that kind of thinking.
What might be more readable is Barry Smith’s paper “Against Fantology”. It’s only 20 pages.
I think that’s what the New Atheists like Dennett do. They simply pretend that the things that don’t fit in their worldview don’t exist.
I think you’re being unfair to Dennett. He actually has availed himself of the findings of other fields, and has been at the consciousness shtick for decades. He may not agree, but it’s unlikely he is unaware.
And when did he say consciousness was nonexistent?
Cite? That seems a rather odd thing for him to say, and not particularly in his ideological interests.
Cite here? Again, except for supernatural bogeymen, my experience of him is that he recognizes that all sorts of mental events exists, but maybe not in the way that people suppose.
Not accurate. If those things don’t fit in their world views, they don’t exist in them, so they’re not pretending.
On the general brouhaha with Chapman, I seem to have missed most of that. He did one post on Jaynes and A_p, which I read, as I’ve always been interested in that particular branch of Jaynes’s work. But the post made a fundamental mistake, IMO and in the opinion of others, and I think Chapman admitted as much before all of his exchanges were over. So even with Chapman running the scoreboard, he’s behind on points.
Well, for one thing, Chapman was (at least at one point) a genuine, credentialed AI researcher and a good fraction of content on Less Wrong seems to be a kind of armchair AI-research. That’s the outside view, anyway. The inside view (from my perspective) matches your evaluation: he seems just plain wrong.
I think a few people here are credentialed, or working on their credentials in machine learning.
But almost everything useful I learned, I learned by just reading the literature. There were three main guys I thought had good answers—David Wolpert, Jaynes, and Pearl. I think time has put its stamp of approval on my taste.
Reading more from Chapman, he seems fairly reasonable as far as AI goes, but he’s got a few ideological axes to grind against some straw men.
On his criticisms of LW and Bayesianism, is there anyone here who doesn’t realize you need algorithms and representations beyond Bayes Rule? I think not too long ago we had a similar straw man massacre where everyone said “yeah, we have algorithms that do information processing other than Bayes rule—duh”.
And he really should have stuck it out longer in AI, as Hinton has gone a long way to solving the problem Chapman thought was insurmountable—getting proper representation of the space to analyze from the data without human spoon feeding. You need a hidden variable model of the observable data, and should be able to get it from prediction of subsets of the observables using the other observables. That much was obvious, it just took Hinton to find a good way to do it. Others are coming up with generalized learning modules and mapping them to brain constructs. There was never any need to despair of progress.
But you don’t have a complete fossil record, therefore Creationism!
Obviously that’s a problem. This somewhat confirms my comment to Phil, that linking the statistical issue to food dyes made reception of his claims harder as it better fit your pattern than a general statistical argument.
But from the numbers he reported, the basic eyeball test of the data leaves me thinking that food dyes may have an effect. Certainly if you take the data alone, without priors, I’d conclude that more likely than not, food dyes have an effect. That’s how I would interpret the 84% significance threshold—probably there is a difference. Do you agree?
Unfortunately, I don’t have JAMA access to the paper to really look at the data, so I’m going by the 84% significance threshold.
I made up the 84% threshold in my example, to show what can happen in the worst case. In this study, what they found was that food dye decreased hyperactivity, but not enough to pass the threshold. (I don’t know what the threshold was or what confidence level it was set for; they didn’t say in the tables. I assume 95%.)
If they had passed the threshold, they would have concluded that food dye affects behavior, but would probably not have published because it would be an embarrassing outcome that both camps would attack.
To be clear, then, you’re not claiming that any evidence in the paper amounts to any kind of good evidence that an effect exists?
You’re making a general argument about the mistaken conclusion of jumping from “failure to reject the null” to a denial that any effect exists.
Yes, I’m making a general argument about that mistaken conclusion. The F-test is especially tricky, because you know you’re going to find some difference between the groups. What difference D would you expect to find if there is in fact no effect? That’s a really hard question, and the F-test dodges it by using the arbitrary but standard 95% confidence interval to pick a higher threshold, F. Results between D and F would still support the hypothesis that there is an effect, while results below D would be evidence against that hypothesis. Not knowing what D is, we can’t say whether failure of an F-test is evidence for or against the hypothesis.
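One way to see this concretely: compare the likelihood of a sub-threshold result under “no effect” versus under a modest assumed effect. All the numbers here are illustrative:

```python
from scipy import stats

# All numbers are illustrative. The measured difference falls short of the
# one-tailed 95% cutoff (about 1.64 standard errors) but is still more likely
# under a modest assumed effect than under "no effect".
se = 1.0                # standard error of the group difference (assumed)
observed = 1.2          # measured difference, below the ~1.64 cutoff
assumed_effect = 1.0    # effect size hypothesized under H1 (assumed)

like_h0 = stats.norm.pdf(observed, loc=0.0, scale=se)
like_h1 = stats.norm.pdf(observed, loc=assumed_effect, scale=se)
print(f"likelihood ratio H1:H0 = {like_h1 / like_h0:.1f}")
# About 2:1 in favor of an effect, even though the result "failed" the test.
```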
I’d add to the basic statistical problem the vast overgeneralization and bad decision theory.
You hit on one part of that, the generalization to the entire population.
People are different.
But even if they’re the same, U-shaped response curves make it unlikely to find a signal: you have to have the Goldilocks amount to show an improvement. People vary over time, going in and out of the Goldilocks range. So when you add something, you’ll be pushing some people into the Goldilocks range and some people out.
It also comes from multiple paths to the same disease. A disease is a set of observable symptoms, not the varying particular causes of the same symptoms. Of course it’s hard to find the signal in a batch of people clustered into a dozen different underlying causes for the same symptoms.
But the bad decision theory is the worst part, IMO. If you have a chronic problem, a 5% chance of a cure from a low-risk, low-cost intervention is great. But getting a 5% signal out of black-box testing regimes biased against false positives is extremely unlikely, and the bias against interventions that “don’t work” keeps many doctors from trying perfectly safe treatments that have a reasonable chance of working.
The whole outlook is bad. It shouldn’t be “find me a proven cure that works for everyone”. It should be “find me interventions to control the system in a known way.” Get me knobs to turn, and let’s see if any of the knobs work for you.
I believe Knight posted links to fulltext at http://lesswrong.com/lw/h56/the_universal_medical_journal_article_error/8pne
I haven’t looked but I suspect I would not agree and that you may be making the classic significance misinterpretation.
I think the problem is that you talked about statistics in a hand-wavy way and as a result people misunderstood you.
It also didn’t help that the way you interpreted the logical structure of the paper ignored the standard steelmanning of frequentist papers.
What do you mean by your second sentence?
For example, in messy topics like biology, most instances of “all” should be replaced with “most”. In other words, people were translating the universal statements into probabilistic statements. They were subsequently confused when you insisted on treating the problem as logical rather than statistical.
This seems to be a very common nerd argument failure mode.
What is the antecedent of “this”? This isn’t a rhetorical question, I honestly can’t figure out which of several possibilities you’re referring to.
Responding to claims as if they are meant literally, or to arguments as if they’re deductive logical arguments.
It is because it is a statistical problem that you can’t replace “all” with “most”. The F-value threshold was calculated assuming “all”, not “most”. You’d need a different threshold if you don’t mean “all”.
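Here’s a quick simulation sketch of that point (my own toy numbers, not from any study): the same average effect is noticeably harder to detect with the standard test when it is concentrated in a minority of responders rather than spread uniformly, which is what the “all” calibration quietly assumes away.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, trials, alpha = 40, 20_000, 0.05

def power(effects):
    # one-tailed two-sample t-test power, estimated by simulation
    control = rng.normal(0.0, 1.0, (trials, n))
    treated = rng.normal(0.0, 1.0, (trials, n)) + effects
    t = stats.ttest_ind(treated, control, axis=1).statistic
    return np.mean(t > stats.t.ppf(1 - alpha, df=2 * n - 2))

uniform = np.full(n, 0.5)        # "all": every subject shifts by 0.5
minority = np.zeros(n)
minority[: n // 10] = 5.0        # "some": 10% shift by 5.0, same average effect

print(f"power, uniform effect on all:   {power(uniform):.2f}")   # roughly 0.7
print(f"power, same mean effect on 10%: {power(minority):.2f}")  # noticeably lower
```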
Also, the people I am complaining about explicitly use “all” when they interpret medical journal articles in which a test for an effect was failed as having proven that the effect does not exist for any patients.
I’m not sure if I was included among “people” but in retrospect it seems like I was simply doing this. I’m sorry.
I skimmed over the initial first-order logic bits, was bored by the statistics bits, and came away feeling like the conclusion was pedantic and mostly well-known among my peers.
Thank you for the answer. Now we know (from the many upvoted disagreements in the comments, and the downvoting of every comment I made correcting their mistakes) that the conclusion is not well-known among readers of LessWrong.
I think that the universal quantifier in
is not appropriate.
The original statement
only implies that artificial food coloring was responsible for all children’s hyperactivity, not that children who ever ate artificial food coloring would inevitably have hyperactivity. So the formula without the universal quantifier is more reasonable, and thus the final statement of the article is without problem.
No, you are picking up on the lack of details about time that I mentioned. You really don’t want me to write a proposition incorporating the time relationship between when cookies were eaten and when and how behavior was measured. The formula without a quantifier wouldn’t even be well-formed. It would have no meaning.
OK, I agree that the word ‘inevitably’ is ambiguous. Regardless of the accuracy of the literal-to-logical translations, I think the reason the logical expression of the article’s opening statement does not match that of its final conclusion, as your logical reasoning shows, is that the writers were not doing logical deduction; they were doing medical research, and thus proposing something new, not something of equivalent logical consequence.
Their first statement:
only implies that they did not buy the hypothesis, which does not necessarily imply that they accepted the negation of the hypothesis, which corresponds to your first formula:
equivalently:
Even if they actually accepted the negation of the hypothesis, that is to say, accepted your first formula, the final conclusion they reached through the research is that:
whose correspondent logical expression is your second formula:
This formula seems stronger than the first one:
From my point of view, I don’t think the medical researchers were intentionally or arbitrarily generalizing their results, or just making logical mistakes. They took a stance on an existing hypothesis and gave a new one through the article; in this case, a new, stronger hypothesis (whether ‘stronger’ applies depends on whether they actually just negated the original hypothesis).
I think their only fault is that they failed to keep their own medical views within one article logically consistent.
It would’ve been very helpful if some sort of glossary, or even a Wikipedia link, had been provided before diving into the use of notational characters such as those in “∀x !P(x)”.
Although this post covers an important topic, the first few sentences almost lost me completely, even though I learned what all those characters meant at one time.
And, as LessWrong is rather enamored with statistics, consider that by writing P(x,y), readers have a 50% chance of getting the opposite meaning unless they have very good recall. :)
That would be interesting if true. I recommend finding another one, since you say they’re so plentiful. And I also recommend reading it carefully, as the study you chose to make an example of is not the study you were looking for. (If you don’t want to change the exemplar study, it may also be of interest what your prior is for “PhilGoetz is right where all of LessWrong is wrong”.)
The different but related question of the proportion of people (especially scientists, regulators, and legislators) misinterpreting such studies might also be worth looking into. It wouldn’t surprise me if people who know better make the same mistake as your logic students, possibly in their subconscious probability sum.
Why is this post not visible under “Main->New”?
The post has been moved to Discussion, I don’t know by whom. Edit: My guess would be EY or Vladimir Nesov. Edit2: Back to Main we go. The Wheels on the Bus Go Round and Round … Edit3: Back to Discussion. Edit4: Back to Main. Whodunnit? Edit 5: Deleted! Wait … sighting confirmed, back in Discussion! Is there no stop condition? Is it a ghost? Is it unkillable?
By whom?
Okay, folks. Now you’re just being assholes. Why are you downvoting me for asking who moved my post?
You have not submitted to their social pressure in other parts of this thread. This offends people (for socio-politically rational reasons). They will now attack just about anything you write in this context. You could write the equations for an intuitive and correct physical Theory of Everything and you would still get (some) downvotes.
Note that calling them assholes makes it easier for them and seems to benefit you not in the slightest.
Unfortunately, there’s an error in your logic: You call that type of medical journal article error “universal”, i.e. applicable in all cases. Clearly a universal quantifier if I ever saw one.
That means that for all medical journal articles, it is true that they contain that error.
However, there exists a medical journal article that does not contain that error.
Hence the medical journal error is not universal, in contradiction to the title.
First logical error … and we’re not even out of the title? Oh dear.
Perhaps a clearer title would have been ‘A Universal Quantifier Medical Journal Article Error’. Bit of a noun pile, but the subject of the post is an alleged unjustified use of a universal quantifier in a certain article’s conclusion.
By the way, I think PhilGoetz is 100% correct on this point—i.e., upon failure to prove a hypothesis using standard frequentist techniques, it is not appropriate to claim a result.
Oh come on.
This is not an opinion piece. It is not a difficult or tricky piece. It is CORRECT. If you disagree with me, you are wrong. Read it again.
42% of the voters so far down-voted this piece, indicating that LessWrong readers perform slightly better than random at logic. I am afraid I must give LessWrong as a whole an F on this logic test.
I didn’t downvote, but it’s worth keeping in mind that a downvote doesn’t necessarily convey substantive disagreement with the point you’re making. There are a number of other (appropriate) reasons for downvoting a post in Main.
It’d be really great if PhilGoetz and some other commenters in this thread understood this. So you’ve proven a mathematical statement—why should that net one karma?
The continued downvote whining, as if their existence represented a substantial failing of “LessWrong as a whole” is really beginning to grate on my nerves.
Are you intending the irony there?
BTW, it seems to me that your point is about an error of statistical inference, not first-order logic. Specifically, experimenters not noticing their assumption that the population is homogeneous. When the assumption is wrong, wrong inferences can follow. That is, inferences which correctly follow from the assumption and the experimental results, but which are nevertheless false.
No; the difference is that in your interpretation, we would say “wrong assumption, hence discard results.” In my interpretation, if they had found a significant effect, they would have been able to correctly conclude that there was an effect; and finding no significant effect, they could have correctly concluded the negation of that.
I don’t know what irony you refer to.
I am confused. In your article you said that researchers on food colouring and hyperactivity found no significant effect and concluded there was none, and criticised them for doing that since, according to later work, there is a significant effect among a small subpopulation. Now you are saying that they correctly concluded that there was no significant effect (“no significant effect” being the negation of “significant effect”).
What I said in my comment above is misleading. If they had found an effect, it would have meant something, although they would have again stated it as stronger than it really was: “For all children, food dye affects behavior.” There could in fact have been one food dye monster child whose behavior was radically altered by food dye. Having failed to find an effect, they can conclude that they failed to find an effect on all children, which is still useful information but in practice would be very difficult to use correctly.
One of my fears about displaying such percentages was that, instead of the previous practice of whining whenever the sum of the votes received is negative, people can now whine about a single downvote, even if the sum is positive.
Note that some of the downvotes might have been cast before you edited the article, when it was even worse than it is now.