Via http://www.scottbot.net/HIAL/?p=24697 I learned that Wikipedia actually has a good roundup of misunderstandings of p-values:

The p-value does not in itself allow reasoning about the probabilities of hypotheses; this requires multiple hypotheses or a range of hypotheses, with a [prior distribution][1] over them, as in [Bayesian statistics][2], in which case one uses a [likelihood function][3] for all possible hypotheses, instead of the p-value for a single null hypothesis.
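A minimal sketch of that Bayesian alternative, assuming two candidate coin hypotheses, made-up data, and a 50/50 prior (none of which come from the text above):

```python
# Bayesian comparison of two hypotheses: put a prior over them and update
# it with the binomial likelihood of the observed data (Bayes' rule).
from scipy.stats import binom

k, n = 60, 100  # assumed data: 60 heads in 100 flips

hypotheses = {"H0: p=0.5": 0.5, "H1: p=0.6": 0.6}
prior = {"H0: p=0.5": 0.5, "H1: p=0.6": 0.5}  # assumed 50/50 prior

# Likelihood of the data under each hypothesis
likelihood = {h: binom.pmf(k, n, p) for h, p in hypotheses.items()}

# Posterior ∝ prior × likelihood, normalized across the hypotheses
evidence = sum(prior[h] * likelihood[h] for h in hypotheses)
posterior = {h: prior[h] * likelihood[h] / evidence for h in hypotheses}

for h, post in posterior.items():
    print(f"{h}: posterior = {post:.3f}")
```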
The p-value refers only to a single hypothesis, called the null hypothesis, and does not make reference to or allow conclusions about any other hypotheses, such as the [alternative hypothesis][4] in Neyman–Pearson [statistical hypothesis testing][5]. In that approach one instead has a decision function between two alternatives, often based on a [test statistic][6], and one computes the rates of [Type I and type II errors][7] as α and β. However, the p-value of a test statistic cannot be directly compared to these error rates α and β; instead, the statistic is fed into a decision function whose error rates are fixed in advance.
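To make α and β concrete, here is a small sketch for a one-sided z-test of H0: μ = 0 against H1: μ = 0.5, with σ = 1 and n = 25 (all assumed numbers):

```python
# Neyman–Pearson error rates for a one-sided z-test: fix the decision rule
# in advance, then compute its Type I (α) and Type II (β) error rates.
import math
from scipy.stats import norm

mu1, sigma, n, alpha = 0.5, 1.0, 25, 0.05
se = sigma / math.sqrt(n)

# Decision rule: reject H0 when the sample mean exceeds c,
# where c is chosen so the Type I error rate is exactly α
c = norm.ppf(1 - alpha) * se

# β = probability of failing to reject when H1 is true
beta = norm.cdf(c, loc=mu1, scale=se)
print(f"alpha = {alpha}, beta = {beta:.3f}, power = {1 - beta:.3f}")
```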
There are several common misunderstandings about p-values.[[16]][8][[17]][9]
The p-value is not the probability that the null hypothesis is true, nor is it the probability that the alternative hypothesis is false; it is not connected to either of these. In fact, [frequentist statistics][10] does not, and cannot, attach probabilities to hypotheses. Comparison of [Bayesian][11] and classical approaches shows that a p-value can be very close to zero while the [posterior probability][12] of the null is very close to unity (if no alternative hypothesis has a large enough a priori probability and explains the results more easily). This is [Lindley’s paradox][13]. But there are also a priori probability distributions under which the [posterior probability][12] and the p-value have similar or equal values.[[18]][14]
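Lindley’s paradox is easy to reproduce numerically. A sketch with an assumed coin experiment: H0 says p = 0.5 exactly, H1 spreads its prior mass uniformly over (0, 1), and each hypothesis gets prior probability 1/2:

```python
# Lindley's paradox: the same data give a 'significant' p-value and a
# posterior that still strongly favors the null.
import math
from scipy.stats import binom, norm

n, k = 100_000, 50_350  # assumed data: 50,350 heads in 100,000 flips

# Frequentist two-sided p-value (normal approximation to the binomial)
z = (k - n / 2) / math.sqrt(n / 4)
p_value = 2 * norm.sf(abs(z))

# Marginal likelihoods: under H1, integrating the binomial likelihood
# against a uniform prior on p gives exactly 1 / (n + 1)
m0 = binom.pmf(k, n, 0.5)
m1 = 1 / (n + 1)
posterior_h0 = (0.5 * m0) / (0.5 * m0 + 0.5 * m1)

print(f"p-value      = {p_value:.4f}")       # ≈ 0.027: rejected at the 0.05 level
print(f"P(H0 | data) = {posterior_h0:.3f}")  # ≈ 0.96: H0 remains very probable
```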
The p-value is not the probability that a finding is “merely a fluke.” As the calculation of a p-value is based on the assumption that a finding is the product of chance alone, it patently cannot also be used to gauge the probability of that assumption being true. This is distinct from the p-value’s real meaning: it is the probability of obtaining results at least as extreme as those observed, if the null hypothesis is true.
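That meaning can be demonstrated by simulation, assuming some observed data (60 heads in 100 flips here) and a null hypothesis of a fair coin:

```python
# The p-value as a frequency: how often do experiments run under H0
# produce a result at least as extreme as the one observed?
import numpy as np

rng = np.random.default_rng(0)
n, k_obs = 100, 60  # assumed observation: 60 heads in 100 flips

# Simulate a million experiments in which the null (p = 0.5) is true
sims = rng.binomial(n, 0.5, size=1_000_000)

# One-sided p-value: the fraction of null experiments at least as extreme
p_sim = np.mean(sims >= k_obs)
print(f"simulated one-sided p-value ≈ {p_sim:.4f}")  # ≈ 0.028
```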
The p-value is not the probability of falsely rejecting the null hypothesis. This error is a version of the so-called [prosecutor’s fallacy][15].
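A back-of-the-envelope sketch of why this is a fallacy: the probability that a rejection is false depends on how many tested nulls are true and on the test’s power, not on α alone. The base rate and power below are assumptions:

```python
# P(H0 true | rejected) is not α: it also depends on the base rate of true
# nulls among the hypotheses tested, and on the power of the test.
alpha = 0.05
power = 0.8              # assumed chance of rejecting a false null
share_true_nulls = 0.9   # assume 90% of tested hypotheses are true nulls

false_rejects = share_true_nulls * alpha
true_rejects = (1 - share_true_nulls) * power
p_h0_given_reject = false_rejects / (false_rejects + true_rejects)

print(f"P(H0 true | rejected) = {p_h0_given_reject:.2f}")  # 0.36, not 0.05
```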
The p-value is not the probability that a replicating experiment would not yield the same conclusion. Quantifying the replicability of an experiment was attempted through the concept of [p-rep][16], which has been heavily [criticized][17].
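For reference, the approximation usually attributed to Killeen (2005) expresses p-rep in terms of a one-tailed p-value; a sketch of that formula (whose interpretation is precisely what the criticism targets):

```python
# Killeen's commonly cited approximation for p-rep from a one-tailed p-value
def p_rep(p: float) -> float:
    """Estimated probability that a replication finds an effect in the
    same direction (Killeen 2005; widely criticized)."""
    return 1.0 / (1.0 + (p / (1.0 - p)) ** (2.0 / 3.0))

print(f"p = 0.05 -> p_rep ≈ {p_rep(0.05):.2f}")  # ≈ 0.88
```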
The significance level, such as 0.05, is not determined by the p-value. Rather, the significance level is decided before the data are viewed, and is compared against the p-value, which is calculated after the test has been performed. (However, reporting a p-value is more useful than simply saying that the results were or were not significant at a given level, and allows readers to decide for themselves whether to consider the results significant.)
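A sketch of that intended workflow, with an assumed one-sample t-test and made-up data:

```python
# α is fixed before the data are seen; the p-value is computed afterwards
# and compared against it (and is worth reporting in full).
import numpy as np
from scipy.stats import ttest_1samp

alpha = 0.05  # chosen before the data are viewed

rng = np.random.default_rng(1)
data = rng.normal(loc=0.3, scale=1.0, size=50)  # assumed sample

result = ttest_1samp(data, popmean=0.0)
print(f"p = {result.pvalue:.4f}; "
      f"significant at alpha = {alpha}: {result.pvalue <= alpha}")
```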
The p-value does not indicate the size or importance of the observed effect (compare with [effect size][18]). The two do vary together, however: the larger the effect, the smaller the sample size required to obtain a significant p-value.
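A sketch of that trade-off: holding a tiny (assumed) standardized effect of 0.05 fixed, the one-sided p-value shrinks as the sample grows, so a small p need not mean a large effect:

```python
# Same effect size, growing sample: the p-value alone cannot distinguish
# a large effect in a small sample from a tiny effect in a huge one.
import math
from scipy.stats import norm

effect = 0.05  # assumed standardized effect size (in standard deviations)
for n in (100, 1_000, 10_000, 100_000):
    z = effect * math.sqrt(n)  # z-statistic when the observed mean equals the effect
    p = norm.sf(z)             # one-sided p-value
    print(f"n = {n:>6}: p = {p:.6f}")
```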