I asked my professor, “But don’t we want to know the probability of the hypothesis we’re testing given the data, not the other way around?” The reply was something about how this was the best we could do.
One senses that the author (the one in the student role) has neither understood the relative-frequency theory of probability nor performed any empirical research using statistics—which lends the essay the tone of an arrogant neophyte. The same perhaps for the professor. (Which institution is on report here?) Frequentists reject the very concept of “the probability of the theory given the data.” They take probabilities to be objective, so they think it a category error to remark about the probability of a theory: the theory is either true or false, and probability has nothing to do with it.
You can reject relative-frequentism (I do), but you can’t successfully understand it in Bayesian terms. As a first approximation, it may be better understood in falsificationist terms. (Falsificationism keeps getting trotted out by Bayesians, but that construct has no place in a Bayesian account. These confusions are embarrassingly amateurish.) The Fisher paradigm is that you want to show that a variable made a real difference—that what you discovered wasn’t due to chance. However, there’s always the possibility that chance intervened, so the experimenter settles for a low probability that chance alone was responsible for the result. If the probability (the p value) is low enough, you treat it as sufficiently unlikely not to be worth worrying about, and you can reject the hypothesis that the variable made no difference.
If, like me, you think it makes sense to speak of subjective probabilities (whether exclusively or along with objective probabilities), you will usually find an estimate of the probability of the hypothesis given the data, as generated by Bayesian analysis, more useful. That doesn’t mean it’s easy or even possible to do a Bayesian analysis that will be acceptable to other scientists. To get subjective probabilities out, you must put subjective probabilities in. Often the worry is said to be the infamous problem of estimating priors, but in practice the likelihood ratios are more troublesome.
Let’s say I’m doing a study of the effect of arrogance on a neophyte’s confidence that he knows how to fix science. I develop and norm a test of Arrogance/Narcissism and also an inventory of how strongly held a subject’s views are in the philosophy of science and the theory of evidence. I divide the subjects into two groups according to whether they fall above or below the A/N median. I then use Fisherian methods to determine whether there’s an above-chance level of unwarranted smugness among the high A/N group. Easy enough, but limited. It doesn’t tell me what I most want to know: how much credence I should put in the results. I’ve shown there’s evidence for an effect, but there’s always evidence for some effect: the null hypothesis, strictly speaking, is always false. No two entities outside of fundamental physics are exactly the same.
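Concretely, that Fisherian step might look something like the following rough sketch, with entirely invented scores and a simple permutation test standing in for whatever procedure one would actually use:

```python
import random

# Hypothetical (arrogance, smugness) scores for each subject.
subjects = [(random.gauss(50, 10), random.gauss(50, 10)) for _ in range(80)]

# Median split on the Arrogance/Narcissism score.
median_an = sorted(a for a, _ in subjects)[len(subjects) // 2]
high = [s for a, s in subjects if a >= median_an]
low = [s for a, s in subjects if a < median_an]

observed = sum(high) / len(high) - sum(low) / len(low)

# Null hypothesis: the grouping is irrelevant. Estimate how often chance alone
# produces a difference at least this large (a one-sided p value).
pooled = high + low
extreme = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)
    diff = sum(pooled[:len(high)]) / len(high) - sum(pooled[len(high):]) / len(low)
    if diff >= observed:
        extreme += 1
p_value = extreme / trials
print(f"observed difference = {observed:.2f}, p = {p_value:.3f}")
# A small enough p lets me reject "arrogance made no difference"; it still
# says nothing about how much credence to put in my theory.
```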
Bayesian analysis promises more, but whereas other scientists will respect my crude frequentist analysis as such—although many will denigrate its real significance—many will reject my Bayesian analysis out of hand due to what must go into it. Let’s consider just one of the factors that must enter the Bayesian analysis. I must estimate the probability that the ‘high-Arrogance’ subjects will score higher on Smugness if my theory is wrong, that is, if arrogance really has no effect on Smugness. Certainly my Arrogance/Narcissism test doesn’t measure the intended construct without impurities. I must estimate the probability that any of these impurities, alone or in combination, confounds the results. Maybe high-Arrogance scorers are dumber in addition to being more arrogant, and that is what’s responsible for some of the correlation. Somehow, I must come up with a responsible way to estimate the probability of getting my results if Arrogance had nothing to do with Smugness. Perhaps I can make an informed approximation, but it will be unlikely to dovetail with the estimates of other scientists. Soon we’ll be arguing about my assumptions—and what we’ll be doing will be more like philosophy than empirical science.
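To make the trouble concrete, here is a toy sketch (all numbers invented) of how the posterior swings with that one disputed ingredient, the probability of getting the result if Arrogance has nothing to do with Smugness:

```python
# Toy Bayesian update. H = "arrogance raises smugness", D = the observed result.
prior_H = 0.5          # assumed prior credence in the hypothesis
p_D_given_H = 0.8      # assumed probability of the result if H is true

# The troublesome quantity: p(D | not-H), the chance that impurities and
# confounds produce the result even though arrogance does nothing.
for p_D_given_notH in (0.05, 0.2, 0.4, 0.6):
    posterior = (p_D_given_H * prior_H) / (
        p_D_given_H * prior_H + p_D_given_notH * (1 - prior_H))
    print(f"p(D|~H) = {p_D_given_notH:.2f}  ->  p(H|D) = {posterior:.2f}")
# The posterior runs from roughly 0.94 down to roughly 0.57, so the conclusion
# hinges on an estimate other scientists may well not share.
```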
The lead essay provides a biased picture of the advantages of Bayesian methods by completely ignoring their problems. A poor diet for budding rationalists.
Frequentists reject the very concept of “the probability of the theory given the data.” They take probabilities to be objective, so they think it a category error to remark about the probability of a theory:
Then they should also reject the very concept of “the probability of the data given the theory”, since that quantity has “the probability of the theory” explicitly in the denominator.
You are reading “the probability of the data D given the theory T” to mean p(D | T), which in turn is short for a ratio p(D & T)/p(T) of probabilities with respect to some universal prior p. But, for the frequentist, there is no universal prior p being invoked.
Rather, each theory comes with its own probability distribution p_T over data, and “the probability of the data D given the theory T” just means p_T(D). The different distributions provided by different theories don’t have any relationship with one another. In particular, the different distributions are not the result of conditioning on a common prior. They are incommensurable, so to speak.
The different theories are just more or less correct. There is a “true” probability of the data, which describes the objective propensity of reality to yield those data. The different distributions from the different theories are comparable only in the sense that they each get that true distribution more or less right.
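A small sketch of that picture, with illustrative numbers only: each theory supplies its own distribution over the data, and the theories are compared solely by how well each fits what was actually observed, with no prior over theories anywhere.

```python
import math

# Each "theory" of a coin is just its own distribution over the data.
theories = {"fair coin": 0.5, "slightly loaded": 0.6, "heavily loaded": 0.9}
heads, flips = 14, 20   # the observed data D

for name, p in theories.items():
    # p_T(D): the probability of the data under theory T's own distribution.
    p_T_D = math.comb(flips, heads) * p**heads * (1 - p)**(flips - heads)
    print(f"{name}: p_T(D) = {p_T_D:.4f}")
# No common prior appears anywhere; each theory is judged only as a better or
# worse stand-in for the "true" distribution that generated the data.
```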
Not LessWronger Bayesians, in my experience.

What about:

Do you believe that elan vital explains the mysterious aliveness of living beings? Then what does this belief not allow to happen—what would definitely falsify this belief? (emphasis added) — Making Beliefs Pay Rent (in Anticipated Experiences)
It would be more accurate to say that LW-style Bayesians consider falsificationism to be subsumed under Bayesianism as a sort of limiting case. Falsificationism as originally stated (i.e., confirmations are irrelevant; only falsifications advance knowledge) is an exaggerated version of a mathematically valid claim. From An Intuitive Explanation of Bayes’ Theorem:
Previously, the most popular philosophy of science was probably Karl Popper’s falsificationism—this is the old philosophy that the Bayesian revolution is currently dethroning. Karl Popper’s idea that theories can be definitely falsified, but never definitely confirmed, is yet another special case of the Bayesian rules; if p(X|A) ~ 1—if the theory makes a definite prediction—then observing ~X very strongly falsifies A. On the other hand, if p(X|A) ~ 1, and we observe X, this doesn’t definitely confirm the theory; there might be some other condition B such that p(X|B) ~ 1, in which case observing X doesn’t favor A over B. For observing X to definitely confirm A, we would have to know, not that p(X|A) ~ 1, but that p(X|~A) ~ 0, which is something that we can’t know because we can’t range over all possible alternative explanations. For example, when Einstein’s theory of General Relativity toppled Newton’s incredibly well-confirmed theory of gravity, it turned out that all of Newton’s predictions were just a special case of Einstein’s predictions.
You can even formalize Popper’s philosophy mathematically. The likelihood ratio for X, p(X|A)/p(X|~A), determines how much observing X slides the probability for A; the likelihood ratio is what says how strong X is as evidence. Well, in your theory A, you can predict X with probability 1, if you like; but you can’t control the denominator of the likelihood ratio, p(X|~A)—there will always be some alternative theories that also predict X, and while we go with the simplest theory that fits the current evidence, you may someday encounter some evidence that an alternative theory predicts but your theory does not. That’s the hidden gotcha that toppled Newton’s theory of gravity. So there’s a limit on how much mileage you can get from successful predictions; there’s a limit on how high the likelihood ratio goes for confirmatory evidence.
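As a quick numeric rendering of the likelihood-ratio arithmetic in the quoted passage (numbers invented): even a theory that predicts X with probability 1 cannot control the denominator.

```python
# Posterior odds = prior odds * likelihood ratio, as in the quoted passage.
def posterior(prior_A, p_X_given_A, p_X_given_notA):
    odds = (prior_A / (1 - prior_A)) * (p_X_given_A / p_X_given_notA)
    return odds / (1 + odds)

# Theory A predicts X with probability 1, but rivals also predict X.
print(posterior(0.2, 1.0, 0.5))    # ~0.33: observing X helps only modestly
print(posterior(0.2, 1.0, 0.05))   # ~0.83: a big jump requires p(X|~A) to be small
# The denominator p(X|~A) is the part you don't control: the "hidden gotcha".
```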
[i]n your theory A, you can predict X with probability 1...
This seems the key step for incorporating falsification as a limiting case; I contest it. The rules of Bayesian rationality preclude assigning an a priori probability of 1 to a synthetic proposition: nothing empirical is so certain that refuting evidence is impossible. (Is that assertion self-undermining? I hope that worry can be bracketed.) As long as you avoid assigning probabilities of 1 or 0 to priors, you will never get an outcome at those extremes.
But since p(X|A) is always “intermediate,” observing X will never strictly falsify A—which is a good thing because the falsification prong of Popperianism has proven at least as scientifically problematic as the nonverification prong.
I don’t think falsification can be squared with Bayes, even as a limiting case. In Bayesian theory, verification and falsification are symmetric (as the slider metaphor really indicates). In principle, you can’t strictly falsify a theory empirically any more (or less) than you can verify one. Verification, as the quoted essay confirms, is blocked by the > 0 probability mandatorily assigned to unpredicted outcomes; falsification is blocked by the < 1 probability mandatorily assigned to the expected results. It is no less irrational to be certain that X holds given A than to be certain that X fails given not-A. You are no more justified in assuming absolutely that your abstractions don’t leak than in assuming you can range over all explanations.
In principle, you can’t strictly falsify a theory empirically any more (or less) than you can verify one.
This throws the baby out with the bathwater; we can falsify and verify to degrees. Refusing the terms verify and falsify because we are not able to assign infinite credence seems like a mistake.
I agree; that’s why “strictly.” But you seem to miss the point, which is that falsification and verification are perfectly symmetric: whether you call the glass half empty or half full on either side of the equation wasn’t my concern.
Two basic criticisms apply to Popperian falsificationism: 1) it ignores verification (although the “verisimilitude” doctrine tries to overcome this limitation); and 2) it does assign infinite credence to falsification.
No. 2 doesn’t comport with the principles of Bayesian inference, but seems part of LW Bayesianism (your term):
This allowance of a unitary probability assignment to evidence conditional on a theory is a distortion of Bayesian inference. The distortion introduces an artificial asymmetry into the Bayesian handling of verification versus falsification. It is irrational to pretend—even conditionally—to absolute certainty about an empirical prediction.
[i]n your theory A, you can predict X with probability 1...
[...] The rules of Bayesian rationality preclude assigning an a priori probability of 1 to a synthetic proposition: nothing empirical is so certain that refuting evidence is impossible.
We all agree on this point. Yudkowsky isn’t supposing that anything empirical has probability 1.
In the line you quote, Yudkowsky is saying that even if theory A predicts data X with probability 1 (setting aside the question of whether this is even possible), confirming that X is true still wouldn’t push our confidence in the truth of A past a certain threshold, which might be far short of 1. (In particular, merely confirming a prediction X of A can never push the posterior probability of A above p(A|X), which might still be too small because too many alternative theories also predict X). A falsification, on the other hand, can drive the probability of a theory very low, provided that the theory makes some prediction with high confidence (which needn’t be equal to 1) that has a low prior probability.
That is the sense in which it is true that falsifications tend to be more decisive than confirmations. So, a certain limited and “caveated”, but also more precise and quantifiable, version of Popper’s falsificationism is correct.
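A toy calculation, with invented numbers, of the asymmetry being described here: confirming a prediction leaves the theory capped near p(A|X), while failing a confident prediction is close to decisive.

```python
# Theory A predicts X with high confidence, but many rival theories predict X too.
p_A = 0.3
p_X_given_A, p_X_given_notA = 0.99, 0.5

p_X = p_X_given_A * p_A + p_X_given_notA * (1 - p_A)

p_A_given_X = p_X_given_A * p_A / p_X                     # confirmation
p_A_given_notX = (1 - p_X_given_A) * p_A / (1 - p_X)      # falsification

print(f"after observing X:     p(A) = {p_A_given_X:.3f}")    # ~0.46, well short of 1
print(f"after observing not-X: p(A) = {p_A_given_notX:.3f}") # ~0.008, driven very low
```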
But since p(X|A) is always “intermediate,” observing X will never strictly falsify A—which is a good thing because the falsification prong of Popperianism has proven at least as scientifically problematic as the nonverification prong.
Yes, no observation will drive the probability of a theory down to precisely 0. The probability can only be driven very low. That is why I called falsificationism “an exaggerated version of a mathematically valid claim”.
I don’t think falsification can be squared with Bayes, even as a limiting case. In Bayesian theory, verification and falsification are symmetric (as the slider metaphor really indicates). In principle, you can’t strictly falsify a theory empirically any more (or less) than you can verify one. Verification, as the quoted essay confirms, is blocked by the > 0 probability mandatorily assigned to unpredicted outcomes; falsification is blocked by the < 1 probability mandatorily assigned to the expected results. It is no less irrational to be certain that X holds given A than to be certain that X fails given not-A. You are no more justified in assuming absolutely that your abstractions don’t leak than in assuming you can range over all explanations.
As you say, getting to probability 0 is as impossible as getting to probability 1. But getting close to probability 0 is easier than getting equally close to probability 1.
This asymmetry is possible because different kinds of propositions are more or less amenable to being assigned extremely high or low probability. It is relatively easier to show that some data has extremely high or low probability (whether conditional on some theory or a priori) than it is to show that some theory has extremely high conditional probability.
Fix a theory A. It is very hard to think up an experiment with a possible outcome X such that p(A | X) is nearly 1. To do this, you would need to show that no other possible theory, even among the many theories you haven’t thought of, could have a significant amount of probability, conditional on observing X.
It is relatively easy to think up an experiment with a possible outcome X, which your theory A predicts with very high probability, but which has very low prior probability. To accomplish this, you only need to exhibit some other a priori plausible outcomes different from X.
In the second case, you need to show that the probability of some data is extremely high a posteriori and extremely low a priori. In the first case, you need to show that the a posteriori probability of a theory is extremely high.
In the second case, you only need to construct enough alternative outcomes to certify your claim. In the first case, you need to prove a universal statement about all possible theories.
One root of the asymmetry is this: As hard as it might be to establish extreme probabilities for data, at least the data usually come from a reasonably well-understood parameter space (the real numbers, say). But the space of all possible theories is not well understood, at least not in any computationally tractable way.
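Here is a rough sketch of the bookkeeping behind that point, with all masses invented: an upper bound on the prior probability of an outcome needs only the alternatives you have exhibited, while a lower bound on p(A|X) is hostage to whatever prior mass sits on theories nobody has articulated.

```python
# Certifying "p(X) is low": exhibit alternative outcomes with known prior mass.
alternatives_mass = [0.35, 0.25, 0.2]          # assumed prior masses of outcomes other than X
p_X_at_most = 1 - sum(alternatives_mass)       # p(X) <= 0.20, certified by construction

# Certifying "p(A|X) is high": the unarticulated theories get in the way.
prior = {"A": 0.2, "B": 0.1}                   # theories actually written down
p_X_given = {"A": 0.95, "B": 0.90}
unknown_mass = 1 - sum(prior.values())         # 0.7 of the prior sits on theories not yet thought of

numerator = prior["A"] * p_X_given["A"]
known_rivals = prior["B"] * p_X_given["B"]
best_case = numerator / (numerator + known_rivals)                 # unknown theories all rule out X
worst_case = numerator / (numerator + known_rivals + unknown_mass) # unknown theories all predict X

print(f"p(X) <= {p_X_at_most:.2f}")
print(f"p(A|X) could be anywhere from {worst_case:.2f} to {best_case:.2f}")
# The spread (~0.19 to ~0.68) depends entirely on theories we haven't enumerated.
```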
In the second case, you only need to construct enough alternative outcomes to certify your claim. In the first case, you need to prove a universal statement about all possible theories.
All these arguments are at best suggestive. Our abductive capacities suggest that proving a universal statement about all possible theories isn’t necessarily hard. Your arguments, I think, flow from and then confirm a nominalistic bias: accept concrete data; beware of general theories.
There are universal statements known with greater certainty than any particular data, e.g., that life evolved from inanimate matter and that mind always supervenes on physics.
I agree that
1) some universal statements about all theories are very probable, and that
2) some of our theories are more probable than any particular data.
I’m not seeing why either of these facts is in tension with my previous comment. Would you elaborate?
The claims I made are true of certain priors. I’m not trying to argue you into using such a prior. Right now I only want to make the points that (1) a Bayesian can coherently use a prior satisfying the properties I described, and that (2) falsificationism is true, in a weakened but precise sense, under such a prior.