Wait—Bayesians can assign probabilities to things that are deterministic? What does that mean?
Absolutely!
The Bayesian philosophy is that probabilities are about states of knowledge. Probability is reasoning with incomplete information, not about whether an event is “deterministic”; probabilities still make sense in a completely deterministic universe. In a poker game, there are almost surely no quantum events influencing how the deck is shuffled. Classical mechanics, which is deterministic, suffices to predict the ordering of the cards. Even so, we have neither sufficient initial conditions (on all the particles in the dealer’s body and brain, and any incoming signals) nor the computational power to calculate that ordering. We can still use probability theory to work out the probabilities of various hand combinations and use them to guide our betting.

Incorporating knowledge of what cards I’ve been dealt, and which (if any) are public, is straightforward. Incorporating players’ actions and reactions is much harder, and not well enough defined for there to be a mathematically correct answer, but clearly we should use that knowledge when judging what types of hands our opponents are likely to have. If we count as the dealer shuffles and see that he only shuffled three or four times, then in principle (given a reasonable mathematical model of shuffling, such as the one Diaconis used to show that seven riffle shuffles are needed to randomize a deck) we can use the correlations left in the deck to get even more clues about opponents’ likely hands.
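To make the “incorporate what you know” step concrete, here is a minimal Monte Carlo sketch in Python that estimates the chance one opponent was dealt a pocket pair in hold’em, given my two hole cards. The game variant, card encoding, and helper names are just illustrative choices of mine, not anything fixed by the argument above.

```python
import random

# A 52-card deck as (rank, suit) pairs; ranks 2..14 (ace high), suits 0..3.
DECK = [(rank, suit) for rank in range(2, 15) for suit in range(4)]

def prob_opponent_pocket_pair(my_hole, trials=100_000, seed=0):
    """Estimate P(opponent was dealt a pair | my two hole cards) by simulation."""
    rng = random.Random(seed)
    remaining = [c for c in DECK if c not in my_hole]   # condition on what we know
    hits = 0
    for _ in range(trials):
        a, b = rng.sample(remaining, 2)                 # opponent's two hidden cards
        hits += (a[0] == b[0])
    return hits / trials

# Example: I hold the ace of spades and the king of hearts.
print(prob_opponent_pocket_pair([(14, 0), (13, 1)]))    # about 0.059 (exactly 72/1225)
```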
What would a Bayesian do instead of a T-test?
In most cases we’d step back and ask what you were trying to do, such that a t-test seemed like a good idea.
For those unaware, a t-test is a way of calculating a kind of “likelihood” for the null hypothesis: a measure of how likely the data are given that model. If the data are even moderately compatible, Frequentists say “we can’t reject it”. If the data are terribly unlikely under the null, Frequentists say it can be rejected, that is, that it’s worth looking at another model.
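For concreteness, the Frequentist recipe fits in a few lines of Python using scipy’s one-sample t-test; the measurements and the 0.05 threshold below are made-up placeholders:

```python
import numpy as np
from scipy import stats

data = np.array([5.1, 4.9, 5.6, 5.2, 4.8, 5.4, 5.3, 5.0])   # made-up measurements
t_stat, p_value = stats.ttest_1samp(data, popmean=5.0)       # null hypothesis: true mean is 5.0

# p_value is (roughly) how surprising data like ours would be if the null model were true.
if p_value < 0.05:
    print(f"p = {p_value:.3f}: reject the null; worth looking at another model")
else:
    print(f"p = {p_value:.3f}: can't reject the null")
```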
From a Bayesian perspective, this is somewhat backwards: we don’t really care how likely the data are given this model, P(D|M); after all, we actually got the data. We effectively want to know how useful the model is, now that we have this data. Some simple consistency requirements and scaling constraints mean that this usefulness has to act just like a probability. So let’s just call it the probability of the model given the data, P(M|D). A small bit of algebra gives us P(M|D) = P(D|M) P(M) / P(D), where P(D) is the sum over all models i of P(D|M_i) P(M_i), and P(M_i) is some “prior probability” of each model: how useful we think that model would be even without any data collected (but, importantly, with some background knowledge).
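Here is that formula turned into a minimal numerical sketch, comparing two made-up coin models; the particular biases, priors, and data are assumptions chosen purely for illustration:

```python
from scipy import stats

# Data: 8 heads in 10 flips of a coin of unknown provenance.
heads, flips = 8, 10

# Two candidate models, each a hypothesized bias, with prior probabilities P(M_i).
biases = {"fair (p=0.5)": 0.5, "loaded (p=0.8)": 0.8}
prior  = {"fair (p=0.5)": 0.5, "loaded (p=0.8)": 0.5}

# Likelihood P(D | M_i) for each model.
likelihood = {m: stats.binom.pmf(heads, flips, p) for m, p in biases.items()}

# P(D) = sum_i P(D | M_i) P(M_i); then Bayes' rule gives P(M_i | D).
p_data = sum(likelihood[m] * prior[m] for m in biases)
posterior = {m: likelihood[m] * prior[m] / p_data for m in biases}

print(posterior)   # the loaded-coin model ends up with most of the posterior weight
```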
In this framework, we don’t have absolute, objective levels of confidence in our theories. All that is absolute and objective is how the data should change our confidence in the various theories. We can’t just reject a theory if the data don’t match it well, unless we have a better alternative theory to switch to. In many cases the models can be continuously indexed, with the index corresponding to a parameter in a unified model; then this becomes parameter estimation. We get a range of theories with probability densities instead of probabilities, or equivalently one theory with a probability density on a parameter, and getting new data mechanically turns a crank that gives us a new probability density on that parameter.
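And here is the crank being turned for the continuously indexed case, as a grid approximation to the posterior over a coin’s bias; the flat prior and the two batches of data are again just placeholders:

```python
import numpy as np

# A continuum of models indexed by a coin's bias p, discretized to a grid.
p_grid = np.linspace(0.001, 0.999, 999)
prior = np.full_like(p_grid, 1.0 / len(p_grid))   # flat prior over the parameter

def update(prior_probs, heads, tails):
    """Turn the crank: multiply by the likelihood and renormalize."""
    likelihood = p_grid**heads * (1.0 - p_grid)**tails
    posterior = prior_probs * likelihood
    return posterior / posterior.sum()

posterior = update(prior, heads=8, tails=2)       # first batch of data
posterior = update(posterior, heads=3, tails=7)   # new data: turn the crank again

print(p_grid[np.argmax(posterior)])   # posterior mode, ~0.55 (11 heads in 20 flips total)
```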
There are a couple of unsatisfying bits here.

First, it really would be nice to be able to say “this theory is ridiculous because it doesn’t explain the data” without any reference to any other theory. But if we know it’s the only theory in town, we don’t have a choice. If it’s not the only theory in town, then how useful it is can only coherently be measured relative to how useful the other theories are.

Second, we need to give “prior probabilities” to our various theories, and the math doesn’t give any direct justification for what these should be. However, as long as they aren’t crazy, the incoming data will continually update them, so that the models that seem more useful get weighted as more useful, and the ones that don’t get weighted as less useful. This of course means we need a reasonable space of theories to work over, and we’ll only pick a good model if there is a good model in that space. If you eventually realize “hey, all these models are crappy”, there is no good way of expanding the set of models you’re willing to consider; a common workaround is to just start over with an expanded model space and reallocated prior probabilities. You can’t pretend that the first analysis was over a subset of this new analysis, because the rescaling due to the P(D) term depends on the set of models you have. (Though you can handwave that you weren’t actually calculating P(M_i|D) but P(M_i|D, {M}): the probability of each model given the data, assuming it was one of these models.)
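To see how the P(D) rescaling ties the answer to the particular model set {M}, compare the posterior over two coin models with the posterior once a third is bolted on (same made-up setup as above):

```python
import numpy as np
from scipy import stats

heads, flips = 8, 10

def posterior_over(biases):
    """P(M_i | D, {M}) for coin-bias models, with equal priors over the given set."""
    likelihoods = np.array([stats.binom.pmf(heads, flips, p) for p in biases])
    priors = np.full(len(biases), 1.0 / len(biases))
    weights = likelihoods * priors
    return weights / weights.sum()       # the P(D) rescaling depends on the whole set

print(posterior_over([0.5, 0.8]))        # roughly [0.13, 0.87]
print(posterior_over([0.5, 0.8, 0.7]))   # add one model and every number gets rescaled
```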
A sometimes useful shortcut, rather than working directly with the probabilities (and hence needing the rescaling), is to work with the likelihoods, or more tractably their logs. The difference of the log likelihoods of two different theories on the same data is a reasonable measure of how much that data should affect their relative ranking. But any given likelihood by itself doesn’t mean much; only comparison with the rest of the set tells you anything useful.
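The same comparison in log-likelihood terms, where only the difference between the two numbers carries meaning (again with the made-up coin models from above):

```python
from scipy import stats

heads, flips = 8, 10

log_like_fair   = stats.binom.logpmf(heads, flips, 0.5)
log_like_loaded = stats.binom.logpmf(heads, flips, 0.8)

# Each number alone means little; the difference is what should shift the two
# models' relative ranking (with equal priors it is the log Bayes factor).
print(log_like_loaded - log_like_fair)   # about +1.93 nats in favour of the loaded coin
```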
Very nice! I’d only replace “useful” with “plausible”. (Sure, it’s hard to define plausibility, but usefulness is not really the right concept.)

“Usefulness” certainly isn’t the orthodox Bayesian phrasing. I call myself a Bayesian because I recognize that Bayes’s Rule is the right thing to use in these situations. Whether or not the probabilities assigned to hypotheses “actually are” probabilities (whatever that means), they should obey the same mathematical rules of calculation as probabilities.
But precisely because only the manipulation rules matter, I’m not sure it is worth emphasizing that “to be a good Bayesian” you must accord these probabilities the same status as other probabilities. A hardcore Frequentist is not going to be comfortable doing that. Heck, I’m not sure I’m comfortable doing that. Data and event probabilities are things that can eventually be “resolved” to true or false, by looking after the fact. Probability as plausibility makes sense for these things.
But for hypotheses and models, I ask myself “plausibility of what? Being true?” Almost certainly, the “real” model (when that even makes sense) isn’t in our space of models. For example, a common, almost necessary assumption is exchangeability: that any permutation of the data is equally likely, effectively that all data points are drawn from the same distribution. Data often don’t behave like that; they drift over time. Coins being tossed develop wear, and cards being shuffled and dealt get bent.
I really do prefer to think of some models as being more or less useful. Of course, following this path shades into decision theory: we might want to assign priors according to how “tractable” the models are, both in specification (stupid models that just specify what the data will be take lots of specification, so should have lower initial probabilities) and in computation. Models that take longer to compute data probabilities should similarly have a probability penalty, not simply because they’re implausible, but because we don’t want to use them unless the data force us to.
...shades into decision theory...Models that take longer to compute data probabilities should similarly have a probability penalty, not simply because they’re implausible, but because we don’t want to use them unless the data force us to.
Whoa! That sounds dangerous! Why not keep the beliefs and costs separate and only apply this penalty at the decision theory stage?
Well, I did say it shades into decision theory...
Yes, it absolutely is dangerous, and thinking about it more I agree it should not be done this way. Probability penalties do not scale correctly with the data collected: they’re essentially just a fixed offset. A modified utility for using a particular method really is a different thing. If a method is unusable, we shouldn’t use it, and methods that trade off accuracy for manageability should be decided at that level, once we can actually judge the accuracy, not earlier.
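The scaling problem is easy to see numerically: in log space a prior penalty is one constant term, while the log-likelihood contribution grows with every data point, so any fixed penalty eventually gets swamped. Toy numbers, assuming i.i.d. data and a constant per-point edge for the penalized model:

```python
import numpy as np

log_prior_penalty = np.log(0.01)   # a harsh "tractability" penalty: a factor-of-100 handicap
per_point_edge = 0.05              # model B fits each data point slightly better, on average

for n in [10, 100, 1000]:
    # Log-posterior-odds contribution (B minus A): a fixed offset plus a term growing with n.
    print(n, log_prior_penalty + n * per_point_edge)
# Negative at n = 10, already positive by n = 100: the fixed penalty gets swamped.
```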
EDIT: I suppose I was hoping for a valid way of justifying the fact that we throw out models that are too hard to use or analyze—they never make it into our set of hypotheses in the first place. It’s amazing how often conjugate priors “just happen” to be chosen...
But for hypotheses and models, I ask myself “plausibility of what? Being true?”
Plausibility of being true given the prior information. Just as Aristotelian logic gives valid arguments (but not necessarily sound ones), Bayes’s theorem gives valid but not necessarily sound plausibility assessments.
following this path shades into decision theory
That’s pretty much why I wanted to make the distinction between plausibility and usefulness. One of the things I like about the Cox-Jaynes approach is that it cleanly splits inference and decision-making apart.
Plausibility of being true given the prior information.
Okay, sure, we can go back to the Bayesian mantra of “all probabilities are conditional probabilities”. But our prior information effectively includes the statement that one of our models is the “true one”. And that’s never the actual case, so our arguments are never sound in this sense, because we are forced to work from prior information that isn’t true. This isn’t a huge problem, but it does somewhat undermine the motivation for finding these probabilities and treating them seriously: they’re conditional probabilities being applied in a case where we know that what is being conditioned on is false. What grounds them to our actual situation? I like to take the stance that in practice this is still useful, as an approximation procedure, for sorting through models that are approximately right.
And that’s never the actual case, so our arguments are never sound in this sense, because we are forced to work from prior information that isn’t true.
One does generally resort to non-Bayesian model checking methods. Andrew Gelman likes to include such checks under the rubric of “Bayesian data analysis”; he calls the computing of posterior probabilities and densities “Bayesian inference”, a preceding subcomponent of Bayesian data analysis. This makes for sensible statistical practice, but the underpinnings aren’t strong. One might consider it an attempt to approximate the Solomonoff prior.
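For readers who haven’t seen them, the checks Gelman has in mind are along the lines of posterior predictive checks: simulate replicated data sets from the fitted model and ask whether some statistic of the real data looks out of place among them. A minimal sketch with a made-up Beta-Bernoulli coin example (none of the specifics below come from the discussion itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed data: 20 coin flips that alternate suspiciously often.
data = np.array([1, 0] * 10)

def num_switches(x):
    """Test statistic: how many times consecutive flips differ."""
    return int(np.sum(x[1:] != x[:-1]))

# Posterior for the bias under a flat Beta(1, 1) prior is Beta(1 + heads, 1 + tails).
heads, tails = data.sum(), len(data) - data.sum()

# Draw replicated data sets from the posterior predictive and compare the statistic.
replicated = []
for _ in range(5000):
    p = rng.beta(1 + heads, 1 + tails)
    replicated.append(num_switches(rng.binomial(1, p, size=len(data))))

observed = num_switches(data)
tail_prob = np.mean(np.array(replicated) >= observed)
print(observed, tail_prob)   # tail_prob near 0: the iid coin model misses the alternation
```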
Yes, in practice people resort to less motivated methods that work well.
I’d really like to see some principled answer that has the same feel as Bayesianism though. As it stands, I have no problem using Bayesian methods for parameter estimation. This is natural because we really are getting pdf(parameters | data, model). But for model selection and evaluation (i.e. non-parametric Bayes) I always feel that I need an “escape hatch” to include new models that the Bayes formalism simply doesn’t have any place for.
I feel the same way.

Models that take longer to compute data probabilities should similarly have a probability penalty, not simply because they’re implausible, but because we don’t want to use them unless the data force us to.
I am much more comfortable leaving probability as it is but using a different term for usefulness.