But my understanding is that there are also frequentist methods that have no reasonable Bayesian interpretation (for instance because they don’t satisfy coherence—http://en.wikipedia.org/wiki/Coherence_(philosophical_gambling_strategy)) but have a rigorous guarantee on performance. Unfortunately, I can’t think of any good examples off the top of my head, although Jordan gave SVMs as one; I don’t know enough about them to know if that is actually a reasonable example or not.
But my understanding is that there are also frequentist methods that have no reasonable Bayesian interpretation
This is impossible. See Searching for Bayes-Structure. It may be difficult to find a reasonable Bayesian interpretation, and it may only approximate said interpretation, but if it’s at all useful, it will have one.
It may be difficult to find a reasonable Bayesian interpretation, and it may only approximate said interpretation, but if it’s at all useful, it will have one.
Observation: This theory that you’ve stated here—that any useful frequentist method will have a Bayesian interpretation—doesn’t serve much in the way of controlled anticipation. Because there is so much flexibility in choosing priors and a loss function, the fact that “every useful frequentist method will be a Bayes method in disguise” doesn’t tell us much about what frequentist methods will turn out to be useful.
It seems to me that the wisdom of treating beliefs as anticipation-controllers is more general, and I think more important, than the choice between Bayesian and frequentist inference methods. Each school has its own heuristics for quantifying this wisdom.
As for Bayesian vs Frequentist interpretations of what the word “probability” means, I think that’s a different (and sillier) debate.
This theory that you’ve stated here—that any useful frequentist method will have a Bayesian interpretation—doesn’t serve much in the way of controlled anticipation.
A frequentist tool only works insomuch as it approximates a Bayesian approach. As such, given the domain in which it works well, you can prove that it approximates the Bayesian answer.
For example, if you’re trying to estimate the probability of a repeatable event ending in success, the frequentist method says to use successes/total. The Bayesian approach with a Jeffreys prior gives (successes + 0.5)/(total + 1). It can be shown that, with a sufficient number of successes and failures, these work out similarly. It’s well known that with very few successes or very few failures, the frequentist version doesn’t work very well.
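The comparison above can be checked numerically. This is a minimal sketch (the function names and the toy counts are my own) comparing the frequentist relative-frequency estimate with the posterior mean under a Jeffreys Beta(1/2, 1/2) prior, which yields the (s + 0.5)/(n + 1) rule quoted above:

```python
# Frequentist MLE vs. Bayesian posterior-mean estimate of a success
# probability. The Jeffreys prior Beta(1/2, 1/2) gives (s + 0.5)/(n + 1).

def mle(successes, total):
    """Frequentist estimate: the observed relative frequency."""
    return successes / total

def jeffreys_mean(successes, total):
    """Posterior mean under a Jeffreys Beta(1/2, 1/2) prior."""
    return (successes + 0.5) / (total + 1)

# With plenty of data the two estimates nearly coincide ...
print(mle(700, 1000))            # 0.7
print(jeffreys_mean(700, 1000))  # ≈0.6998

# ... but with zero observed successes the MLE collapses to a hard 0,
# while the Bayesian estimate stays off the boundary.
print(mle(0, 3))             # 0.0
print(jeffreys_mean(0, 3))   # 0.125
```

The boundary case is exactly the "very few successes" regime where the frequentist version misbehaves: it assigns probability zero to an event that may simply not have occurred yet.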
This is false (as explained in the linked-to video). If nothing else, the frequentist answer depends on the loss function (as does the Bayesian answer, although the posterior distribution is a way of summarising the answer simultaneously for all loss functions).
I think you’re taking the frequentist interpretation of what a probability is and trying to forcibly extend it to the entire frequentist decision theory. As far as the “frequentist interpretation of probability” goes, I have never met a single statistician who even explicitly identified “probabilities as frequencies” as a possible belief to hold, much less claimed to hold it themselves. As far as I can tell, this whole “probabilities as frequencies” thing is unique to LessWrong.
Everyone I’ve ever met who identified as a frequentist meant “not strictly Bayesian”. Likewise, whenever a method was identified as frequentist, it meant either that the method was not strictly Bayesian or that it adopted the decision theory described in Michael Jordan’s lecture.
In fact, the frequentist approach (not as you’ve defined it, but as the term is actually used by statisticians) is used precisely because it works extremely well in certain circumstances (for instance, cross-validation). This is, I believe, what Mike is arguing for when he says that a mix of Bayesian and frequentist techniques is necessary.
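Cross-validation is a good illustration of a frequentist tool in this practical sense: it estimates out-of-sample error directly from the data, with no prior or posterior anywhere in the procedure. A minimal sketch (the toy data and the two candidate predictors are invented for illustration):

```python
# k-fold cross-validation: estimate out-of-sample squared error by
# repeatedly fitting on k-1 folds and scoring on the held-out fold.

def k_fold_cv(data, fit, predict, k=5):
    """Mean squared prediction error estimated by k-fold cross-validation."""
    folds = [data[i::k] for i in range(k)]
    total_error, count = 0.0, 0
    for i in range(k):
        held_out = folds[i]
        training = [row for j, f in enumerate(folds) if j != i for row in f]
        model = fit(training)
        for x, y in held_out:
            total_error += (predict(model, x) - y) ** 2
            count += 1
    return total_error / count

# Toy data: y is roughly 2*x with a small alternating perturbation.
data = [(x, 2 * x + (-1) ** x * 0.1) for x in range(20)]

# Candidate 1: predict with a slope estimated from the training folds.
fit_slope = lambda rows: sum(y for _, y in rows) / sum(x for x, _ in rows)
predict_slope = lambda slope, x: slope * x

# Candidate 2: always predict the training mean, ignoring x.
fit_mean = lambda rows: sum(y for _, y in rows) / len(rows)
predict_mean = lambda mean, x: mean

print(k_fold_cv(data, fit_slope, predict_slope))  # small error
print(k_fold_cv(data, fit_mean, predict_mean))    # much larger error
```

The procedure ranks the candidates by an empirical performance estimate; whether one also wants to read a Bayesian interpretation into it is a separate question from whether it works.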
Thanks for the link. That is a good point. I agree that every useful method has to have some amount of information-theoretic overlap with Bayes, but that overlap could be small and still be useful; we reach most conclusions only after there is overwhelming evidence in favor of them, so one could do as well as humans while only having a small amount of mutual information with proper Bayesian updating (or indeed without ever even working with a Bayesian model).