It’s also not clear to me that Bayesianism is correct in the ideal sense (or even what that means)
Interesting. Do you accept that by Cox’s theorems, probability theory is the normative theory of epistemology? Do you accept that a “Bayesian” method based on explicitly approximating ideal probability theory will always give a more accurate answer? Do you accept that each of the examples above works because and to the extent that it (nonexplicitly) approximates the correct probability-theory answer (the Bayes-structure argument)?
(As for how they do, we can put them in Bayesian terms to see. Maximum likelihood methods assume a flat improper prior and report the mode of the resulting probability distribution. We can immediately see that building in the prior disallows aggregation of different information sources. Only reporting the mode hides the confidence interval and goes way off in the presence of skew. Also, we can’t apply safety factors sensibly (they involve a utility calculation, which involves confidence intervals at the least).)
I don’t know much about SVM and PCA, but Bayesian logistic regression is easy and superior to max likelihood for most things.
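The claim above, that maximum likelihood amounts to a flat prior plus reporting the mode, and that the mode can mislead under skew, can be checked numerically. The following is only a minimal sketch under assumed numbers: the data (1 success in 10 Bernoulli trials), the grid resolution, and all variable names are illustrative and not taken from the discussion.

```python
# Hypothetical data: k successes in n Bernoulli trials (illustrative numbers only).
k, n = 1, 10

# Grid over the success probability p, excluding the endpoints 0 and 1.
grid = [i / 10000 for i in range(1, 10000)]

def likelihood(p):
    """Binomial likelihood up to a constant factor."""
    return p**k * (1 - p)**(n - k)

# With a flat prior, the unnormalized posterior is just the likelihood.
weights = [likelihood(p) for p in grid]
total = sum(weights)
posterior = [w / total for w in weights]

# Maximum likelihood estimate in closed form.
mle = k / n

# Posterior mode and mean computed on the grid.
posterior_mode = grid[max(range(len(grid)), key=lambda i: posterior[i])]
posterior_mean = sum(p * q for p, q in zip(grid, posterior))

print("MLE:", mle)
print("posterior mode (flat prior):", posterior_mode)
print("posterior mean (flat prior):", posterior_mean)
```

In this setup the posterior mode coincides with the maximum likelihood estimate k/n, while the posterior mean sits noticeably higher (close to 1/6), which is the kind of skew a mode-only report hides.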
Do you accept that by Cox’s theorems, probability theory is the normative theory of epistemology?
Not Cox’s theorem, although the complete class theorem is more convincing (as are Dutch book arguments).
Do you accept that a “Bayesian” method based on explicitly approximating ideal probability theory will always give a more accurate answer?
Only in the very weak sense that by the complete class theorem there exists a Bayesian method (or a limit of Bayesian methods) that does at least as well as whatever you’re doing. So sure, if you really had infinite computational resources then you could find such a method and use it...but I think that has almost no bearing on practice. Certainly I think there are many situations where a prior is unavailable.
Do you accept that each of the examples above works because and to the extent that it (nonexplicitly) approximates the correct probability-theory answer (the Bayes-structure argument)?
Almost certainly not, although maybe we should taboo “because”. First of all, the “correct” probability-theory answer is not well-defined, because the choices of prior and likelihood are both completely unconstrained. Secondly, I think the choice of whether to be Bayesian or frequentist is not nearly as important as, e.g., the choice of likelihood function.
We can immediately see that building in the prior disallows aggregation of different information sources.
I don’t think the prior is what allows aggregation of different information sources; you can do transfer learning with vanilla logistic regression if you choose the right set of features.
Only reporting the mode hides confidence interval and goes way off in the presence of skew.
I agree with this, although “being Bayesian” is neither necessary nor sufficient to deal with it (but would probably help on average).
Bayesian logistic regression is easy and superior to max likelihood for most things.
What do you mean by “Bayesian logistic regression”?
Can you recommend an explanation of the complete class theorem(s)? Preferably online. I’ve been googling pretty hard and I’ve turned up almost nothing. I’d like to understand what conditions they start from (suspecting that maybe the result is not quite as strong as “Bayes Rules!”). I’ve found only one paper, which basically said “what Wald proved is extremely difficult to understand, and probably not what you wanted.”
Maybe try this one? Let me know if that helps or if you’re looking for something different.
The complete class theorem states, informally: any Pareto optimal decision rule is a Bayesian decision rule (i.e. it can be obtained by choosing some prior, observing data, and then maximizing expected utility relative to the posterior).
Roughly, the argument is that if I have a collection W of possible worlds that I could be in, and a value U(w) to taking a particular action in world w, then any Pareto optimal strategy implicitly assigns an “importance” p(w) to each world, and takes the action that maximizes the sum of p(w)*U(w). We can then show that this is equivalent to using the Bayesian decision rule with p(w) as the prior over W. The main thing needed to formalize this argument is the separating hyperplane theorem, which is what the linked paper does.
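The informal argument above can be illustrated on a toy problem. The sketch below is a simplification under stated assumptions: two worlds, four pure actions with made-up utilities, no observed data, and no randomized strategies (the actual theorem handles mixtures and conditioning on data). The names `UTILITIES`, `best_actions`, and `is_dominated` are hypothetical. It sweeps over weightings p(w) and records which actions ever maximize the sum of p(w)*U(w).

```python
# Hypothetical utilities: UTILITIES[action] = (value in world 0, value in world 1).
# All names and numbers here are illustrative assumptions.
UTILITIES = {
    "A": (10.0, 0.0),
    "B": (0.0, 10.0),
    "C": (6.0, 6.0),
    "D": (4.0, 4.0),  # worse than C in every world
}

def expected_utility(action, prior):
    """Sum of p(w) * U(w) for one action."""
    return sum(p * u for p, u in zip(prior, UTILITIES[action]))

def best_actions(prior, tol=1e-12):
    """Actions that maximize expected utility under the given prior (ties kept)."""
    scores = {a: expected_utility(a, prior) for a in UTILITIES}
    top = max(scores.values())
    return {a for a, s in scores.items() if top - s <= tol}

def is_dominated(action):
    """True if another action is at least as good in every world and better in one."""
    u = UTILITIES[action]
    return any(
        all(x >= y for x, y in zip(v, u)) and any(x > y for x, y in zip(v, u))
        for other, v in UTILITIES.items()
        if other != action
    )

# Sweep priors (p, 1 - p) over a grid and record which actions are ever optimal.
ever_best = set()
for i in range(101):
    p = i / 100
    ever_best |= best_actions((p, 1 - p))

dominated = {a for a in UTILITIES if is_dominated(a)}

print("optimal for some prior:", sorted(ever_best))
print("dominated:", sorted(dominated))
```

In this toy case each undominated action is the expected-utility maximizer for some weighting, and the dominated action never is. That is the finite, data-free shadow of the statement; it is not a proof, and with pure actions only the correspondence can fail in other examples, which is why the real argument works with mixtures and the separating hyperplane theorem.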
Does the complete class theorem thus provide what Peterson (2004) and Easwaran (unpublished) think is missing in classical axiomatic decision theory: namely, a justification for choosing a prior, observing data, and then maximizing expected utility relative to the posterior?
Thank you very much!