Thanks a lot for the thoughtful comment. I’ve included some of my own thoughts, along with a few clarifications, below.
First, for the non-probabilistically guaranteed methods, on the point that these are relatively few and far between:
Do you think that online learning methods count as an example of this?
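To make that concrete: the standard regret bound for, say, online gradient descent is a worst-case statement over loss sequences rather than a probability statement. For any sequence of convex losses $f_1, \dots, f_T$ with gradients bounded by $G$, and a comparator set $\mathcal{W}$ of diameter $D$,

$$\sum_{t=1}^{T} f_t(w_t) - \min_{w \in \mathcal{W}} \sum_{t=1}^{T} f_t(w) \le O\!\big(GD\sqrt{T}\big),$$

with no assumption about how the sequence was generated. That is the sense in which I’d call it non-probabilistic.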
Second, on the objection that when strong prior information exists but is difficult to incorporate into the method, the results of the method can “leave money on the table”, as it were:
I think this is a valid objection, but I’ll make two partial counter-arguments. The first is that, arguably, there may be some information that is not easy to incorporate as a prior but is easy to incorporate under some sort of minimax formalism. So Bayes may be forced to leave money on the table in the same way.
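As a sketch of what I have in mind (a generic minimax formulation, just for illustration): side information often enters as a constraint set rather than as a prior, e.g.

$$\min_{\hat{\theta}} \; \max_{\theta \in \Theta_0} \; \mathbb{E}_{X \sim P_\theta}\big[L(\hat{\theta}(X), \theta)\big],$$

where $\Theta_0$ encodes statements like “the parameter has bounded norm” or “the signal is monotone” that are easy to write down as constraints but awkward to turn into a specific prior.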
A more concrete response is that, often, an appropriate regularizer can incorporate similar information to what a prior would incorporate. I think the regularizer that I exhibited in Myth 6 is one example of this.
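The textbook instance of this correspondence (not the Myth 6 regularizer specifically, just the simplest case) is ridge regression:

$$\hat{\theta} = \arg\min_{\theta} \; \|y - X\theta\|_2^2 + \lambda \|\theta\|_2^2$$

is exactly the MAP estimate under the likelihood $y \mid \theta \sim \mathcal{N}(X\theta, \sigma^2 I)$ and the prior $\theta \sim \mathcal{N}(0, \tau^2 I)$, with $\lambda = \sigma^2 / \tau^2$. The prior belief “the coefficients are probably small” shows up on the other side as an $\ell_2$ penalty.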
For probabilistically-guaranteed methods...
I think it’s important to distinguish between two (or maybe three) different types of probabilistic guarantees; I’m not sure whether you would consider all of the below “probabilistic” or whether some of them count as non-probabilistic, so I’ll elaborate on each type.
The first, which I presume is what you are talking about, is when the probability is due to some assumed distribution over nature. In this case, if I’m willing to make such an assumption, then I’d rather just go the full-on Bayesian route, unless there’s some compelling reason like computational tractability to eschew it. And indeed, there exist cases where, given distributional assumptions, we can infer the parameters efficiently using a frequentist estimation technique, while the Bayesian analog runs into NP-hardness obstacles, at least in some regimes. But there are other instances where the Bayesian method is far cheaper computationally than the go-to frequentist technique for the same problem (e.g. generative vs. discriminative models for syntactic parsing), so I bring up the NP-hardness case only as one example, not as a general pattern.
The second type of guarantee is in terms of randomness generated by the algorithm, without making any assumptions about nature (other than that we have access to a random number generator that is sufficiently independent from what we are trying to predict). I’m pretty happy with this sort of guarantee, since it requires fairly weak epistemic commitments.
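A toy illustration of what I mean (generic, not from the post): suppose I want to approximate the average $I = \frac{1}{N}\sum_{i=1}^{N} f(x_i)$ over a large fixed dataset, and I do so by drawing $n$ indices uniformly at random (with replacement) and averaging $f$ over those draws to get $\hat{I}$. If $f$ takes values in $[0,1]$, Hoeffding’s inequality gives

$$\Pr\big(|\hat{I} - I| \ge \epsilon\big) \le 2\exp(-2n\epsilon^2),$$

where the probability is entirely over my own random draws; the data themselves are treated as an arbitrary fixed object.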
The third type of guarantee is somewhat in the middle: it is given by a partial constraint on the distribution. As an example, maybe I’m willing to assume knowledge of certain moments of the distribution. For sufficiently few moments, I can estimate them all accurately from empirical data, and I can even bound the estimation error with high probability, making no assumption other than independence of my samples. In this case, as long as I’m okay with making the independence assumption, I consider this guarantee to be pretty good as well (provided I can bound the error that the inexact moment estimates introduce into the method, and there are good techniques for doing this). I think the epistemic commitments for this type of method are, modulo the independence assumption, not really any stronger than those for the second type of method, so I’m also fairly okay with this case.
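For the first moment, for instance, Chebyshev’s inequality already gives a bound of this flavor: if $X_1, \dots, X_n$ are independent draws with mean $\mu$ and variance $\sigma^2$, then

$$\Pr\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| \ge \epsilon\right) \le \frac{\sigma^2}{n\epsilon^2},$$

and the same argument applied to $X_i^k$ handles the $k$-th moment (assuming the $2k$-th moment is finite). Sharper, exponential bounds are available under mild extra conditions such as boundedness, but even this version uses nothing beyond independence and finite variance.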
If you can cook up examples of this, that would be helpful.