I’ve been thinking about what program, exactly, is being defended here, and I think a good name for it might be “prior-less learning”.
To me, all procedures under the prior-less umbrella have a “minimax optimality” feel to them. Some approaches search for explicitly minimax-optimal procedures; but even more broadly, all such approaches aim to secure guarantees (possibly probabilistic) that the worst-case performance of a given procedure is as limited as possible within some contemplated set of possible states of the world. I have a couple of things to say about such ideas.
First, for the non-probabilistically guaranteed methods: these are relatively few and far between, and for any such procedure it must be ensured that the loss that is being guaranteed is relevant to the problem at hand. That said, there is only one possible objection to them, and it is the same as one of my objections to prior-less probabilistically guaranteed methods. That objection applies generically to the minimaxity of the prior-less learning program: when strong prior information exists but is difficult to incorporate into the method, the results of the method can “leave money on the table”, as it were. Sometimes this can be caught and fixed, generally in a post hoc and ad hoc way; sometimes not.
For probabilistically-guaranteed methods, there is an epistemic gap—in principle—in going from the properties of such procedures in classes of repeating situations (i.e., pre-data claims about the procedure) to well-warranted claims in the cases at hand (i.e., post-data claims about the world). But it’s obvious that this is merely an in-principle objection—after all, many such techniques can be and have been successfully applied to learn true things about the world. The important question is then: does the heretofore implicit principle justifying the bridging of this gap differ significantly from the principle justifying Bayesian learning?
Thanks a lot for the thoughtful comment. I’ve included some of my own thoughts below, along with some clarifications.
First, for the non-probabilistically guaranteed methods: these are relatively few and far between
Do you think that online learning methods count as an example of this?
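(For concreteness, here's a minimal sketch of the kind of guarantee I have in mind, the Hedge / multiplicative-weights algorithm; its regret bound holds for every sequence of bounded losses, with no distributional or randomness assumptions. The code and the constant are standard textbook material, included here only as an illustration.)

```python
import math

def hedge(loss_rounds, eta):
    """Hedge / multiplicative weights over N experts.

    loss_rounds: list of length-N lists of losses in [0, 1].
    Returns (learner_loss, best_expert_loss). The regret bound
    learner_loss - best_expert_loss <= ln(N)/eta + eta*T/8
    holds for every loss sequence; nothing about it is probabilistic.
    """
    n = len(loss_rounds[0])
    weights = [1.0] * n
    learner_loss = 0.0
    cum_loss = [0.0] * n
    for losses in loss_rounds:
        total = sum(weights)
        probs = [w / total for w in weights]
        # the learner's loss this round is its weighted-average loss
        learner_loss += sum(p * l for p, l in zip(probs, losses))
        for i, l in enumerate(losses):
            cum_loss[i] += l
            weights[i] *= math.exp(-eta * l)
    return learner_loss, min(cum_loss)

# Example with an arbitrary (even adversarial-looking) loss sequence.
T, N = 1000, 10
rounds = [[(t * (i + 1)) % 2 for i in range(N)] for t in range(T)]
learner, best = hedge(rounds, eta=math.sqrt(8 * math.log(N) / T))
print(learner - best, math.sqrt(T * math.log(N) / 2))  # regret <= sqrt(T ln N / 2)
```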
when strong prior information exists but is difficult to incorporate into the method, the results of the method can “leave money on the table”, as it were
I think this is a valid objection, but I’ll make two partial counter-arguments. The first is that, arguably, there may be some information that is not easy to incorporate as a prior but is easy to incorporate under some sort of minimax formalism. So Bayes may be forced to leave money on the table in the same way.
A more concrete response is that, often, an appropriate regularizer can incorporate similar information to what a prior would incorporate. I think the regularizer that I exhibited in Myth 6 is one example of this.
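To make the regularizer/prior correspondence concrete, here is one standard example (not necessarily the same construction as in Myth 6): L2-regularized least squares returns exactly the MAP estimate under a Gaussian prior on the weights, so a belief like "the weights are probably small" can be encoded either way.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
sigma = 0.5                      # observation noise std, assumed known here
y = X @ w_true + sigma * rng.normal(size=n)

tau = 1.0                        # prior belief: w ~ N(0, tau^2 I), i.e. "weights are small"
lam = sigma**2 / tau**2          # the equivalent ridge penalty

# Ridge regression: argmin_w ||y - Xw||^2 + lam * ||w||^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# MAP estimate under the Gaussian prior (also the posterior mean here)
w_map = np.linalg.solve(X.T @ X / sigma**2 + np.eye(d) / tau**2,
                        X.T @ y / sigma**2)

print(np.allclose(w_ridge, w_map))  # True: same estimator, two descriptions
```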
For probabilistically-guaranteed methods...
I think it’s important to distinguish between two (or maybe three) different types of probabilistic guarantees; I’m not sure whether you would consider all of the below “probabilistic” or whether some of them count as non-probabilistic, so I’ll elaborate on each type.
The first, which I presume is what you are talking about, is when the probability is due to some assumed distribution over nature. In this case, if I’m willing to make such an assumption, then I’d rather just go the full-on Bayesian route, unless there’s some compelling reason like computational tractability to eschew it. And indeed, there exist cases where, given distributional assumptions, we can infer the parameters efficiently using a frequentist estimation technique, while the Bayesian analog runs into NP-hardness obstacles, at least in some regimes. But there are other instances where the Bayesian method is far cheaper computationally than the go-to frequentist technique for the same problem (e.g. generative vs. discriminative models for syntactic parsing), so I only mean to bring this up as an example.
The second type of guarantee is in terms of randomness generated by the algorithm, without making any assumptions about nature (other than that we have access to a random number generator that is sufficiently independent from what we are trying to predict). I’m pretty happy with this sort of guarantee, since it requires fairly weak epistemic commitments.
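As one concrete illustration of this second type (my example, chosen for simplicity): Freivalds' randomized check of a matrix product. The failure probability is computed purely over the algorithm's own coin flips and holds for every possible input, so nothing about nature needs to be assumed.

```python
import random

def freivalds_check(A, B, C, trials=20):
    """Randomized check of whether A @ B == C for n x n matrices.

    If A @ B == C it always accepts; otherwise each trial catches the
    discrepancy with probability >= 1/2, so the error probability is at
    most 2**(-trials). That probability is over the random vectors drawn
    below, not over any assumed distribution of the inputs.
    """
    n = len(A)
    for _ in range(trials):
        r = [random.randint(0, 1) for _ in range(n)]
        Br = [sum(B[i][j] * r[j] for j in range(n)) for i in range(n)]
        ABr = [sum(A[i][j] * Br[j] for j in range(n)) for i in range(n)]
        Cr = [sum(C[i][j] * r[j] for j in range(n)) for i in range(n)]
        if ABr != Cr:
            return False  # definitely not equal
    return True           # equal, except with probability <= 2**(-trials)

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(freivalds_check(A, B, [[19, 22], [43, 50]]),   # correct product
      freivalds_check(A, B, [[19, 22], [43, 51]]))   # off by one entry
```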
The third type of guarantee is somewhat in the middle: it is given by a partial constraint on the distribution. As an example, maybe I’m willing to assume knowledge of certain moments of the distribution. For sufficiently few moments, I can estimate them all accurately from empirical data, and I can even bound the error with high probability, making no assumption other than independence of my samples. In this case, as long as I’m okay with making the independence assumption, I consider this guarantee to be pretty good as well (as long as I can bound the error introduced into the method by the inexact estimation of the moments, and there are good techniques for doing so). I think the epistemic commitments for this type of method are, modulo the independence assumption, not really any stronger than those for the second type of method, so I’m also fairly okay with this case.
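Here is a minimal sketch of the simplest instance of this third type, with boundedness of the samples added as an extra illustrative assumption so that Hoeffding's inequality applies: estimate the first moment empirically and attach a high-probability error bound that uses nothing beyond independence and the known range.

```python
import math
import random

def mean_with_hoeffding_bound(samples, delta, lo=0.0, hi=1.0):
    """Empirical first moment plus a (1 - delta)-confidence radius.

    Assumes only that the samples are independent draws of a quantity
    known to lie in [lo, hi]; Hoeffding's inequality then gives
    P(|mean_hat - true_mean| > eps) <= delta, with no further
    assumptions about the distribution.
    """
    n = len(samples)
    mean_hat = sum(samples) / n
    eps = (hi - lo) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return mean_hat, eps

random.seed(0)
data = [random.betavariate(2, 5) for _ in range(5000)]   # some unknown distribution on [0, 1]
m, eps = mean_with_hoeffding_bound(data, delta=0.05)
print(f"first moment in [{m - eps:.3f}, {m + eps:.3f}] with prob >= 0.95")
# (for reference, the true mean of Beta(2, 5) is 2/7, roughly 0.286)
```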
For probabilistically-guaranteed methods, there is an epistemic gap—in principle—in going from the properties of such procedures in classes of repeating situations (i.e., pre-data claims about the procedure) to well-warranted claims in the cases at hand (i.e., post-data claims about the world).
Well, if you believe post-data probabilities reflect real knowledge, then that’s a start, because you can think of pre-data probabilities as more conservative versions of post-data probabilities. That is, if pre-data calculations tell you to be sure of something, you can probably be at least that sure, post-data.
The example that’s guiding me here is a confidence interval. When you derive a confidence interval, you’re really calculating the probability that some parameter of interest R will be between two estimators E1 and E2.
$P(E_1 < R < E_2) = 0.95$
Post-data, you just calculate E1 and E2 from the data and call that your 95% confidence interval. So you’re still using the pre-data probability that R is between those two estimators.
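For instance, here is a small simulation (illustrative numbers of my own) of the textbook normal-mean interval with known variance; it checks the pre-data statement that the random interval [E1, E2] covers R about 95% of the time.

```python
import random

random.seed(1)
R = 3.7                        # the fixed but unknown parameter (known only to the simulation)
sigma, n, z = 2.0, 25, 1.96    # known noise sd, sample size, 95% normal quantile

reps, covered = 10000, 0
for _ in range(reps):
    xs = [random.gauss(R, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    e1 = xbar - z * sigma / n ** 0.5   # E1, computed from the data
    e2 = xbar + z * sigma / n ** 0.5   # E2, computed from the data
    covered += (e1 < R < e2)

print(covered / reps)   # close to 0.95: the pre-data probability that E1 < R < E2
```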
I know of two precise senses in which the pre-data probabilities are conservative, when you use them in this way.
Sense the first: Let H be the hypothesis that E1<R<E2. H is probably true, so you’re probably going to get evidence in favor of it. The post-data probability, then, will probably be higher than the pre-data probability.
So, epistemically… I don’t know. If you’re doing many experiments, this explains why using pre-data probabilities is a conservative strategy: in most experiments, you’re underestimating the probability that the parameter is between the estimators. Or, you can view this as logical uncertainty about a post-data probability that you don’t know how to calculate: you think that if you did the calculation, it would probably make you more, rather than less sure that the parameter is between the estimators.
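Here is a small simulation of that first sense under one particular set of added assumptions (a normal prior on R and normal data with known variance, so the post-data probability is easy to compute): in most repetitions the posterior probability that R lies in the realized interval comes out above 0.95, though not in every repetition.

```python
import math
import random

def phi(x):   # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Assumed setup (my choice, purely illustrative): prior R ~ N(0, tau^2),
# data X_1..X_n ~ N(R, sigma^2) with sigma known, standard 95% interval for the mean.
tau, sigma, n, z = 1.0, 2.0, 25, 1.96
random.seed(2)

reps, higher = 20000, 0
for _ in range(reps):
    R = random.gauss(0.0, tau)
    xbar = random.gauss(R, sigma / n ** 0.5)
    e1, e2 = xbar - z * sigma / n ** 0.5, xbar + z * sigma / n ** 0.5
    # conjugate posterior: R | data ~ N(m_post, s_post^2)
    prec = 1.0 / tau ** 2 + n / sigma ** 2
    m_post = (n * xbar / sigma ** 2) / prec
    s_post = prec ** -0.5
    post_prob = phi((e2 - m_post) / s_post) - phi((e1 - m_post) / s_post)
    higher += (post_prob > 0.95)

print(higher / reps)   # well above 1/2: the post-data probability usually exceeds 0.95
```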
Another precise sense in which the pre-data probabilities are more conservative is that pre-data probability distributions have higher entropy than post-data ones, on average.
Let’s say R and D are random variables. Let H(R) be the entropy of the probability distribution of R, likewise for D. That is,
$H(D) = E[-\log P(D)]$
I hope this notation is clear… see, usually I’d write P(D=d), but when it’s in an expectation operator, I want to make it clear that D is a random variable that the expectation operator is integrating over, so I write things like E[P(D)] (the expected value of P(D=d) when d is randomly selected).
Define the conditional entropy as follows:
$H(R \mid D = d) = E[-\log P(R \mid D = d) \mid D = d]$
The theorem, then, is this:
$E[H(R \mid D)] \le H(R)$
(I don’t have a free reference on hand, but it’s theorem 9.3.2 in Sheldon Ross’s “A First Course in Probability”)
So, imagine that R is a paRameter and D is some Data. And note that the expectation is not conditional on D; all of this is in the pre-data state of knowledge. So what this theorem means is that, before seeing the data, the expected value of the post-data entropy is below the current entropy.
This one’s a little weirder to interpret, but it clearly seems to be saying something relevant. As a statement about doing many independent experiments, it means that the average pre-data distribution entropy is higher than the average post-data distribution entropy, so when you use the pre-data probabilities, you’re taking them from a higher-entropy distribution. So that’s a sense in which you could call it a conservative strategy: it tends to use a probability distribution that’s too spread out. As a statement about logical uncertainty, when you haven’t calculated the post-data probabilities, I guess it could mean that your best estimate of the post-data entropy is lower than the entropy of the pre-data distribution. Which means that, if your best estimate is near the truth, you’re using a distribution that’s too spread out, not too concentrated.
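The inequality itself is easy to check numerically on a small discrete example (an arbitrary made-up joint distribution, purely for illustration):

```python
import math

# An arbitrary small joint distribution P(R = r, D = d); rows index r, columns index d.
joint = [[0.10, 0.25, 0.05],
         [0.30, 0.05, 0.25]]

p_r = [sum(row) for row in joint]                                  # marginal of R
p_d = [sum(joint[r][d] for r in range(2)) for d in range(3)]       # marginal of D

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

H_R = entropy(p_r)

# E[H(R | D)] = sum_d P(D = d) * H(R | D = d), averaged over the pre-data distribution of D
EH_R_given_D = sum(p_d[d] * entropy([joint[r][d] / p_d[d] for r in range(2)])
                   for d in range(3))

print(H_R, EH_R_given_D)   # E[H(R | D)] <= H(R), as the theorem says
# Note: H(R | D = d) for a particular d can exceed H(R); only the average over d cannot.
```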
So that’s what I’ve got. I think there’s a lot more to be said here. I haven’t read about this topic; I’m just putting together some stuff that I’ve observed incidentally, so I would appreciate a reference. But what it adds up to is that using pre-data probabilities is a conservative strategy.
And that’s important because conservative strategies can be really useful for science. Sometimes you wanna gather evidence until you’ve got enough that you can publish and say that you’ve proved something with confidence. Conservative calculations can often show what you want to show, which is that your evidence is sufficient.
If you can cook up examples of this, that would be helpful.