Some thoughts about an estimator by Taleb
I recently read “Maximum ignorance probability, with applications to surgery’s error rates” by N.N. Taleb, in which he proposes a new estimator for the parameter $p$ of a Bernoulli random variable. In this article, I review its main points and share my own thoughts about it.
The estimator in question (which I will call the maximum ignorance estimator) takes the following form:

$$\hat{p} = 1 - I^{-1}_{0.5}(n - m,\, m + 1)$$

where $I^{-1}_{0.5}(a, b)$ denotes the value $x$ at which the regularized incomplete beta function satisfies $I_x(a, b) = 0.5$, $n$ is the number of independent trials, and $m$ is the number of successes.
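For concreteness, here is a minimal Python sketch of the closed form (the function name is my own; `scipy.special.betaincinv(a, b, y)` returns the $x$ such that $I_x(a, b) = y$):

```python
# A minimal sketch of the maximum ignorance estimator's closed form.
from scipy.special import betaincinv

def max_ignorance_estimate(n, m, q=0.5):
    """Estimate p from m successes in n trials by inverting I_x(n-m, m+1) = q."""
    return 1.0 - betaincinv(n - m, m + 1, q)
```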
This estimator is derived by solving the following equation for $p$:

$$F_p(m) = q$$

where $F_p$ is the cumulative distribution function of a binomial with $n$ trials and probability $p$ of success. In words, this estimator sets $p$ to the value such that the probability of observing $m$ successes or fewer is exactly $q$; the closed form above follows from the identity $F_p(m) = I_{1-p}(n - m,\, m + 1)$. How do we pick $q$? The author sets $q$ to $0.5$ as it maximizes the entropy (more on this later).
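As a quick sanity check (a sketch, with arbitrary example values), solving $F_p(m) = 0.5$ numerically with a root finder recovers the closed form:

```python
# Sketch: solve F_p(m) = 0.5 numerically and compare with the closed form above.
from scipy.optimize import brentq
from scipy.stats import binom

n, m = 20, 3  # arbitrary example values
p_numeric = brentq(lambda p: binom.cdf(m, n, p) - 0.5, 1e-9, 1 - 1e-9)
assert abs(p_numeric - max_ignorance_estimate(n, m)) < 1e-9
```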
Finally, the estimator is applied to a real-world problem. A surgeon works in an area with a mortality rate of 5% and has performed 60 procedures with no fatalities. What can we say about his error probability? Applying the estimator described earlier gives $\hat{p} = 0.01148$.
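Plugging the surgeon’s numbers into the sketch above reproduces this figure:

```python
# Surgeon example: n = 60 procedures, m = 0 fatalities
# (using max_ignorance_estimate from the sketch above).
p_hat = max_ignorance_estimate(60, 0)
print(p_hat)  # ~0.011486, the 0.01148 quoted in the paper
```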
Taleb argues that the empirical approach ($\hat{p} = m/n$) does not provide a lot of information when the sample is small: here the estimate is $\hat{p} = 0$, yet we “know” that the true value is not 0; we simply have not observed enough samples to see a failure.
On the other hand, a Bayesian would pick a Beta prior for $p$. A Beta distribution has two parameters and we have only one constraint (the 5% mortality rate in the area), which leaves us with one degree of freedom. The choice of this remaining degree of freedom is arbitrary, and the paper shows that it has a significant impact on the final estimate.
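To make the sensitivity concrete, here is a sketch with two hypothetical prior strengths: both priors have mean 0.05, but after observing 0 fatalities in 60 procedures their posterior means differ by a factor of more than four:

```python
# Sketch: two Beta(a, b) priors with mean a / (a + b) = 0.05; the prior
# "strength" a + b is the free parameter. After 0 fatalities in 60
# procedures, the posterior is Beta(a, b + 60).
for a, b in [(0.5, 9.5), (5.0, 95.0)]:  # hypothetical strengths
    print(a, b, a / (a + b + 60))  # posterior means: ~0.0071 vs ~0.0313
```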
Having gone through the main points of the article, here are my own thoughts:
While the choice of $q = 0.5$ is justified by entropy maximization, it is still a built-in assumption. In the surgeon example, it amounts to assuming that if the surgeon performed 60 more procedures, there would be a 50% probability of one or more fatalities. Is this a reasonable assumption? I don’t know, but I would argue it is an assumption about a question which we did not ask, i.e. “what is the probability of seeing a more extreme event?”
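This reading follows directly from the defining equation: at $\hat{p}$, a fresh run of 60 procedures contains zero fatalities with probability 0.5, and hence at least one fatality with probability 0.5:

```python
# Sketch: at the maximum ignorance estimate, P(one or more fatalities
# in 60 further procedures) is 0.5 by construction.
p_hat = max_ignorance_estimate(60, 0)
print(1 - (1 - p_hat) ** 60)  # ~0.5
```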
A Bayesian approach can avoid the issue of how to pick the prior by ignoring the mortality rate (as the maximum ignorance estimator does) and using Laplace’s rule of succession. If one wants to use the mortality information, I think there is no objective way of setting the remaining degree of freedom without additional information about the problem.
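For reference, Laplace’s rule of succession corresponds to a uniform Beta(1, 1) prior and gives $\hat{p} = (m + 1)/(n + 2)$, a somewhat larger estimate here:

```python
# Sketch: Laplace's rule of succession for the surgeon's numbers.
n, m = 60, 0
print((m + 1) / (n + 2))  # ~0.0161, vs ~0.0115 for maximum ignorance
```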
On the same topic, it is not possible to decide which estimator is better without knowing how $\hat{p}$ is going to be used. If I were comparing surgeons to perform the procedure on me, I would not just compare estimates of $p$; I would also care about how wrong my estimate of $p$ can be. In this case, I would compute confidence intervals to obtain an upper bound. On the other hand, if I were designing a recommendation system and $p$ were the probability that a particular customer of my streaming website likes horror movies, it would be fine to use Laplace’s rule of succession, since being slightly off is acceptable (the estimator is biased). My main concern would be to avoid setting $\hat{p}$ to 0 with a small number of samples and then being unable to correct this estimate.
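As a sketch of what I mean by an upper bound, one standard (though not the only) choice is the one-sided Clopper–Pearson interval, which for $m$ successes in $n$ trials is a Beta quantile:

```python
# Sketch: one-sided 95% Clopper-Pearson upper bound for p.
from scipy.stats import beta

def clopper_pearson_upper(n, m, confidence=0.95):
    return beta.ppf(confidence, m + 1, n - m)

print(clopper_pearson_upper(60, 0))  # ~0.0487: p below ~5% at 95% confidence
```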