Some thoughts about an estimator by Taleb
I recently read “Maximum ignorance probability, with applications to surgery’s error rates” by N. N. Taleb, in which he proposes a new estimator for the parameter of a Bernoulli random variable. In this article, I review its main points and share my own thoughts about it.
The estimator in question (which I will call the maximum ignorance estimator) takes the following form

$$\hat{p} = 1 - I^{-1}_{1/2}(n - m, m + 1)$$

where $I$ is the regularized beta function, $n$ is the number of independent trials and $m$ is the number of successes.
This estimator is derived by solving the following equation for $\hat{p}$

$$F(m; n, \hat{p}) = q$$

where $F(\cdot\,; n, p)$ is the cumulative distribution function of a binomial with $n$ trials and probability $p$ of success. In words, this estimator sets $\hat{p}$ to a value such that the probability of observing $m$ successes or fewer is exactly $q$. How do we pick $q$? The author sets $q$ to 0.5 as it maximizes the entropy (more on this later).
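The definition above can be checked numerically. The following is a minimal sketch, not code from the paper: the function name `max_ignorance_estimate` and the sample data (3 successes in 20 trials) are my own illustrative choices. It relies on the standard identity $F(m; n, p) = I_{1-p}(n - m, m + 1)$, valid for $0 \le m < n$, and on SciPy's `betaincinv` for the inverse of the regularized beta function.

```python
from scipy.special import betaincinv
from scipy.stats import binom


def max_ignorance_estimate(n: int, m: int, q: float = 0.5) -> float:
    """Solve F(m; n, p) = q for p, where F is the Binomial(n, p) CDF.

    Uses the identity F(m; n, p) = I_{1-p}(n - m, m + 1), valid for 0 <= m < n.
    """
    if not 0 <= m < n:
        raise ValueError("need 0 <= m < n")
    return 1.0 - betaincinv(n - m, m + 1, q)


# Hypothetical data (not from the paper): 3 successes in 20 trials.
n, m = 20, 3
p_hat = max_ignorance_estimate(n, m)

# By construction, observing m or fewer successes has probability q = 0.5.
print(p_hat, binom.cdf(m, n, p_hat))
```

Plugging the estimate back into the binomial CDF should return $q$ up to floating-point error, which is a quick sanity check on the closed form.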
Finally, the estimator is applied to a real-world problem. A surgeon works in an area with a mortality rate of 5% and has performed 60 procedures with no fatalities. What can we say about his error probability? Applying the estimator described earlier with $n = 60$ and $m = 0$ gives

$$\hat{p} = 1 - I^{-1}_{1/2}(60, 1) \approx 0.0115$$
Taleb argues that the empirical approach ($\hat{p} = m/n$) does not provide much information because the sample is small: the estimate is $0$, yet we “know” that the true value is not 0; we simply have not observed enough samples to see a failure.
On the other hand, the Bayesian would pick a Beta prior for $p$. A Beta distribution has two parameters and we have only one constraint (the 5% mortality rate in the area), which leaves us with one degree of freedom. The choice of this remaining degree of freedom is arbitrary, and it is shown to have a significant impact on the final estimate.
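To make this sensitivity concrete, here is a minimal sketch. It assumes the prior is written as Beta(0.05 k, 0.95 k), so that its mean is fixed at 5% and `k` is the free degree of freedom. Both the parametrization and the values of `k` are my own illustrative choices, not taken from the paper.

```python
def posterior_mean(n, m, prior_mean, k):
    """Posterior mean of p under a Beta(prior_mean * k, (1 - prior_mean) * k) prior.

    n is the number of trials, m the number of observed events (fatalities here),
    and k is the prior "strength" (an assumed name for the free parameter).
    """
    a = prior_mean * k
    b = (1.0 - prior_mean) * k
    # Conjugate update: the posterior is Beta(a + m, b + n - m).
    return (a + m) / (a + b + n)


# Surgeon example: 60 procedures, 0 fatalities, prior mean 5%.
for k in (1, 20, 200):  # arbitrary illustrative strengths
    print(f"k={k:>3}  posterior mean={posterior_mean(60, 0, 0.05, k):.5f}")
```

With the same prior mean, the posterior mean moves from well under 0.1% to nearly 4% as `k` grows, which is the arbitrariness described above.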
Having gone through the main points of the article, here follow my own thoughts:
While the choice of $q = 0.5$ is justified by entropy maximization, it is still a built-in assumption. In the surgeon example, it means we assume that if the surgeon performed 60 more procedures, there would be a 50% probability of one or more fatalities. Is this a reasonable assumption? I don’t know, but I would argue it is an assumption about a question we did not ask, i.e. “what is the probability of seeing a more extreme event?”
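A small sketch of how much the estimate depends on this choice. With no observed fatalities the defining equation reduces to $(1 - p)^n = q$, so $\hat{p} = 1 - q^{1/n}$. The helper name and the grid of $q$ values below are my own illustrative choices.

```python
n = 60  # procedures with no fatalities, as in the surgeon example


def estimate_zero_failures(n, q):
    """Solve (1 - p)**n = q for p: the m = 0 case of the estimator."""
    return 1.0 - q ** (1.0 / n)


for q in (0.05, 0.25, 0.5, 0.75, 0.95):  # arbitrary illustrative values
    p = estimate_zero_failures(n, q)
    # Probability of at least one fatality in the next n procedures;
    # it equals 1 - q by construction.
    p_at_least_one = 1.0 - (1.0 - p) ** n
    print(f"q={q:.2f}  p_hat={p:.5f}  P(>=1 fatality in next {n})={p_at_least_one:.2f}")
```

Smaller values of $q$ give larger estimates, so picking $q$ amounts to picking how pessimistic we want to be about the next 60 procedures.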
A Bayesian approach can avoid the issue of how to pick the prior by ignoring the mortality rate (as the maximum ignorance estimator does) and using Laplace’s rule of succession. I think that if one wants to use the mortality information, there is no objective way of setting the remaining degree of freedom without additional information about the problem.
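For reference, Laplace's rule of succession estimates $p$ as $(m + 1)/(n + 2)$, which is the posterior mean under a uniform Beta(1, 1) prior. A minimal sketch for the surgeon example (the function name is my own):

```python
from fractions import Fraction


def laplace_rule(n, m):
    """Laplace's rule of succession: (m + 1) / (n + 2)."""
    return Fraction(m + 1, n + 2)


# Surgeon example: 60 procedures, 0 fatalities.
estimate = laplace_rule(60, 0)
print(estimate, float(estimate))  # 1/62, roughly 0.016
```

This is somewhat higher than the maximum ignorance estimate for the same data, but of the same order of magnitude.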
On the same topic, it is not possible to decide which estimator is better without knowing how $\hat{p}$ is going to be used. If I were comparing different surgeons to perform the procedure on me, I would not only compare estimates of $p$; I would also care about how wrong each estimate can be. In this case, I would compute confidence intervals to obtain an upper bound. On the other hand, if I were designing a recommendation system and $p$ were the probability that a particular customer of my streaming website likes horror movies, it would be fine to use Laplace’s rule of succession, since it is OK to be slightly off (the estimator is biased). My main concern would be to avoid setting $\hat{p}$ to 0 from a small number of samples and then being unable to correct this estimate.
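As an illustration of the upper-bound idea, here is a minimal sketch using a one-sided Clopper-Pearson (exact binomial) bound. The choice of this particular interval and of the 95% level are my assumptions; other interval constructions would give somewhat different numbers.

```python
from scipy.stats import beta


def upper_confidence_bound(n, m, alpha=0.05):
    """One-sided Clopper-Pearson upper bound for a binomial proportion.

    n is the number of trials, m the number of observed events,
    and 1 - alpha the confidence level.
    """
    if m == n:
        return 1.0
    # The bound is the (1 - alpha) quantile of a Beta(m + 1, n - m) distribution.
    return float(beta.ppf(1.0 - alpha, m + 1, n - m))


# Surgeon example: 60 procedures, 0 fatalities.
print(upper_confidence_bound(60, 0))
```

For zero observed events the bound has the closed form $1 - \alpha^{1/n}$, which for $n = 60$ and $\alpha = 0.05$ is a little under 5%, several times larger than any of the point estimates discussed here.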
There are many things to say about this result by N. Taleb. To start with, a minor detail: I would have written $\hat{p} = I^{-1}_{1/2}(m+1, n-m)$, which is much more coherent with the fact that he is inverting a CDF.
He is inverting the CDF of a Beta distribution with parameters (m+1, n-m), which is the posterior in the Beta-Binomial model under a Beta(1, 0) prior (!!!), with no explanation at all! It would have made slightly more sense to use a Beta(1, 1) instead.
Note that all he does by selecting q = 1/2 is choose as this “optimal estimate” the median of the Beta(m+1, n-m) distribution, i.e., the median of the posterior distribution.
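These equivalences are easy to check numerically. The sketch below uses the surgeon's numbers and relies on the symmetry $I_x(a, b) = 1 - I_{1-x}(b, a)$ of the regularized beta function. The variable names are mine.

```python
from scipy.special import betaincinv
from scipy.stats import beta

n, m = 60, 0  # surgeon example: 60 procedures, 0 fatalities

# Form with the leading "1 -", as in the original estimator.
taleb_form = 1.0 - betaincinv(n - m, m + 1, 0.5)

# Equivalent form: the inverse CDF of Beta(m + 1, n - m) evaluated at 1/2.
inverse_cdf_form = betaincinv(m + 1, n - m, 0.5)

# Median of the Beta(m + 1, n - m) distribution.
posterior_median = beta.median(m + 1, n - m)

print(taleb_form, inverse_cdf_form, posterior_median)
```

All three numbers agree up to floating-point error. Note that the equivalence of the first two forms is specific to q = 1/2; for a general q the second form would need 1 - q as its last argument.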
Note that he completely ignores the base rate of 5%. Can he not make use of it at all? So, even better than a Beta(1, 1), I’d have chosen the maximum entropy distribution among those betas with mean 0.05, i.e., one with a large variance; in fact, Taleb complains that the Bayesian approach provides funny results with highly informative beta priors.
If I had been facing the problem, I would have inquired about the distribution of the historical records whose aggregation gives the 5% average, and used it as a prior to model this new doctor.
All in all, I do not think Taleb wrote his best page on that day. But he has many other great ones to learn from!