Can anyone think of a theoretical justification (Bayesian, Frequentist, whatever) for the procedure described in this blog post? I think this guy invented it for himself—searching on Google for “blended estimate” just sends me to his post.
Are you asking for a justification for averaging independent estimates to achieve an estimate with lower errors? “Blended estimate” isn’t a specific term of art, but the general idea here is so common that I’m not sure _what_ the most common term for it is.
And the theoretical justification—under assumptions of independent and Normal errors—is at the post, where the author demonstrates that there’s a lower error from the weighted average (and that their choice of weights minimizes the error). Am I missing something here?
The author has selected a weighted average such that if we treat that weighted average as a random variable, its standard deviation is minimized. But if we just want a random variable whose standard deviation is minimized, we could have a distribution which assigns 100% credence to the number 0 and be done with it. In other words, my question is whether the procedure in this post can be put on a firmer philosophical foundation. Or whether there is some alternate derivation/problem formulation (e.g. a mixture model) that gets us the same formula.
Another way of getting at the same idea: There are potentially other procedures one could use to create a “blended estimate”, for example, you could find the point such that the product of the likelihoods of the two distributions is maximized, or take a weighted average of the two estimates using e.g. (1/sigma) as the weight of each estimate. Is there a justification for using this particular loss function, of finding a random variable constructed via weighted average whose variance is minimized? It seems to me that this procedure is a little weird because it’s the random variable that corresponds to the person’s age that we really care about. We should be looking “upstream” of the estimates, but instead we’re going “downstream” (where up/down stream roughly correspond to the direction of arrows in a Bayesian network).
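For concreteness, the formula from the post, as I understand it (estimates G1 and G2 with standard deviations σ1 and σ2), is the inverse-variance weighted average, whose precision is the sum of the two precisions:

$$\hat{G} = \frac{G_1/\sigma_1^2 + G_2/\sigma_2^2}{1/\sigma_1^2 + 1/\sigma_2^2}, \qquad \frac{1}{\sigma_{\text{blend}}^2} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}.$$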
Ah, okay. In that case, here are a few attempts to ground the idea philosophically:
It’s the “prior-free” estimate with the least error. Note that unbiased “prior-free” estimates must be mixtures of the (unbiased) individual estimates, and that any biased combination is dominated by a rescaled, unbiased version of itself. So the best you can do is pick the mixture that minimizes variance, which is exactly this one.
It actually is the point that maximizes the product of likelihoods (equivalently, the joint likelihood, since the estimate errors are assumed to be independent). You can see this by remembering that the Normal pdf is the exponential of a negative quadratic, so you maximize the product of likelihoods by maximizing the sum of log-likelihoods, which happens where the two log-likelihood slopes cancel, which happens when each estimate’s distance from the blended point is inversely proportional to its x^2 coefficient (equivalently, when the weights are inversely proportional to the variances). There’s a quick numeric check of this below.
There’s a pseudo-frequentist(?) version of this, where you treat each estimate as an assembly of (higher-variance) estimates at the same point, notice that the count is inversely proportional to the variance, and take the total population mean as your estimator. (You might like the mean for its L2-minimizing properties.)
A Bayesian interpretation is that, given the improper prior uniformly distributed over numbers and treating the two as independent pieces of evidence, the given formula gives the mode of the posterior (and, since the posterior is Normal, gives its mean and median as well).
Are any of those compelling?
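Here’s a quick numeric check of the first, second, and fourth points, with made-up numbers (G1, G2, σ1, σ2 and the scipy calls are just my illustration, not anything from the post):

```python
from scipy.optimize import minimize_scalar

# Two made-up estimates of the same quantity (e.g. an age):
# observer 1 reports G1 with sd sigma1, observer 2 reports G2 with sd sigma2.
G1, sigma1 = 30.0, 4.0
G2, sigma2 = 38.0, 2.0

# Inverse-variance weighted average (the "blended estimate").
w1 = (1 / sigma1**2) / (1 / sigma1**2 + 1 / sigma2**2)
blend = w1 * G1 + (1 - w1) * G2

# First point: the mixture weight minimizing the variance of w*E1 + (1-w)*E2,
# found numerically: var(w) = w^2*sigma1^2 + (1-w)^2*sigma2^2.
w_star = minimize_scalar(lambda w: w**2 * sigma1**2 + (1 - w)**2 * sigma2**2).x

# Second point: the value maximizing the product of the two Normal likelihoods,
# i.e. minimizing the sum of squared standardized distances.
mle = minimize_scalar(
    lambda x: (x - G1)**2 / sigma1**2 + (x - G2)**2 / sigma2**2
).x

# Fourth point: Bayesian update from a flat (improper) prior, done with
# precisions: precisions add, and the posterior mean is precision-weighted.
post_mean = (G1 / sigma1**2 + G2 / sigma2**2) / (1 / sigma1**2 + 1 / sigma2**2)

print(blend, w_star * G1 + (1 - w_star) * G2, mle, post_mean)
# All four numbers agree (up to optimizer tolerance): 36.4
```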
I don’t follow.
What’s an “assembly of estimates”?
But they’re distributions, not observations.
Sorry, I’m writing pretty informally here. I’m pretty sure that there are senses in which these arguments can be made formal, though I’m not really interested in going through that here, mostly because I don’t think formality wins us anything interesting here.
Some notes, though: (still in a fairly informal mode)
My intuition that the only way to combine the two estimates without introducing a bias or an assumed prior is by a mixture comes from modeling each estimate, viewed as a random variable, as the true value plus some idiosyncratic noise. Any function of the two estimates then yields an expression in terms of the true value, each estimator’s noise term, and maybe other constants. But “unbiased” implies that setting the noise terms to 0 should set the expression equal to the true value (in expectation). Without making assumptions about the actual distribution of true values, the expression needs to be exactly 1 times the true value (plus maybe some other noise you don’t want, which I think you can get rid of). And the only way to get there from the noisy estimates is a mixture.
By “assembly”, I’m proposing to treat each estimate as a larger number of estimates with the same mean and a larger common variance, such that they form equivalent evidence. Intuitively, this works out if the count equals the ratio of the variances (equivalently, the square of the ratio of the standard deviations). Then I claim that the natural thing to do with many estimates of the same variance is to take a straight average.
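A toy numeric version of that (made-up numbers, with τ chosen so the counts come out whole):

```python
import numpy as np

G1, sigma1 = 30.0, 4.0   # made-up estimate 1
G2, sigma2 = 38.0, 2.0   # made-up estimate 2

# Treat estimate i as the average of n_i sub-estimates that all share a
# common, larger standard deviation tau.  Averaging n_i of them gives
# variance tau^2 / n_i, so matching sigma_i^2 requires n_i = (tau / sigma_i)^2.
tau = 8.0
n1 = round((tau / sigma1) ** 2)   # 4 copies
n2 = round((tau / sigma2) ** 2)   # 16 copies

# Pool the sub-estimates (each sitting at its parent estimate's mean)
# and take the straight average.
pooled = np.array([G1] * n1 + [G2] * n2)
assembly_mean = pooled.mean()

# Compare with the inverse-variance weighted average.
blend = (G1 / sigma1**2 + G2 / sigma2**2) / (1 / sigma1**2 + 1 / sigma2**2)
print(assembly_mean, blend)   # both 36.4
```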
Sure, formally each observer’s posterior is a distribution. But if you treat “observer 1’s posterior is Normally distributed, with mean G1 and standard deviation σ1” as an observation you make as a Bayesian (who trusts observer 1’s estimation and calibration), it gets you there.
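Spelled out under that reading: with a flat improper prior on the true value θ, and the two reports treated as independent Normal observations of θ,

$$p(\theta \mid G_1, G_2) \;\propto\; \exp\!\left(-\frac{(G_1-\theta)^2}{2\sigma_1^2}\right)\exp\!\left(-\frac{(G_2-\theta)^2}{2\sigma_2^2}\right) \;\propto\; \exp\!\left(-\frac{(\theta-\mu_\ast)^2}{2\sigma_\ast^2}\right),$$

$$\frac{1}{\sigma_\ast^2} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}, \qquad \mu_\ast = \sigma_\ast^2\left(\frac{G_1}{\sigma_1^2} + \frac{G_2}{\sigma_2^2}\right),$$

so the posterior mean (which is also its mode and median) is exactly the inverse-variance weighted average.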
I’m not sure I’m familiar with the word “mixture” in the way you’re using it.
I mean a weighted sum where weights add to unity.
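And spelling out why the weights have to add to unity (and why this particular mixture wins): writing each estimate as $E_i = \theta + \varepsilon_i$ with $\mathbb{E}[\varepsilon_i] = 0$ and the $\varepsilon_i$ independent,

$$\mathbb{E}[w_1 E_1 + w_2 E_2] = (w_1 + w_2)\,\theta,$$

which equals θ for every possible θ only if $w_1 + w_2 = 1$; and among those mixtures,

$$\operatorname{Var}(w_1 E_1 + w_2 E_2) = w_1^2\sigma_1^2 + w_2^2\sigma_2^2$$

is smallest when each $w_i$ is proportional to $1/\sigma_i^2$, i.e. at the inverse-variance weights.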