Ah, okay. In that case, here are a few attempts to ground the idea philosophically:
It’s the “prior-free” estimate with the least error. See that unbiased “prior-free” estimates must be mixtures of the (unbiased) estimates, and that biased estimates are dominated by being scaled to fit. So the best you can do is to pick the mixture that minimizes variance, which this is.
It actually is the point that maximizes the product of likelihoods (equivalently, the joint likelihood, since the estimate errors are assumed to be independent). You can see this by remembering that the Normal pdf is the exponential of a negative quadratic, so maximizing the product of likelihoods is the same as maximizing the sum of log-likelihoods. That happens where the log-likelihood slopes cancel (each is the negative of the other), which is where the distance to each estimate is inversely proportional to that estimate’s x^2 coefficient (equivalently, where the weights are inversely proportional to the variances).
There’s a pseudo-frequentist(?) version of this, where you treat each estimate as an assembly of (higher-variance) estimates at the same point, notice that the count is inversely proportional to the variance, and take the total population mean as your estimator. (You might like the mean for its L2-minimizing properties.)
A Bayesian interpretation is that, given the improper prior uniformly distributed over numbers and treating the two as independent pieces of evidence, the given formula gives the mode of the posterior (and, since the posterior is Normal, gives its mean and median as well).
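For what it’s worth, here’s a minimal numerical sketch of the second and fourth attempts, with made-up numbers standing in for G1, G2, σ1, σ2: the inverse-variance-weighted mean coincides with the maximizer of the joint log-likelihood (and hence, under a flat improper prior, with the posterior mean/mode).

```python
# Minimal numerical sketch (made-up numbers): the inverse-variance-weighted
# mean maximizes the joint Gaussian log-likelihood, and its variance is the
# reciprocal of the summed precisions.
import numpy as np

G1, s1 = 10.0, 2.0  # hypothetical estimate 1: mean and standard deviation
G2, s2 = 14.0, 1.0  # hypothetical estimate 2

# Weights inversely proportional to the variances.
w1, w2 = 1 / s1**2, 1 / s2**2
combined = (w1 * G1 + w2 * G2) / (w1 + w2)

# Joint log-likelihood of a candidate value x (additive constants dropped).
def joint_loglik(x):
    return -((x - G1) ** 2) / (2 * s1**2) - ((x - G2) ** 2) / (2 * s2**2)

# Brute-force the maximizer on a fine grid.
grid = np.linspace(0.0, 20.0, 200_001)
mle = grid[np.argmax(joint_loglik(grid))]

print(combined)       # 13.2
print(mle)            # ~13.2, matching the weighted mean
print(1 / (w1 + w2))  # 0.8: smaller than either variance (4.0 and 1.0)
```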
Are any of those compelling?
I don’t follow.
What’s an “assembly of estimates”?
But they’re distributions, not observations.
Sorry, I’m writing pretty informally here. I’m pretty sure that there are senses in which these arguments can be made formal, though I’m not really interested in going through that here, mostly because I don’t think formality wins us anything interesting here.
Some notes, though (still in a fairly informal mode):
My intuition that the only way to combine the two estimates without introducing a bias or an assumed prior is by a mixture comes from modeling each estimate (as a random variable) as the true value plus some idiosyncratic noise. Any function of the estimates is then an expression in the true value, each respective estimator’s noise, and maybe some other constants. But “unbiased” implies that setting the noise terms to 0 should leave the expression equal to the true value (in expectation). Without making assumptions about the actual distribution of true values, that forces the expression to be just 1 times the true value (plus maybe some other noise you don’t want, which I think you can get rid of). And the only way you get there from the noisy estimates is a mixture.
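A sketch of that in symbols, restricted to linear combinations for concreteness (which I think is what the informal argument amounts to): write each estimate as the true value plus zero-mean noise; unbiasedness for every possible true value pins the combination down to a mixture, and minimizing its variance recovers the inverse-variance weights.

```latex
% Two estimates G_i = \mu + \varepsilon_i, with E[\varepsilon_i] = 0,
% Var[\varepsilon_i] = \sigma_i^2, and the noise terms independent.
\begin{align*}
  \hat{G} &= a\,G_1 + b\,G_2 + c \\
  \mathbb{E}[\hat{G}] &= (a + b)\,\mu + c
    \quad\Rightarrow\quad a + b = 1,\; c = 0 \quad\text{(unbiased for every $\mu$)} \\
  \operatorname{Var}[\hat{G}] &= a^2\sigma_1^2 + (1 - a)^2\sigma_2^2,
    \quad\text{minimized at}\quad a = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}
\end{align*}
% i.e. weights inversely proportional to the variances.
```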
By “assembly”, I’m proposing to treat each estimate as a larger number of estimates with the same mean and larger variance, such that they form equivalent evidence. Intuitively, this works out if the count goes as the ratio of the variances (equivalently, the square of the ratio of the standard deviations). Then I claim that the natural thing to do with many estimates, each of the same variance, is to take a straight average.
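The arithmetic behind the count, as I’m imagining it: averaging n independent copies divides the variance by n, so matching evidence fixes n.

```latex
% Replace one estimate of variance \sigma^2 by n independent estimates with
% the same mean and variance \tau^2; their straight average has variance
% \tau^2 / n, so "equivalent evidence" requires
\frac{\tau^2}{n} = \sigma^2
\qquad\Longrightarrow\qquad
n = \frac{\tau^2}{\sigma^2} = \left(\frac{\tau}{\sigma}\right)^{2}.
% So each original estimate contributes to the straight average of the whole
% assembly in proportion to 1/\sigma^2.
```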
Sure, formally each observer’s posterior is a distribution. But if you treat “observer 1’s posterior is Normally distributed, with mean G1 and standard deviation σ1” as an observation you make as a Bayesian (who trusts observer 1’s estimation and calibration), it gets you there.
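Concretely, the update I have in mind: with a flat improper prior, treating each reported posterior as an independent Gaussian observation of the true value gives

```latex
p(\mu \mid G_1, G_2)
  \;\propto\; \exp\!\left(-\frac{(\mu - G_1)^2}{2\sigma_1^2}\right)
              \exp\!\left(-\frac{(\mu - G_2)^2}{2\sigma_2^2}\right),
% which is again Normal, with precision the sum of the precisions and mean
% the precision-weighted average:
\frac{1}{\sigma^2} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2},
\qquad
\hat{\mu} = \frac{G_1/\sigma_1^2 + G_2/\sigma_2^2}{1/\sigma_1^2 + 1/\sigma_2^2}.
```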
I’m not sure I’m familiar with the word “mixture” in the way you’re using it.
I mean a weighted sum where weights add to unity.