Are we using summand to mean the same thing here? To me, if we have an expression X1 + X2 + X3, then the summands are X1, X2, and X3. If we want to estimate Y, and E[X1+X2+X3] = Y, but E[X2] is close to 0 while Var[X2] is large, then X1+X3 is a better estimate for Y (lower mean squared error) than X1+X2+X3 is.
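To spell out the comparison (assuming, for illustration, that X1, X2, and X3 are independent and that Y is the fixed quantity being estimated): MSE(X1+X2+X3) = Var[X1] + Var[X2] + Var[X3], while MSE(X1+X3) = Var[X1] + Var[X3] + (E[X2])^2, since dropping X2 trades its variance for a squared bias of E[X2]. So dropping X2 wins exactly when (E[X2])^2 < Var[X2].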
Assume you have noisy measurements X1, X2, X3 of physical quantities Y1, Y2, Y3 respectively; the pairs (X1, Y1), (X2, Y2), (X3, Y3) are independent of one another; X2 is much noisier than the others; and you want a point estimate of Y = Y1+Y2+Y3. Then you shouldn’t use either X1+X2+X3 or X1+X3. You should use E[Y1|X1] + E[Y2|X2] + E[Y3|X3]. Regression to the mean is involved in computing each of the conditional expectations. Lots of noise (relative to the width of your prior) in X2 means that E[Y2|X2] will tend to be close to the prior mean E[Y2] even for extreme values of X2, but E[Y2|X2] is still a better estimate of that portion of the sum than E[Y2] is.
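Here is a minimal simulation of that setup, assuming Gaussian priors on the Yi and independent Gaussian measurement noise on the Xi (the particular means, prior widths, and noise levels are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Priors: Yi ~ Normal(mu_i, tau_i^2)  (made-up numbers)
mu  = np.array([10.0, 2.0, 5.0])
tau = np.array([1.0, 1.0, 1.0])

# Measurements: Xi = Yi + Normal(0, sigma_i^2); X2 is much noisier than the others
sigma = np.array([0.5, 10.0, 0.5])

Y = mu + tau * rng.standard_normal((n, 3))
X = Y + sigma * rng.standard_normal((n, 3))

# Gaussian posterior mean: shrink Xi toward the prior mean mu_i.
# The shrinkage factor tau^2 / (tau^2 + sigma^2) goes to 0 as the noise grows.
k = tau**2 / (tau**2 + sigma**2)
post = mu + k * (X - mu)          # E[Yi | Xi]

target = Y.sum(axis=1)            # the quantity we want: Y = Y1 + Y2 + Y3

def mse(est):
    return np.mean((est - target) ** 2)

print("X1 + X2 + X3               :", mse(X.sum(axis=1)))
print("X1 + X3 (drop X2)          :", mse(X[:, 0] + X[:, 2]))
print("E[Y1|X1]+E[Y2|X2]+E[Y3|X3] :", mse(post.sum(axis=1)))
print("E[Y1|X1]+E[Y2]+E[Y3|X3]    :", mse(post[:, 0] + mu[1] + post[:, 2]))
```

With the noise on X2 set much larger than the prior width, the raw sum should come out worst, dropping X2 should help, and the sum of posterior means should do best, with E[Y1|X1] + E[Y2] + E[Y3|X3] only slightly behind it.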
But that’s not mysterious, that’s just regression to the mean.
I don’t understand—in what way is it regression to the mean?
Also, what does that have to do with my original comment, which is that you will do better by dropping high-variance terms?
You said you should drop X if you know that your estimate has high variance but that the actual values don’t vary much. Knowing that the actual values don’t vary much means your prior has low variance, while knowing that your estimate is noisy means that your prior for the error term has high variance.
So when you observe an estimate, you should attribute most of the variance to error, and regress your estimate substantially towards your prior mean. After doing that regression, you are better off including X than dropping it, as far as I can see. (Of course, if the regressed estimate is sufficiently small then it wasn’t even worth computing the estimate in the first place, but that’s a normal issue of allocating bounded computational resources, and it depends not on the variance of your estimate of X but on how large you expect the real value to be.)
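A quick closed-form check of that, under the same Gaussian assumptions as the simulation above (tau and sigma here are made-up stand-ins for your prior width and the noise on the estimate):

```python
tau = 1.0                               # the actual value doesn't vary much (narrow prior)
for sigma in (0.5, 2.0, 10.0, 100.0):   # increasingly noisy estimate
    mse_drop = tau**2                                   # use the prior mean and drop the estimate
    mse_keep = tau**2 * sigma**2 / (tau**2 + sigma**2)  # use the regressed estimate E[Y|X]
    print(f"sigma = {sigma:6.1f}   drop: {mse_drop:.3f}   keep regressed: {mse_keep:.3f}")
```

The regressed estimate is never worse than dropping the term, though as the noise grows the advantage shrinks toward zero, which is the sense in which a very noisy X may not have been worth computing in the first place.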