For example, to make this point in the case of sums with biased terms, you would need to say how you could predictably do better by throwing out terms of an estimation, even when you don’t expect their inclusion to be correlated with their contribution to the estimate.
I agree with pretty much everything else you wrote here (and in the OP), but I’m a bit confused by this line. It seems like if the terms have a mean that is close to zero, but high variance, then you will usually do better by getting rid of them.
I’m not convinced of this. If you know that a summand has a mean close to zero and your estimate of it has high variance, then your prior is sharply concentrated relative to that noise, and you will regress the estimate far toward the mean. Including the regressed estimate in the sum will still increase your accuracy. (Though of course if the noise is expected to be 1000x greater than the signal, you will be dividing by a factor of 1000, which is more or less the same as throwing the term out. But the naive Bayesian EV maximizer will still get this one right.)
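As a minimal sketch of the shrinkage being described here (my notation, not from the thread: a Gaussian prior on the true summand Y and Gaussian noise on the estimate X):

$$Y \sim \mathcal{N}(0, \tau^2), \qquad X = Y + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2) \;\Longrightarrow\; E[Y \mid X] = \frac{\tau^2}{\tau^2 + \sigma^2}\,X.$$

When the noise variance σ² is much larger than the prior variance τ², the shrinkage factor is tiny, so the regressed estimate is nearly, but not exactly, zero.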
Are we using summand to mean the same thing here? To me, if we have an expression X1 + X2 + X3, then the summands are X1, X2, and X3. If we want to estimate Y, and E[X1+X2+X3] = Y, but E[X2] is close to 0 while Var[X2] is large, then X1+X3 is a better estimate for Y than X1+X2+X3 is.
Assume you have noisy measurements X1, X2, X3 of physical quantities Y1, Y2, Y3 respectively; variables 1, 2, and 3 are independent; X2 is much noisier than the others; and you want a point-estimate of Y = Y1+Y2+Y3. Then you shouldn’t use either X1+X2+X3 or X1+X3. You should use E[Y1|X1] + E[Y2|X2] + E[Y3|X3]. Regression to the mean is involved in computing each of the conditional expectations. Lots of noise (relative to the width of your prior) in X2 means that E[Y2|X2] will tend to be close to the prior E[Y2] even for extreme values of X2, but E[Y2|X2] is still a better estimate of that portion of the sum than E[Y2] is.
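A quick simulation sketch of this comparison (entirely my own illustration, with made-up Gaussian priors and noise levels, and X2 given much noisier measurements than X1 and X3):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000  # Monte Carlo trials

# Illustrative priors Y_i ~ N(mu_i, tau_i^2) and measurement noise (made-up numbers)
mu = np.array([5.0, 0.0, -3.0])     # prior means; E[Y2] is close to 0
tau = np.array([1.0, 1.0, 1.0])     # prior standard deviations of the true quantities
sigma = np.array([0.5, 3.0, 0.5])   # noise standard deviations; X2 is much noisier

Y = mu + tau * rng.standard_normal((n, 3))    # true quantities Y1, Y2, Y3
X = Y + sigma * rng.standard_normal((n, 3))   # noisy measurements X1, X2, X3

# Posterior means under the Gaussian model: shrink each X_i toward its prior mean
k = tau**2 / (tau**2 + sigma**2)              # shrinkage factors
post = mu + k * (X - mu)

Y_total = Y.sum(axis=1)
estimators = {
    "X1 + X2 + X3": X.sum(axis=1),
    "X1 + X3 (drop X2)": X[:, 0] + X[:, 2],
    "E[Y1|X1] + E[Y2|X2] + E[Y3|X3]": post.sum(axis=1),
}
for name, est in estimators.items():
    print(f"{name:32s} MSE = {np.mean((est - Y_total) ** 2):.3f}")
```

With these numbers, dropping X2 does beat the naive sum (mean squared error around 1.5 versus 9.5), but the sum of posterior means beats both (around 1.3), which is the point being made here.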
But that’s not mysterious, that’s just regression to the mean.
I don’t understand—in what way is it regression to the mean?
Also, what does that have to do with my original comment, which is that you will do better by dropping high-variance terms?
You said you should drop X if you know that your estimate is high variance but that the actual values don’t vary much. Knowing that the actual value doesn’t vary much means your prior has low variance, while knowing that your estimate is noisy means that your prior for the error term has high variance.
So when you observe an estimate, you should attribute most of the variance to error, and regress your estimate substantially towards your prior mean. After doing that regression, you are better off including X than dropping it, as far as I can see. (Of course, if the regressed estimate is sufficiently small then it wasn’t even worth computing the estimate, but that’s a normal issue with allocating bounded computational resources and doesn’t depend on the variance of your estimate of X, just how large you expect the real value to be.)
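Under the same Gaussian sketch as above (my notation), the comparison for the noisy term can be made exact: dropping it amounts to plugging in the prior mean, whose expected squared error is the prior variance, while keeping the regressed estimate gives the posterior variance:

$$E\big[(Y - E[Y \mid X])^2\big] = \frac{\tau^2 \sigma^2}{\tau^2 + \sigma^2} \;<\; \tau^2 = E\big[(Y - E[Y])^2\big],$$

so including the regressed term is never worse in expectation, though the improvement shrinks toward zero as the noise σ² grows, which is why computing it may not be worth the effort.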
Of course, any time you toss something out, that just corresponds to giving it negligible weight. And of course, accuracy-wise, under limited computing power, you’re better off actually tossing it out and spending the computing time elsewhere, where it buys more accuracy.