Ah, R&W’s pi function.
This is kind of tricky, because it doesn’t seem like it should hold information, unless it correlates with R&W’s theta (probability of Y = 1).
If pi and theta were guaranteed independent, would Horvitz-Thompson in any meaningful way outperform Sum(Y) / Sum(R), that is, the average observed value of Y in cases where Y is observed?
p(A | X) holds information because it determines which Ys we see. Say for a moment A was independent of X, so we saw Y if a fair coin came up heads (p(A = 0) = 0.5). Then the Ys we see are distributed the same as the Ys we don’t see, because the coin doesn’t look at anything about Y to determine whether to come up heads.
But if the coin depends on X, the worry is that the Ys we see come with particular Xs and not others. So if we just ignore the Ys we don’t see, the Ys we actually see give a biased view of the underlying Y, and the shape of that bias is set by p(A | X).
Somehow, to correctly deal with this bias, we must involve p(A|X) (explicitly or implicitly).
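To make the coin argument concrete, here is a small simulation sketch. It is only an illustration under assumptions that are not in the thread: X is uniform on [0, 1], theta(x) = x, the X-dependent coin is pi(x) = 0.1 + 0.8x, and pi is treated as known. The function name `compare_estimators` is made up for this example. It compares the plain average of the observed Ys with the Horvitz-Thompson estimate, which weights each observed Y by 1 / pi(X).

```python
import random


def compare_estimators(pi, theta, n=200_000, seed=0):
    """Return (naive, horvitz_thompson) estimates of E[Y].

    pi(x)    : probability that Y is observed given X = x (assumed known)
    theta(x) : probability that Y = 1 given X = x
    Both functional forms are illustrative assumptions, not from the thread.
    """
    rng = random.Random(seed)
    observed_sum = 0    # sum of Y over the cases where Y is observed
    observed_count = 0  # number of cases where Y is observed
    ht_sum = 0.0        # sum of R * Y / pi(X) over all cases
    for _ in range(n):
        x = rng.random()                       # X ~ Uniform(0, 1)
        y = 1 if rng.random() < theta(x) else 0
        r = 1 if rng.random() < pi(x) else 0   # the "coin" (R, or A above)
        if r:
            observed_sum += y
            observed_count += 1
            ht_sum += y / pi(x)
    naive = observed_sum / observed_count      # Sum(Y) / Sum(R)
    horvitz_thompson = ht_sum / n              # divides by n, not Sum(R)
    return naive, horvitz_thompson


def theta(x):
    return x  # so the true E[Y] is 0.5


# Fair coin: whether we see Y ignores X entirely.
naive_fair, ht_fair = compare_estimators(lambda x: 0.5, theta)

# Coin that depends on X: high-X (and hence high-theta) cases are seen more.
naive_dep, ht_dep = compare_estimators(lambda x: 0.1 + 0.8 * x, theta)

print("fair coin:      naive=%.3f  HT=%.3f" % (naive_fair, ht_fair))
print("X-dependent:    naive=%.3f  HT=%.3f" % (naive_dep, ht_dep))
```

Under these assumptions both estimators should land near 0.5 with the fair coin. With the X-dependent coin the plain average converges to E[pi(X) theta(X)] / E[pi(X)] = 19/30, roughly 0.633, while the weighted estimate stays near 0.5.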
Sure. But if we know or suspect any correlation between A and Y, there’s nothing strange about the common information between them being expressed in the prior, right?
Granted, H-T will have nice worst-case performance if we’re not confident about A and Y being independent, but that reduces to this debate http://lesswrong.com/lw/k9c/can_noise_have_power/.
I wrote up a pretty detailed reply to Luke’s question: http://lesswrong.com/lw/kd4/the_power_of_noise/