There are a couple of things I’m not understanding here.
Firstly, the example of the cancer survival test seems to have some inconsistency. The fitted model is said to give the right answer in 990 out of 1000 test cases. Where do you subsequently get the Beta(1000,2) distribution from? I am not seeing the source of that 2. And given that the model is right on exactly 99% of the test cases, how is the imaginary Bayesian coming up with a clearly wrong interval [0.996,0.9998]?
Secondly, in the later example of estimating E[ Y | f(X)=1 ], the method foisted on the Bayesian appears to involve estimating the whole of the function f. This seems to me an obviously misguided approach to the problem, whatever one’s views on statistical argument. Why cannot the Bayesian say, with the frequentist, it doesn’t matter what f is, I have been asked about the population for which f(X)=1. I do not need to model the process f by which that population was selected, only the behaviour of Y within that population? And then proceed in the usual way.
OP will correct me if I am wrong, but I think he is trying to restate the Robins/Wasserman example. You do not need to model f(X), but the point of that example is that you know f, but the conditional model for Y is very very complicated. So you either do a Bayesian approach with a prior and a likelihood for Y, or you just use Horvitz-Thompson with f.
I like to think of that example using causal inference: you want to estimate the causal effect p(Y | do(A)) of A on Y when the policy for assigning treatment, p(A | C), is known exactly, but p(Y | A, C) is super complex. Likelihood-based methods (Bayesian inference included) will use \sum_C p(Y | A, C) p(C). But you can just compute the inverse-probability-weighted average (1/n) \sum_i Y_i 1(A_i = a) / p(A_i | C_i) over the samples to get the same thing and avoid modeling p(Y | A, C). But doing that isn’t Bayesian.
See also this:
http://www.biostat.harvard.edu/robins/coda.pdf
I think we talked about this before.
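To make that concrete, here is a minimal simulation sketch of the inverse-probability-weighted estimator with a known assignment policy. The data-generating process and all numbers below are invented for illustration; the point is only that the estimate uses the known p(A | C) and never models p(Y | A, C).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Covariate C and a treatment-assignment policy p(A=1 | C) known to the analyst.
C = rng.uniform(-2, 2, size=n)
propensity = 1 / (1 + np.exp(-C))
A = rng.binomial(1, propensity)

# A deliberately messy outcome model p(Y | A, C) that we do NOT want to model.
def outcome_mean(a, c):
    return 0.2 + 0.5 * a + 0.3 * np.sin(5 * c) * np.cos(7 * c ** 2)

Y = outcome_mean(A, C) + rng.normal(0, 1, size=n)

# Inverse-probability-weighted (Horvitz-Thompson style) estimate of E[Y | do(A=1)],
# using only the known propensity score.
ipw_estimate = np.mean(Y * (A == 1) / propensity)

# Ground truth, obtained by setting everyone to A=1 in the simulation.
truth = np.mean(outcome_mean(1, C))

print(f"IPW estimate of E[Y | do(A=1)]: {ipw_estimate:.3f}")
print(f"Simulated truth:                {truth:.3f}")
```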
My example is very similar to the Robins/Wasserman example, but you end up drawing different conclusions. Robins/Wasserman show that you can’t make sense of importance sampling in a Bayesian framework. My example shows that you can’t make sense of “conditional sampling” in a Bayesian framework. The goal of importance sampling is to estimate E[Y], while the goal of conditional sampling is to estimate E[Y|event] for some event.
We did talk about this before, that’s how I first learnt of the R/W example.
I think these are isomorphic, estimating E[Y] if Y is missing at random conditional on C is the same as estimating E[Y | do(a)] = E[Y | “we assign you to a given C”].
“Causal inference is a missing data problem, and missing data is a causal inference problem.”
Or I may be “missing” something. :)
Yes, I think you are missing something (although it is true that causal inference is a missing data problem).
It may be easier to think in terms of the potential outcomes model. Y0 is the outcome under no treatment, Y1 is the outcome under treatment, and you only ever observe either Y0 or Y1, depending on whether D=0 or 1. Generally you are trying to estimate E[Y1] or E[Y0] or their difference.
The point is that the quantity Robins and Wasserman are trying to estimate, E[Y], does not depend on the importance sampling distribution. Whereas the quantity I am trying to estimate, E[Y|f(X)=1], does depend on f. Changing f changes the population quantity to be estimated.
It is true that sometimes people in causal inference are interested in estimating things like E[Y1 - Y0 | D=1], e.g. “the treatment effect on the treated.” However this is still different from my setup because D is a random variable, as opposed to an arbitrary function of the known variables like f(X).
Not following. By “importance sampling distribution” do you mean the distribution that tells you whether Y is missing or not? If so, changing this distribution will change what you have to do to estimate E[Y] in the Robins/Wasserman case. For example, if you change the distribution to just depend on an independent coin flip you move from “MAR” to “MCAR” (in causal inference, from “conditional ignorability” to “ignorability”). Then your procedure depends on this distribution (but your target does not, this is true). Similarly “p(y | do(a))” does not change, but the functional of the observed data equal to “p(y | do(a))” will change if you change the treatment assignment distribution.
(Btw, people do versions of ETT where D is complicated and not a simple treatment event. Actually I have something in a recent draft of mine called “effect of treatment on the indirectly treated” that’s like that).
By “importance sampling distribution” do you mean the distribution that tells you whether Y is missing or not?
Right. You could say the cases of Y1|D=1 you observe in the population are an importance sample from Y1, the hypothetical population that would result if everyone in the population were treated. E[Y1], the quantity to be estimated, is the mean of this hypothetical population. The importance sampling distribution is the conditional density of X|D=1, which is proportional to Pr[D=1|x] p(x), where p(x) is the marginal density of X; the importance sampling weights are proportional to q(x) = Pr[D=1|x] (i.e. you invert these weights to get the average).
Still slightly confused.
I think Robins and Ritov have a theorem (cited in your blog link) claiming that to get E[Y] when Y is MAR you need to incorporate info about 1/p(x) somewhere into your procedure (the prior?) or you don’t get uniform consistency. Is your claim that you can get around this via some hierarchical model, e.g.:
How about a hierarchical model, where first we draw a parameter p from the uniform distribution, and then draw g(x) from the uniform distribution over smooth functions with mean value equal to p? This gets you non-constant g(x) in the posterior, while your posteriors of E[g(X)] converge to the truth as quickly as in the Binomial example. Arguing backwards, I would say that such a prior comes closer to capturing my beliefs.
Is this just intuition or did you write this up somewhere? That sounds very interesting.
Why did you start thinking about conditional sampling at all? If estimating E[Y] via importance sampling/inverse weights/covariate adjustment is already something of a difficulty for Bayesians, why think about E[Y | event]? Isn’t that trivially at least as hard?
The confusion may come from mixing up my setup and Robins/Ritov’s setup. There is no missing data in my setup.
I could write up my intuition for the hierarchical model. It’s an almost trivial result if you don’t assume smoothness, since for any x1,...,xn the parameters g(x1)...g(xn) are conditionally independent given p and distributed as F(p), where F is the maximum entropy Beta with mean p (I don’t know the form of the parameters alpha(p) and beta(p) off-hand). Smoothness makes the proof much more difficult, but based on high-dimensional intuition one can be sure that it won’t change the result substantially.
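For what it’s worth, here is a minimal simulation sketch of the non-smooth case. Since each g(x_i) has mean p given p and the x_i are distinct, the y_i are marginally independent Bernoulli(p) given p, so the posterior for p (and hence for E[g(X)]) is exactly the Beta posterior from the binomial example. The Beta parameterization below is an arbitrary stand-in for the maximum-entropy choice; the point is that the posterior for p does not depend on it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Hierarchical model: p ~ Uniform(0,1), then g(x_i) ~ Beta with mean p.
# The concentration c is an arbitrary illustrative choice (an assumption),
# standing in for the unspecified maximum-entropy parameterization.
c = 5.0
true_p = rng.uniform()
g_values = rng.beta(c * true_p, c * (1 - true_p), size=n)   # g(x_1), ..., g(x_n)
y = rng.binomial(1, g_values)                               # y_i ~ Bernoulli(g(x_i))

# Marginally y_i | p ~ Bernoulli(p), independently, so with a Uniform prior the
# posterior for p is Beta(1 + s, 1 + n - s): the plain binomial posterior.
s = y.sum()
posterior_draws = rng.beta(1 + s, 1 + n - s, size=100_000)
lo, hi = np.quantile(posterior_draws, [0.025, 0.975])

print(f"true p (= E[g(X)] given p): {true_p:.3f}")
print(f"posterior 95% interval:     [{lo:.3f}, {hi:.3f}]")
```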
It is quite possible that estimating E[Y] and E[Y|event] are “equivalently hard”, but they are both interesting problems with quite different real-world applications. The reason I chose to write about estimating E[Y|event] is that I think it is easier to explain than importance sampling.
I do not need to model the process f by which that population was selected, only the behaviour of Y within that population?
There are some (including myself and presumably some others on this board) who see this practice as epistemologically dubious. First, how do you decide which aspects of the problem to incorporate into your model? Why should one only try to model E[Y|f(X)=1] and not the underlying function g(x)=E[Y|x]? If you actually had very strong prior information about g(x), say that “I know g(x)=h(x) with probability 1/2 or g(x) = j(x) with probability 1/2” where h(x) and j(x) are known functions, then in that case most statisticians would incorporate the underlying function g(x) in the model; and in that case, data for observations with f(X)=0 might be informative for whether g(x) = h(x) or g(x) = j(x). So if the prior is weak (as it is in my main post) you don’t model the function, and if the prior is strong, you model the function (and therefore make use of all the observations)? Where do you draw the line?
I agree, most statisticians would not model g(x) in the cancer example. But is that because they have limited time and resources (and are possibly lazy) and because using an overcomplicated model would confuse their audience, anyways? Or because they legitimately think that it’s an objective mistake to use a model involving g(x)?
Why should one only try to model E[Y|f(X)=1] and not the underlying function g(x)=E[Y|x]?
What would it tell you if you could? The problem is to estimate Y for a certain population. Therefore, look at that population. I am not seeing a reason why one would consider modelling g, so I am at a loss to answer the question, why not model g?
Jaynes and a few others generally write things like E[ Y | I ] or P( Y | I ) where I represents “all of your background knowledge”, not further analysed. f(X)=1 is playing the role of I here. It’s a placeholder for the stuff we aren’t modelling and within which the statistical reasoning takes place.
Suppose f was a very simple function, for example, the identity. You are asked to estimate E[ Y | X=1 ]. What do the Bayesian and the frequentist do in this case? They are still only being asked about the population for which X=1. Can either of them get better information about E[ Y | X=1 ] by looking (also) at samples where X is not 1?
The example is a simplification of Wasserman’s; I’m not sure if a similar answer can be made there.
BTW, I’m not a statistician, and these aren’t rhetorical questions.
ETA: Here’s an even simpler example, in which it might be possible to demonstrate mathematically the answer to the question, can better information be obtained about E[ Y | X=1 ] by looking at members of the population where X is not 1? Suppose it is given that X and Y have a bivariate normal distribution, with unknown parameters. You take a sample of 1000, and are given a choice of taking it either from the whole population, or from that sliver for which X is in some range 1 +/- ε for ε very small compared with the standard deviation of X. You then use whatever tools you prefer to estimate E[ Y | X=1 ]. Which method of sampling will allow a better estimate?
ETA2: Here is my own answer to my last question, after looking up some formulas concerning linear regression. Let Y1 be the mean of Y in a sample drawn from a narrow neighbourhood of X=1, and let Y2 be the estimate of E[ Y | X=1 ] obtained by doing linear regression on a sample drawn from the whole population. Both samples have the same size n, assumed large enough to ignore small-sample corrections. Then the ratio of the standard error of Y2 to that of Y1 is sqrt( 1 + k^2 ), where k is the difference between 1 and E[X], in units of the standard deviation of X. So at least for this toy example, a narrow sample always works at least as well as a broad one, and is almost always better. Is this a general fact, or are there equally simple examples where the opposite is found?
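Here is a quick simulation sketch checking that sqrt( 1 + k^2 ) ratio numerically. All of the numbers and the data-generating process are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 1000, 2000
mu_x, sd_x, eps = 3.0, 1.0, 0.01           # k = (1 - mu_x) / sd_x = -2
a, b, sd_resid = 1.0, 0.5, 1.0             # Y = a + b*X + residual noise

err_narrow, err_regress = [], []
for _ in range(reps):
    # Narrow sample: X confined to a sliver around X = 1; estimate by the mean of Y.
    x1 = rng.uniform(1 - eps, 1 + eps, size=n)
    y1 = a + b * x1 + rng.normal(0, sd_resid, size=n)
    err_narrow.append(y1.mean() - (a + b))

    # Broad sample: X from the whole population; estimate by regression prediction at X = 1.
    x2 = rng.normal(mu_x, sd_x, size=n)
    y2 = a + b * x2 + rng.normal(0, sd_resid, size=n)
    slope, intercept = np.polyfit(x2, y2, 1)
    err_regress.append(intercept + slope - (a + b))

k = (1 - mu_x) / sd_x
print(f"SE ratio (regression / narrow): {np.std(err_regress) / np.std(err_narrow):.3f}")
print(f"sqrt(1 + k^2):                  {np.sqrt(1 + k ** 2):.3f}")
```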
ETA3: I might have such an example. Suppose that the distribution of Y|X is a + bX + ε(X), where ε(X) is a random variable whose mean is always zero but whose variance is high in the neighbourhood of X=1 and low elsewhere. Then a linear regression on a sample from the full population may allow a better estimate of E[Y|X] than a sample from the neighbourhood of X=1. A sample that avoids that region may do better still. Intuitively, if there’s a lot of noise where you want to look, extrapolate from where there’s less noise.
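And a quick sketch of this heteroscedastic case, again with made-up numbers, in which the broad-sample regression comes out ahead of the narrow sample sitting inside the noisy region:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 1000, 2000
a, b = 1.0, 0.5

def noise_sd(x):
    # Noise is large near X = 1 and small elsewhere.
    return np.where(np.abs(x - 1) < 0.1, 5.0, 0.2)

err_narrow, err_regress = [], []
for _ in range(reps):
    # Narrow sample around X = 1: every observation is in the noisy region.
    x1 = rng.uniform(0.99, 1.01, size=n)
    y1 = a + b * x1 + rng.normal(0, noise_sd(x1))
    err_narrow.append(y1.mean() - (a + b))

    # Broad sample over the whole population: mostly low-noise observations.
    x2 = rng.normal(3.0, 1.0, size=n)
    y2 = a + b * x2 + rng.normal(0, noise_sd(x2))
    slope, intercept = np.polyfit(x2, y2, 1)
    err_regress.append(intercept + slope - (a + b))

print(f"SE of narrow-sample mean:              {np.std(err_narrow):.3f}")
print(f"SE of broad-sample regression at X=1:  {np.std(err_regress):.3f}")
```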
But it’s not clear to me that this bears on the Bayesian vs. frequentist matter. Both of them are faced with the decision to take a wide sample or a narrow one. The frequentist can’t insist that the Bayesian takes notice of structure in the problem that the frequentist chooses to ignore.
There are some (including myself and presumably some others on this board) who see this practice as epistemologically dubious. First, how do you decide which aspects of the problem to incorporate into your model?
That question must be directed at both the Bayesian and the frequentist. In my other comment I gave two toy examples, in one of which looking at a wider sample is provably inferior to looking only at f(X)=1, and one in which the reverse is the case. Anyone faced with the problem of estimating E[Y|f(X)=1] needs to decide, somehow, what observations to make.
How does a Bayesian or a frequentist make that decision?
I didn’t reply to your other comment because although you are making valid points, you have veered off-topic since your initial comment. The question of “which observations to make?” is not a question of inference but rather one of experimental design. If you think this question is relevant to the discussion, it means that you understand neither the original post nor my reply to your initial comment. The questions I am asking have to do with what to infer after the observations have already been made.
Ok. So the scenario is that you are sampling only from the population f(X)=1. Can you exhibit a simple example of the scenario in the section “A non-parametric Bayesian approach” with an explicit, simple class of functions g and distribution over them, for which the proposed procedure arrives at a better estimate of E[ Y | f(X)=1 ] than the sample average?
Is the idea that it is intended to demonstrate, simply that prior knowledge about the joint distribution of X and Y would, combined with the sample, give a better estimate than the sample alone?
Ok. So the scenario is that you are sampling only from the population f(X)=1.
EDIT: Correct, but you should not be too hung up on the issue of conditional sampling. The scenario would not change if we were sampling from the whole population. The important point is that we are trying to estimate a conditional mean of the form E[Y|f(X)=1]. This is a concept commonly seen in statistics. For example, the goal of non-parametric regression is to estimate a curve defined by f(x) = E[Y|X=x].
Can you exhibit a simple example of the scenario in the section “A non-parametric Bayesian approach” with an explicit, simple class of functions g and distribution over them, for which the proposed procedure arrives at a better estimate of E[ Y | f(X)=1 ] than the sample average?
The example I gave in my first reply (where g(x) is known to be either one of two known functions h(x) or j(x)) can easily be extended into the kind of fully specified counterexample you are looking for: I’m not going to bother to do it, because it’s very tedious to write out and it’s frankly a homework-level problem.
Is the idea that it is intended to demonstrate, simply that prior knowledge about the joint distribution of X and Y would, combined with the sample, give a better estimate than the sample alone?
The fact that prior information can improve your estimate is already well-known to statisticians. But statisticians disagree on whether or not you should try to model your prior information in the form of a Bayesian model. Some Bayesians have expressed the opinion that one should always do so. This post, along with Wasserman/Robins/Ritov’s paper, provides counterexamples where the full non-parametric Bayesian model gives much worse results than the “naive” approach which ignores the prior.
The example I gave in my first reply (where g(x) is known to be either one of two known functions h(x) or j(x)) can easily be extended into the kind of fully specified counterexample you are looking for
That looks like a parametric model. There is one parameter, a binary variable that chooses h or j. A belief about that parameter is a probability p that h is the function. Yes, I can see that updating p on sight of the data may give a better estimate of E[Y|f(X)=1], which is known a priori to be either h(1) or j(1).
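As a sanity check on that, here is a minimal simulation sketch of the two-function situation as I understand it. The functions h and j and all numbers are invented, sampling is only from the region where f(X)=1, and the Bayesian estimate is the posterior-weighted mixture of the two known region means.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 50, 5000

# Two candidate conditional-mean functions g(x) = E[Y|x] on the region f(X)=1,
# taken here to be X in [0, 1]; prior probability 1/2 each.
h = lambda x: 0.2 + 0.3 * x     # mean 0.35 over Uniform[0, 1]
j = lambda x: 0.5 + 0.3 * x     # mean 0.65 over Uniform[0, 1]
region_mean = {"h": 0.35, "j": 0.65}

err_bayes, err_avg = [], []
for _ in range(reps):
    name = "h" if rng.random() < 0.5 else "j"
    g = h if name == "h" else j
    x = rng.uniform(0, 1, size=n)            # sampled only from f(X) = 1
    y = rng.binomial(1, g(x))

    # Posterior over {h, j} from the Bernoulli likelihoods.
    def loglik(fn):
        p = fn(x)
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    w_h = 1 / (1 + np.exp(loglik(j) - loglik(h)))
    bayes_est = w_h * region_mean["h"] + (1 - w_h) * region_mean["j"]

    err_bayes.append(bayes_est - region_mean[name])
    err_avg.append(y.mean() - region_mean[name])

print(f"RMSE, model-based estimate: {np.sqrt(np.mean(np.square(err_bayes))):.4f}")
print(f"RMSE, plain sample average: {np.sqrt(np.mean(np.square(err_avg))):.4f}")
```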
I expect it would be similar for small numbers of parameters also, such as a linear relationship between X and Y. Using the whole sample should improve on only looking at the subsample around f(X)=1.
However, in the nonparametric case (I think you are arguing) this goes wrong. The sample size is not large enough to estimate a model that gives a narrow estimate of E[Y|f(X)=1]. Am I understanding you yet?
It seems to me that the problem arises even before getting to the nonparametric case. If a parametric model has too many parameters to estimate from the sample, and the model predictions are everywhere sensitive to all of the parameters (so it cannot be approximated by any simpler model) then trying to estimate E[Y|f(X)=1] by first fitting the model, then predicting from the model, will also not work.
It so clearly will not work that it must be a wrong thing to do. It is not yet clear to me that a Bayesian statistician must do it anyway. The set {Y|f(X)=1} conveys information about E[Y|f(X)=1] directly, independently of the true model (assumed for the purpose of this discussion to be within the model space being considered). Estimating it via fitting a model ignores that information. Is there no Bayesian method of using it?
A partial answer to your question:
So if the prior is weak (as it is in my main post) you don’t model the function, and if the prior is strong, you model the function (and therefore make use of all the observations)? Where do you draw the line?
would be that the less the model helps, the less attention you pay it relative to calculating Mean{Y|f(X)=1}. I don’t have a mathematical formulation of how to do that though.