Ok. So the scenario is that you are sampling only from the population f(X)=1. Can you exhibit a simple example of the scenario in the section “A non-parametric Bayesian approach” with an explicit, simple class of functions g and distribution over them, for which the proposed procedure arrives at a better estimate of E[ Y | f(X)=1 ] than the sample average?
Is the idea that it is intended to demonstrate, simply that prior knowledge about the joint distribution of X and Y would, combined with the sample, give a better estimate than the sample alone?
Ok. So the scenario is that you are sampling only from the population f(X)=1.
EDIT: Correct, but don’t get too hung up on the issue of conditional sampling. The scenario would not change if we were sampling from the whole population. The important point is that we are trying to estimate a conditional mean of the form E[Y|f(X)=1]. Conditional means of this kind are common in statistics. For example, the goal of non-parametric regression is to estimate the curve E[Y|X=x] as a function of x.
Can you exhibit a simple example of the scenario in the section “A non-parametric Bayesian approach” with an explicit, simple class of functions g and distribution over them, for which the proposed procedure arrives at a better estimate of E[ Y | f(X)=1 ] than the sample average?
The example I gave in my first reply (where g(x) is known to be either one of two known functions h(x) or j(x)) can easily be extended into the kind of fully specified counterexample you are looking for. I’m not going to write it out, because it’s tedious and frankly a homework-level problem.
Is the idea that it is intended to demonstrate, simply that prior knowledge about the joint distribution of X and Y would, combined with the sample, give a better estimate than the sample alone?
The fact that prior information can improve your estimate is already well known to statisticians. But statisticians disagree on whether you should try to express your prior information in the form of a Bayesian model. Some Bayesians have expressed the opinion that one should always do so. This post, along with Wasserman/Robins/Ritov’s paper, provides counterexamples where the full non-parametric Bayesian model gives much worse results than the “naive” approach which ignores the prior.
The example I gave in my first reply (where g(x) is known to be either one of two known functions h(x) or j(x)) can easily be extended into the kind of fully specified counterexample you are looking for
That looks like a parametric model. There is one parameter, a binary variable that chooses h or j. A belief about that parameter is a probability p that h is the function. Yes, I can see that updating p after seeing the data may give a better estimate of E[Y|f(X)=1], which is known a priori to be either h(1) or j(1).
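To make the one-parameter case concrete, here is a minimal sketch of the update on p. Everything specific in it is an assumption made for illustration and does not come from the thread: the particular choices h(x) = 0.5x and j(x) = x², the Gaussian noise model Y = g(X) + noise with known standard deviation `sigma`, the 50/50 prior, and the helper names `posterior_prob_h` and `model_based_estimate`.

```python
import math

# Hypothetical candidate functions; the thread names h and j but never
# specifies them.
def h(x):
    return 0.5 * x

def j(x):
    return x * x

def posterior_prob_h(xs, ys, sigma, prior_h=0.5):
    """Return P(g = h | data), assuming Y = g(X) + Normal(0, sigma^2) noise."""
    log_h = math.log(prior_h)
    log_j = math.log(1.0 - prior_h)
    for x, y in zip(xs, ys):
        log_h -= (y - h(x)) ** 2 / (2.0 * sigma ** 2)
        log_j -= (y - j(x)) ** 2 / (2.0 * sigma ** 2)
    # Subtract the larger log-weight before exponentiating to avoid underflow.
    top = max(log_h, log_j)
    w_h = math.exp(log_h - top)
    w_j = math.exp(log_j - top)
    return w_h / (w_h + w_j)

def model_based_estimate(xs, ys, sigma, x0=1.0):
    """Posterior mean of g(x0): a p-weighted mix of h(x0) and j(x0)."""
    p = posterior_prob_h(xs, ys, sigma)
    return p * h(x0) + (1.0 - p) * j(x0)

# Three observations taken away from x = 1 still inform the estimate at x = 1.
xs = [0.2, 0.5, 0.8]
ys = [0.1, 0.25, 0.4]   # these lie exactly on h, so the data favour h
p = posterior_prob_h(xs, ys, sigma=0.1)
estimate = model_based_estimate(xs, ys, sigma=0.1)
```

With these made-up numbers p comes out near 0.955, so the estimate at x = 1 sits close to h(1) = 0.5 even though no observation was taken near x = 1. That is the sense in which the whole sample helps when the model has only one parameter.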
I expect it would be similar for small numbers of parameters also, such as a linear relationship between X and Y. Using the whole sample should improve on only looking at the subsample around f(X)=1.
However, in the nonparametric case (I think you are arguing) this goes wrong. The sample size is not large enough to estimate a model that gives a narrow estimate of E[Y|f(X)=1]. Am I understanding you yet?
It seems to me that the problem arises even before getting to the nonparametric case. If a parametric model has too many parameters to estimate from the sample, and the model predictions are everywhere sensitive to all of the parameters (so it cannot be approximated by any simpler model), then trying to estimate E[Y|f(X)=1] by first fitting the model, then predicting from the model, will also not work.
It so clearly will not work that it must be a wrong thing to do. It is not yet clear to me that a Bayesian statistician must do it anyway. The set {Y|f(X)=1} conveys information about E[Y|f(X)=1] directly, independently of the true model (assumed for the purpose of this discussion to be within the model space being considered). Estimating it via fitting a model ignores that information. Is there no Bayesian method of using it?
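One way to see the "too many parameters" failure in numbers is the sketch below. It is not a Bayesian procedure and not the construction from the post. It assumes, purely for illustration, a polynomial with as many coefficients as data points, fitted by exact interpolation on 15 equally spaced points, and it compares how much observation noise reaches the prediction with how much reaches a plain local average. The names `lagrange_weights` and `noise_amplification` are hypothetical helpers.

```python
import math

def lagrange_weights(xs, x0):
    """Weights w_i such that the interpolating polynomial at x0 is sum(w_i * y_i)."""
    weights = []
    for i, xi in enumerate(xs):
        w = 1.0
        for k, xk in enumerate(xs):
            if k != i:
                w *= (x0 - xk) / (xi - xk)
        weights.append(w)
    return weights

def noise_amplification(weights):
    """Std. dev. of sum(w_i * noise_i) per unit std. dev. of independent noise."""
    return math.sqrt(sum(w * w for w in weights))

n = 15
xs = [i / (n - 1) for i in range(n)]   # 15 equally spaced design points on [0, 1]
x0 = 0.97                              # prediction point near the edge

# Model with n free parameters: a degree-(n-1) polynomial through every point.
full_model = noise_amplification(lagrange_weights(xs, x0))

# Direct alternative: average the k observations closest to x0.
k = 3
local_average = noise_amplification([1.0 / k] * k)
```

Both predictions are linear in the observed Y values, so their noise sensitivity can be read off the weights without simulating anything. Here `full_model` is far above 1 while `local_average` is 1/sqrt(3), below 1. This comparison only covers variance. The local average can be biased if the true curve is not flat near x0, which the sketch ignores.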
A partial answer to your question:
So if the prior is weak (as it is in my main post) you don’t model the function, and if the prior is strong, you model the function (and therefore make use of all the observations)? Where do you draw the line?
would be that the less the model helps, the less attention you pay it relative to calculating Mean{Y|f(X)=1}. I don’t have a mathematical formulation of how to do that though.
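One standard candidate for such a formulation is inverse-variance (precision) weighting, sketched below. This is a swapped-in textbook technique, not something proposed in the thread, and it rests on assumptions that may not hold here: both estimates are unbiased, their variances are known, and they are independent. In the scenario under discussion the model-based estimate and Mean{Y|f(X)=1} share data, so independence in particular is only an approximation. The function name `precision_weighted` and all numbers are hypothetical.

```python
def precision_weighted(model_est, model_var, sample_mean, sample_var):
    """Combine two estimates, weighting each by the inverse of its variance."""
    w_model = 1.0 / model_var
    w_sample = 1.0 / sample_var
    return (w_model * model_est + w_sample * sample_mean) / (w_model + w_sample)

# A vague model (large variance) barely moves the answer off the sample mean.
vague = precision_weighted(model_est=0.9, model_var=100.0,
                           sample_mean=0.5, sample_var=0.01)

# A sharp model (small variance) dominates a noisier sample mean.
sharp = precision_weighted(model_est=0.9, model_var=0.0001,
                           sample_mean=0.5, sample_var=0.01)
```

In the first call the result stays almost exactly at the sample mean of 0.5, and in the second it moves almost all the way to the model's 0.9. That matches the informal rule: the less the model helps, the less weight it gets.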