I do not need to model the process f by which that population was selected, only the behaviour of Y within that population?
There are some (including myself and presumably some others on this board) who see this practice as epistemologically dubious. First, how do you decide which aspects of the problem to incorporate into your model? Why should one only try to model E[Y|f(X)=1] and not the underlying function g(x)=E[Y|x]? If you actually had very strong prior information about g(x), say that “I know g(x)=h(x) with probability 1/2 or g(x)=j(x) with probability 1/2” where h(x) and j(x) are known functions, then most statisticians would incorporate the underlying function g(x) in the model; and in that case, observations with f(X)=0 might be informative about whether g(x)=h(x) or g(x)=j(x). So if the prior is weak (as it is in my main post) you don’t model the function, and if the prior is strong, you model the function (and therefore make use of all the observations)? Where do you draw the line?
I agree, most statisticians would not model g(x) in the cancer example. But is that because they have limited time and resources (and are possibly lazy), and because an overcomplicated model would confuse their audience anyway? Or because they legitimately think that it’s an objective mistake to use a model involving g(x)?
Why should one only try to model E[Y|f(X)=1] and not the underlying function g(x)=E[Y|x]?
What would it tell you if you could? The problem is to estimate Y for a certain population. Therefore, look at that population. I am not seeing a reason why one would consider modelling g, so I am at a loss to answer the question, why not model g?
Jaynes and a few others generally write things like E[ Y | I ] or P( Y | I ) where I represents “all of your background knowledge”, not further analysed. f(X)=1 is playing the role of I here. It’s a placeholder for the stuff we aren’t modelling and within which the statistical reasoning takes place.
Suppose f were a very simple function, for example the identity. You are asked to estimate E[ Y | X=1 ]. What do the Bayesian and the frequentist do in this case? They are still only being asked about the population for which X=1. Can either of them get better information about E[ Y | X=1 ] by looking (also) at samples where X is not 1?
The example is a simplification of Wasserman’s; I’m not sure if a similar answer can be made there.
BTW, I’m not a statistician, and these aren’t rhetorical questions.
ETA: Here’s an even simpler example, in which it might be possible to demonstrate mathematically the answer to the question, can better information be obtained about E[ Y | X=1 ] by looking at members of the population where X is not 1? Suppose it is given that X and Y have a bivariate normal distribution, with unknown parameters. You take a sample of 1000, and are given a choice of taking it either from the whole population, or from that sliver for which X is in some range 1 +/- ε for ε very small compared with the standard deviation of X. You then use whatever tools you prefer to estimate E[ Y | X=1 ]. Which method of sampling will allow a better estimate?
ETA2: Here is my own answer to my last question, after looking up some formulas concerning linear regression. Let Y1 be the mean of Y in a sample drawn from a narrow neighbourhood of X=1, and let Y2 be the estimate of E[ Y | X=1 ] obtained by doing linear regression on a sample drawn from the whole population. Both samples have the same size n, assumed large enough to ignore small-sample corrections. Then the ratio of the standard error of Y2 to that of Y1 is sqrt( 1 + k^2 ), where k is the difference between 1 and E[X], in units of the standard deviation of X. So at least for this toy example, a narrow sample always works at least as well as a broad one, and is almost always better. Is this a general fact, or are there equally simple examples where the opposite is found?
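This ratio is easy to check numerically. Below is a quick Monte Carlo sketch (all parameter values are assumed for illustration; the “narrow” sample is simulated in the limit ε → 0, i.e. with X pinned at 1). It draws both kinds of samples many times and compares the empirical spread of the two estimators of E[ Y | X=1 ] against the predicted sqrt( 1 + k^2 ).

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, sigma = 2.0, 1.5, 1.0   # assumed true line and residual sd
mu_x, sd_x = 0.0, 1.0         # X ~ N(0, 1), so k = (1 - mu_x)/sd_x = 1
n, trials = 1000, 2000

narrow, broad = [], []
for _ in range(trials):
    # Narrow sample: X pinned at 1; the estimator is the plain mean of Y.
    y_narrow = a + b * 1.0 + rng.normal(0, sigma, n)
    narrow.append(y_narrow.mean())

    # Broad sample: X drawn from the whole population; the estimator is
    # the OLS prediction at x = 1.
    x = rng.normal(mu_x, sd_x, n)
    y = a + b * x + rng.normal(0, sigma, n)
    slope, intercept = np.polyfit(x, y, 1)
    broad.append(intercept + slope * 1.0)

k = (1.0 - mu_x) / sd_x
ratio = np.std(broad) / np.std(narrow)
print(ratio, np.sqrt(1 + k**2))   # both close to sqrt(2)
```

With k = 1 the broad-sample estimator’s standard error comes out roughly sqrt(2) times the narrow-sample one, matching the formula.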
ETA3: I might have such an example. Suppose that the distribution of Y|X is a + bX + η(X), where η(X) is a random variable whose mean is always zero but whose variance is high in the neighbourhood of X=1 and low elsewhere. (I write η to avoid clashing with the ε used above for the sampling window.) Then a linear regression on a sample from the full population may allow a better estimate of E[Y|X=1] than the mean of a sample from the neighbourhood of X=1. A sample that avoids that region may do better still. Intuitively, if there’s a lot of noise where you want to look, extrapolate from where there’s less noise.
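As a quick check on this intuition, here is a Monte Carlo sketch with assumed numbers (noise sd 5.0 within 0.3 of X=1, 0.2 elsewhere, X standard normal): regression on the broad sample pins down E[Y|X=1] more tightly than the mean of a narrow sample taken right where the noise is worst.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 1.5               # assumed true line
n, trials = 1000, 1000

def noise_sd(x):
    # High noise near X = 1, low noise elsewhere (assumed numbers).
    return np.where(np.abs(x - 1.0) < 0.3, 5.0, 0.2)

narrow, broad = [], []
for _ in range(trials):
    # Narrow sample: X pinned near 1, right where the noise is worst.
    y_narrow = a + b * 1.0 + rng.normal(0, 5.0, n)
    narrow.append(y_narrow.mean())

    # Broad sample: most points fall in the low-noise region, and the
    # OLS line is extrapolated to x = 1.
    x = rng.normal(0.0, 1.0, n)
    y = a + b * x + rng.normal(0, noise_sd(x))
    slope, intercept = np.polyfit(x, y, 1)
    broad.append(intercept + slope * 1.0)

print(np.std(narrow), np.std(broad))  # the broad regression is tighter
```

The margin here depends on the assumed noise levels; with a milder contrast the narrow sample can win again, which is the point of the two toy examples.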
But it’s not clear to me that this bears on the Bayesian vs. frequentist matter. Both of them are faced with the decision to take a wide sample or a narrow one. The frequentist can’t insist that the Bayesian takes notice of structure in the problem that the frequentist chooses to ignore.
There are some (including myself and presumably some others on this board) who see this practice as epistemologically dubious. First, how do you decide which aspects of the problem to incorporate into your model?
That question must be directed at both the Bayesian and the frequentist. In my other comment I gave two toy examples, in one of which looking at a wider sample is provably inferior to looking only at f(X)=1, and one in which the reverse is the case. Anyone faced with the problem of estimating E[Y|f(X)=1] needs to decide, somehow, what observations to make.
How does a Bayesian or a frequentist make that decision?
I didn’t reply to your other comment because, although you are making valid points, you have veered off-topic since your initial comment. The question of “which observations to make?” is not a question of inference but one of experimental design. If you think this question is relevant to the discussion, then you have understood neither the original post nor my reply to your initial comment. The questions I am asking concern what to infer after the observations have already been made.
Ok. So the scenario is that you are sampling only from the population f(X)=1. Can you exhibit a simple example of the scenario in the section “A non-parametric Bayesian approach” with an explicit, simple class of functions g and distribution over them, for which the proposed procedure arrives at a better estimate of E[ Y | f(X)=1 ] than the sample average?
Is the idea that it is intended to demonstrate, simply that prior knowledge about the joint distribution of X and Y would, combined with the sample, give a better estimate than the sample alone?
Ok. So the scenario is that you are sampling only from the population f(X)=1.
EDIT: Correct, but you should not get too hung up on the issue of conditional sampling. The scenario would not change if we were sampling from the whole population. The important point is that we are trying to estimate a conditional mean of the form E[Y|f(X)=1]. This is a concept commonly seen in statistics: for example, the goal of non-parametric regression is to estimate the curve m(x) = E[Y|X=x].
Can you exhibit a simple example of the scenario in the section “A non-parametric Bayesian approach” with an explicit, simple class of functions g and distribution over them, for which the proposed procedure arrives at a better estimate of E[ Y | f(X)=1 ] than the sample average?
The example I gave in my first reply (where g(x) is known to be either one of two known functions h(x) or j(x)) can easily be extended into the kind of fully specified counterexample you are looking for. I’m not going to bother to do it, because it’s very tedious to write out and it’s frankly a homework-level problem.
Is the idea that it is intended to demonstrate, simply that prior knowledge about the joint distribution of X and Y would, combined with the sample, give a better estimate than the sample alone?
The fact that prior information can improve your estimate is already well-known to statisticians. But statisticians disagree on whether or not you should try to model your prior information in the form of a Bayesian model. Some Bayesians have expressed the opinion that one should always do so. This post, along with Wasserman/Robins/Ritov’s paper, provides counterexamples where the full non-parametric Bayesian model gives much worse results than the “naive” approach that ignores the prior.
The example I gave in my first reply (where g(x) is known to be either one of two known functions h(x) or j(x)) can easily be extended into the kind of fully specified counterexample you are looking for
That looks like a parametric model. There is one parameter, a binary variable that chooses h or j. A belief about that parameter is a probability p that h is the function. Yes, I can see that updating p on sight of the data may give a better estimate of E[Y|f(X)=1], which is known a priori to be either h(1) or j(1).
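That updating can be sketched in a few lines of code. Everything concrete below is an assumption chosen for illustration, not from the thread: h(x)=x, j(x)=1−x, X uniform on [0,1], f(X)=1 meaning X>0.5, Gaussian noise. It just shows that updating p on the whole sample, including the f(X)=0 points, can beat the plain subsample mean.

```python
import numpy as np

rng = np.random.default_rng(2)

h = lambda x: x          # hypothetical candidate functions
j = lambda x: 1.0 - x
sigma, n, trials = 0.5, 200, 500
true_mean = 0.75         # E[h(X) | X > 0.5] for X ~ Uniform(0, 1)

err_naive, err_model = [], []
for _ in range(trials):
    x = rng.uniform(0, 1, n)
    y = h(x) + rng.normal(0, sigma, n)   # truth: g = h

    # Naive estimate: sample mean of Y over the f(X)=1 subsample only.
    naive = y[x > 0.5].mean()

    # Bayesian estimate: update P(g = h) on ALL points (prior 1/2 each),
    # then mix the two known conditional means 0.75 and 0.25.
    llr = (np.sum((y - j(x))**2) - np.sum((y - h(x))**2)) / (2 * sigma**2)
    p = 1.0 / (1.0 + np.exp(-llr))
    model = p * 0.75 + (1 - p) * 0.25

    err_naive.append(naive - true_mean)
    err_model.append(model - true_mean)

rmse = lambda e: float(np.sqrt(np.mean(np.square(e))))
print(rmse(err_naive), rmse(err_model))  # the model estimate is far tighter
```

With these numbers the likelihood ratio identifies h almost immediately, so the model-based estimate collapses onto 0.75 while the subsample mean keeps its sampling noise; that is the one-parameter version of the phenomenon, not the non-parametric one.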
I expect it would be similar for small numbers of parameters also, such as a linear relationship between X and Y. Using the whole sample should improve on only looking at the subsample around f(X)=1.
However, in the nonparametric case (I think you are arguing) this goes wrong. The sample size is not large enough to estimate a model that gives a narrow estimate of E[Y|f(X)=1]. Am I understanding you yet?
It seems to me that the problem arises even before getting to the nonparametric case. If a parametric model has too many parameters to estimate from the sample, and the model predictions are everywhere sensitive to all of the parameters (so it cannot be approximated by any simpler model) then trying to estimate E[Y|f(X)=1] by first fitting the model, then predicting from the model, will also not work.
It so clearly will not work that it must be a wrong thing to do. It is not yet clear to me that a Bayesian statistician must do it anyway. The set {Y|f(X)=1} conveys information about E[Y|f(X)=1] directly, independently of the true model (assumed for the purpose of this discussion to be within the model space being considered). Estimating it via fitting a model ignores that information. Is there no Bayesian method of using it?
A partial answer to your question:
So if the prior is weak (as it is in my main post) you don’t model the function, and if the prior is strong, you model the function (and therefore make use of all the observations)? Where do you draw the line?
would be that the less the model helps, the less attention you pay it relative to calculating Mean{Y|f(X)=1}. I don’t have a mathematical formulation of how to do that though.