Wild Speculation:
I am 70% confident that if we were smarter, we would not need it.
Suppose you have some data for which you (magically) know the true likelihood and prior. You would then have uncertainty coming both from the data and from the parameters of the model, and accounting for this extra parameter uncertainty changes the form of the posterior predictive distribution, for example from a normal to a t-distribution.
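To make the normal-to-t example concrete (this is just the standard conjugate result, not something claimed in the post): the posterior predictive averages the likelihood over the posterior,

$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \mu, \sigma^2)\, p(\mu, \sigma^2 \mid y)\, d\mu\, d\sigma^2,$$

and for normally distributed data with unknown mean and variance under the usual noninformative prior $p(\mu, \sigma^2) \propto \sigma^{-2}$ this integral comes out as a Student-t,

$$\tilde{y} \mid y \sim t_{n-1}\!\left(\bar{y},\ s^2\left(1 + \tfrac{1}{n}\right)\right),$$

whose heavier tails are exactly the extra uncertainty from not knowing $\mu$ and $\sigma^2$.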
In the real world we assume a likelihood and guess a prior, and even with simple models such as y ~ ax + b we usually model the residual errors as a normal distribution, so we lose some of that uncertainty, and our residual errors end up different in and out of sample.
Practical Reason
Also, a model with more* parameters will always have smaller residual errors (unless you screw up the prior), and thus the in-sample predictions will seem better.
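As a quick illustration of both points above, here is a minimal sketch (my own toy data, with ordinary least squares standing in for a Bayesian fit with flat priors): the in-sample residual error only shrinks as parameters are added, while the error on fresh data from the same process eventually gets worse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a smooth signal plus noise (invented for illustration).
x_train = rng.uniform(-1, 1, size=30)
y_train = np.sin(3 * x_train) + rng.normal(scale=0.3, size=30)
x_test = rng.uniform(-1, 1, size=10_000)
y_test = np.sin(3 * x_test) + rng.normal(scale=0.3, size=10_000)

for degree in (1, 3, 5, 9, 15):
    # Least-squares polynomial fit, i.e. maximum likelihood with flat priors.
    coeffs = np.polyfit(x_train, y_train, degree)
    rmse_in = np.sqrt(np.mean((y_train - np.polyval(coeffs, x_train)) ** 2))
    rmse_out = np.sqrt(np.mean((y_test - np.polyval(coeffs, x_test)) ** 2))
    print(degree, round(rmse_in, 3), round(rmse_out, 3))

# rmse_in decreases monotonically with degree; rmse_out stops improving and
# then worsens, which is the in-sample vs out-of-sample gap described above.
```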
Modern Bayesians have found two ways to solve this issue:
WAIC: uses information theory to see how well the posterior predictive distribution captures the generative process, and applies a penalty for the effective number of parameters.
PSIS-LOO: does a very fast approximation of LOO-CV where, for each yi, you factor yi's contribution out of the posterior to get an out-of-sample posterior predictive estimate for yi (a rough from-scratch sketch of both computations follows this list).
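Here is that sketch, assuming you already have an S x N matrix of pointwise log-likelihoods log p(yi | theta_s) from S posterior draws (the function and variable names are mine; in practice a library such as ArviZ provides these estimators):

```python
import numpy as np

def waic(log_lik):
    """WAIC from an (S, N) matrix of pointwise log-likelihoods."""
    S = log_lik.shape[0]
    # lppd_i = log( (1/S) * sum_s p(y_i | theta_s) ), computed in log space.
    lppd = np.sum(np.logaddexp.reduce(log_lik, axis=0) - np.log(S))
    # Effective number of parameters: posterior variance of log p(y_i | theta_s),
    # summed over observations.
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2 * (lppd - p_waic)  # deviance scale; lower is better

def naive_is_loo(log_lik):
    """Importance-sampling LOO (without the Pareto smoothing that PSIS adds)."""
    S = log_lik.shape[0]
    # For each y_i, reweight the posterior draws by 1 / p(y_i | theta_s) to
    # approximate the posterior with y_i left out, giving log p(y_i | y_-i).
    loo_i = -(np.logaddexp.reduce(-log_lik, axis=0) - np.log(S))
    return -2 * np.sum(loo_i)
```

PSIS-LOO replaces these raw importance weights with Pareto-smoothed ones, which is what makes the estimate stable in practice.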
Bayesian models, just like frequentist models, are vulnerable to overfitting if they have many parameters and weak priors.
*Some models have parameters that constrain other parameters, so what I mean is "effective" parameters as estimated by WAIC or PSIS-LOO; parameters with strong priors are heavily constrained and count as much less than 1.
"… a model with more* parameters will always have smaller residual errors (unless you screw up the prior), and thus the in-sample predictions will seem better"
Not always (unless you're sweeping all exceptions under "unless you screw up the prior"). With more parameters, the prior probability for the region of the parameter space that fits the data well may be smaller, so the posterior may be mostly outside this region. Note that "smaller residual errors" isn't a very clear concept in Bayesian terms: there's a posterior distribution of residual error on the training set, not a single value. (There is a single residual error when making the Bayesian prediction averaged over the posterior, but this residual error also doesn't necessarily go down when the model becomes more complex.)
"Bayesian models, just like frequentist models, are vulnerable to overfitting if they have many parameters and weak priors."
Actually, Bayesian models with many parameters and weak priors tend to underfit the data (assuming that by "weak" you mean "vague" / "high variance"), since weak priors in a high-dimensional space give high prior probability to the data not being fit well.
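A tiny prior-predictive check of that last sentence (my own toy setup, not from the comment): with a vague prior on many regression coefficients, almost all of the prior mass corresponds to fits that are wildly worse than the scale of the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 observations, 30 predictors; only the first predictor carries signal.
n, p = 50, 30
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(scale=1.0, size=n)

# Draw coefficient vectors from a vague N(0, 10^2) prior and look at the
# root-mean-square error of the fits those prior draws imply.
prior_draws = rng.normal(scale=10.0, size=(2000, p))
prior_rmse = np.sqrt(((X @ prior_draws.T - y[:, None]) ** 2).mean(axis=0))

print(np.median(prior_rmse))  # on the order of 50: typical prior draws fit terribly
print(y.std())                # about 1.4: the scale of the data itself
```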
It seems to me like there are two distinct issues: estimating the error of a model on future data, and model comparison.
1. It would be useful to know the most likely value of the error on future data before we actually use the model; but is this what test-set error represents, i.e. the most likely value of the error on future data?
2. Why do we use techniques like WAIC and PSIS-LOO when we can (and should?) simply use p(M|D), i.e. Bayes factors, Ockham factors, model evidence, etc.? These techniques seem to work well against overfitting (see image below). Once we find the more plausible model, we use it to make predictions.
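For what the p(M|D) route looks like in the simplest case, here is a toy conjugate example (invented for illustration, not from the thread): comparing a "fair coin" model against an "unknown bias" model via their marginal likelihoods. The Ockham penalty appears automatically, because the second model spreads its prior mass over every possible bias.

```python
import numpy as np
from math import comb
from scipy.special import betaln

n, k = 20, 14  # hypothetical data: 14 heads in 20 flips

# M1: the coin is fair (theta = 0.5 exactly, no free parameters).
log_ev_m1 = np.log(comb(n, k)) + n * np.log(0.5)

# M2: unknown bias theta with a uniform Beta(1, 1) prior.
# p(D | M2) = C(n, k) * B(k + 1, n - k + 1) / B(1, 1).
log_ev_m2 = np.log(comb(n, k)) + betaln(k + 1, n - k + 1)

bayes_factor = np.exp(log_ev_m1 - log_ev_m2)  # BF_12; > 1 favours the fair coin
print(bayes_factor)
```

For these made-up counts the Bayes factor comes out a little below 1, i.e. the data only mildly favour the unknown-bias model.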