You need to specify what kind of randomness you are expecting. For example, the standard ordinary least-squares regression expects no noise at all in the X values and the noise in Y to be additive, iid, and zero-mean Gaussian. If you relax some of these assumptions (e.g. your noise is autocorrelated) some properties of your regression estimates hold and some do not any more.
In the frequentist paradigm I expect you to need something in the maximum-likelihood framework. In the Bayesian paradigm you’ll need to establish a prior and then update on your data in a fairly straightforward way.
In any case you need to be able to write down a model for the process that generates your data. Once you do, you will know the parameters you need to estimate and the form of the model will dictate how the estimation will proceed.
Sure, I’m aware that this is the sort of thing I need to think about. It’s just that right now, even if I do specify exactly how I think the generating process works, I still need to work out how to do the estimation. I somewhat suspect that’s outside of my weight class (I wouldn’t trust myself to be able to invent linear regression, for example). Even if it’s not, if someone else has already done the work, I’d prefer not to duplicate it.
It’s just that right now, even if I do specify exactly how I think the generating process works, I still need to work out how to do the estimation.
If you can implement a good simulation of the generating process, then you are already done—estimating is as simple as ABC. (Aside from the hilariously high computing demands of the naive/exact ABC, I’ve been pleased & impressed just how dang easy it is to use ABC. Complicated interval-censored data? No problem. Even more complicated mixture distribution / multilevel problem where data flips from garbage to highly accurate? Ne pas!)
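(For concreteness, here's a minimal sketch of naive rejection ABC in Python. Everything in it — the generating process, the summary statistics, the priors, the tolerance — is a made-up stand-in rather than anything from this discussion; the point is just the shape of the loop: draw parameters from the prior, simulate, and keep the draws whose simulated data lands close to the observed data.)
import numpy as np

rng = np.random.default_rng(0)

# hypothetical generating process: iid normal "sales" with unknown mean and sd
def simulate(mu, sigma, n):
    return rng.normal(mu, sigma, size=n)

observed = simulate(10.0, 2.0, 500)                      # stand-in for the real data
obs_stats = np.array([observed.mean(), observed.std()])  # summary statistics

accepted = []
for _ in range(100_000):
    mu, sigma = rng.uniform(0.0, 20.0), rng.uniform(0.1, 5.0)  # draw from the priors
    sim = simulate(mu, sigma, observed.size)
    sim_stats = np.array([sim.mean(), sim.std()])
    if np.linalg.norm(sim_stats - obs_stats) < 0.1:            # crude tolerance
        accepted.append((mu, sigma))                           # keep near-matches

accepted = np.array(accepted)
print(len(accepted), "accepted draws; their spread approximates the posterior over (mu, sigma)")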
Even if you know only the generating process and not an estimation procedure, you might be able to get away with just feeding a parametrization of the generating process into an MCMC sampler, and seeing whether the sampler converges on sensible posterior distributions for the parameters.
I like Stan for this; you write a file telling Stan the data’s structure, the parameters of the generating process, and how the generating process produced the data, and Stan turns it into an MCMC sampling program you can run.
If the model isn’t fully identified you can get problems like the sampler bouncing around the parameter space indefinitely without ever converging on a decent posterior. This could be a problem here; to illustrate, suppose I write out my version of skeptical_lurker’s formulation of the model in the obvious naive way —
sales(city, widget) = α × β(city) × γ(widget) + noise(city, widget)
— where the brackets capture city & widget-type indices, I have a β for every city and a γ for every widget type, and I assume there are no odd correlations between the different parameters.
This version of the model won’t have a single optimal solution! If the model finds a promising set of parameter values, it can always produce another equally good set of parameter values by halving all of the β values and doubling all of the γ values; or by halving α and the γ values while quadrupling the β values; or by...you get the idea. A sampler might end up pulling a Flying Dutchman, swooping back and forth along a hyper-hyperbola in parameter space.
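(In symbols: for any k > 0, α × β(city) × γ(widget) = (α/k) × (k·β(city)) × γ(widget) = α × (β(city)/k) × (k·γ(widget)), so the likelihood can’t tell these rescalings apart.)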
I think this sort of under-identification isn’t necessarily a killer in Stan if your parameter priors are unimodal and not too diffuse, because the priors end up as a lodestar for the sampler, but I’m not an expert. To be safe, I could avoid the issue by picking a specific city as a reference city and a specific widget type as a reference widget type, with the other cities’ β and the other widgets’ γ effectively defined as proportional to those:
if city == 1 and widget == 1: sales(city, widget) = α + noise(city, widget)
else, if city == 1: sales(city, widget) = α × γ(widget) + noise(city, widget)
else, if widget == 1: sales(city, widget) = α × β(city) + noise(city, widget)
else: sales(city, widget) = α × β(city) × γ(widget) + noise(city, widget)
Then run the sampler and back out estimates of the overall city-level sales fractions from the parameter estimates (1 / (1+sum(β)), β(2) / (1+sum(β)), β(3) / (1+sum(β)), etc.).
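(In case it helps, here's roughly how that back-out step could look in Python/numpy, assuming you've extracted posterior draws of the non-reference cities' β values into an array — the beta_draws below is a random placeholder, not real sampler output:)
import numpy as np

# placeholder for posterior draws of beta: shape (num_draws, Ncities - 1),
# one column per non-reference city (the reference city's beta is implicitly 1)
beta_draws = np.abs(np.random.default_rng(1).normal(1.0, 0.2, size=(4000, 2)))

denom = 1.0 + beta_draws.sum(axis=1)                      # 1 + sum(beta), per draw
shares = np.column_stack([1.0 / denom,                    # reference city's share
                          beta_draws / denom[:, None]])   # the other cities' shares

print(shares.mean(axis=0))  # posterior-mean sales fraction for each city; rows sum to 1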
And I’d probably make the noise term multiplicative and non-negative, instead of additive, to prevent the sampler from landing on a negative sales figure, which is presumably nonsensical in this context.
Apologies if I’m rambling at you about something you already know about, or if I’ve focused so much on one specific version of the toy example that this is basically useless. Hopefully this is of some interest...
And I’d probably make the noise term multiplicative and non-negative, instead of additive, to prevent the sampler from landing on a negative sales figure, which is presumably nonsensical in this context.
I know JAGS lets you put interval limits on terms, which lets you specify that some variable must be non-negative (it looks something like dist(x,y)[0,∞]), so maybe Stan has something similar.
It does. However...
I see now I could’ve described the model better. In Stan I don’t think you can literally write the observed data as the sum of the signal and the noise; I think the data always has to be incorporated into the model as something sampled from a probability distribution, so you’d actually translate the simplest additive model into Stan-speak as something like
data {
  int<lower=1> N;
  int<lower=1> Ncities;
  int<lower=1> Nwidgets;
  int<lower=1> city[N];
  int<lower=1> widget[N];
  real<lower=0> sales[N];
}
parameters {
  real<lower=0> alpha;
  real beta[Ncities];
  real gamma[Nwidgets];
  real<lower=0> sigma;
}
model {
  // put code here to define explicit prior distributions for parameters
  for (n in 1:N) {
    // the tilde means the left side's sampled from the right side
    sales[n] ~ normal(alpha + beta[city[n]] + gamma[widget[n]], sigma);
  }
}
which could give you a headache because a normal distribution puts nonzero probability density on negative sales values, even though sales[n] is declared with a zero lower bound. The sampler never actually proposes values for sales[n] (it’s data, not a parameter), but the model is still quietly claiming that negative sales were possible, and I’d guess the quality of the inference suffers a bit because the truncation at zero isn’t being accounted for.
And I don’t know a way to tell Stan, “ah, the normal error has to be non-negative”, since the error isn’t explicitly broken out into a separate term on which one can set bounds; the error’s folded into the procedure of sampling from a normal distribution.
The way to avoid this that clicks most with me is to bake the non-negativity into the model’s heart by sampling sales[n] from a distribution with non-negative support:
for (n in 1:N) {
  // note: alpha * beta[city[n]] * gamma[widget[n]] has to stay positive for the log
  // to be defined, so beta and gamma would presumably need <lower=0> bounds in the
  // parameters block for this version
  sales[n] ~ lognormal(log(alpha * beta[city[n]] * gamma[widget[n]]), sigma);
}
Of course, bearing in mind the last time I indulged my lognormal fetish, this is likely to have trouble too, for the different reason that a lognormal excludes the possibility of exactly zero sales, and you’d want to either zero-inflate the model or add a fixed nonzero offset to sales before putting the data into Stan. But a lognormal does eliminate the problem of sampling negative values for sales[n], and aligns nicely with multiplicative city & widget effects.
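(And for completeness, a hedged sketch of how one might actually run a Stan program like the one above from Python, assuming the cmdstanpy interface; the file name sales.stan and the numbers are made up, but the data names match the data block earlier:)
from cmdstanpy import CmdStanModel

data = {
    "N": 6,
    "Ncities": 2,
    "Nwidgets": 3,
    "city":   [1, 1, 1, 2, 2, 2],                 # 1-based indices, as the data block expects
    "widget": [1, 2, 3, 1, 2, 3],
    "sales":  [10.0, 4.5, 7.2, 20.1, 9.8, 13.9],  # made-up sales figures
}

model = CmdStanModel(stan_file="sales.stan")   # the Stan program above, saved to a file
fit = model.sample(data=data, chains=4, iter_sampling=1000)

print(fit.summary())                     # R-hat etc. -- a sanity check against the Flying Dutchman
beta_draws = fit.stan_variable("beta")   # posterior draws of beta, shape (draws, Ncities)
print(beta_draws.mean(axis=0))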
Thanks to both you and gwern. It doesn’t look like this is the direction I’m going in for this problem, but it’s something I’m glad to know about.