I have a variant on linear regression. Can anyone tell me what it’s called / point me to more info about it / tell me that it’s (trivially reducible to / nothing like) standard linear regression?
Standard linear regression has a known matrix X = x(i,j) and a known target vector Y = y(j), and seeks to find weights W = w(i) to best approximate X * W = Y.
In my version, instead of knowing the values of the input variables (X), I know how much each contributes to the output. So I don’t know x(i,j) but I kind of know x(i,j) * w(i), except that W isn’t really a thing. And I know some structure on X: every value is either 0, or equal to every other value in its row. (I can tell those apart because the 0s contribute zero and the others contribute nonzero.) I want to find the best W to approximate X * W = Y, but that question will depend on what I want to do with the uncertainty in X, and I’m not sure about that.
I should probably avoid giving my specific scenario, so think widget sales. You can either sell a widget in a city or not. Sales of a widget will be well-correlated between cities: if a widget sells well in New York, it will probably sell well in Detroit and in Austin and so on, with the caveat that selling well in New York means a lot more sales than selling well in Austin. I have a list of previous widgets, and how much they sold in each city. Received wisdom is that a widget will sell about twice as much in New York as in Detroit, and a third more than in Austin, but I want to improve on the received wisdom.
So I’m told that a widget will sell 10 million, and that it will be sold in (list of cities). I want to come up with the best estimate for its sales in New York, its sales in Austin, etc.
Hopefully this is clear?
Sounds like your problem is fitting a sparse matrix, i.e. where you want many entries to be 0. This is usually called compressed sensing, and it’s non-trivial.
Well, it’s going to depend on some specifics and on how much data you have (with the implications for the complexity of the model that you can afford), but the most basic approach that comes to my mind doesn’t involve any regression at all.
Given your historical data (“I have a list of previous widgets, and how much they sold in each city”) you can convert the sales per widget per city into percentages (e.g. widget A sold 27% in New York, 15% in Austin, etc.) and then look at the empirical distribution of these percentages by city.
The next step would be introducing some conditionality—e.g. checking whether the sales percentage per city depends on the number of cities where the widget was sold.
Generally speaking, you want to find some structure in your percentages by city, but what kind of structure is there really depends on your particular data.
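For concreteness, here is a rough sketch of that first step in Python/pandas; the table layout and the numbers are made up purely for illustration.

# Illustrative only: convert per-widget, per-city sales into within-widget
# percentages, then look at how those percentages are distributed by city.
import pandas as pd

sales = pd.DataFrame({
    "widget": ["A", "A", "B", "B", "C", "C"],
    "city":   ["New York", "Austin", "New York", "Detroit", "Austin", "Detroit"],
    "sales":  [270.0, 150.0, 600.0, 300.0, 90.0, 60.0],
})

# Each city's share of its widget's total sales.
sales["share"] = sales["sales"] / sales.groupby("widget")["sales"].transform("sum")

# Empirical distribution of the shares, city by city.
print(sales.groupby("city")["share"].describe())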
The problem—at least the one I’m currently focusing on, which might not be the one I need to focus on—is converting percentages-by-city on a collection of subsets, into percentages-by-city in general. I’m currently assuming that there’s no structure beyond what I specified, partly because I’m not currently able to take advantage of it if there is.
A toy example, with no randomness, would be—widget A sold 2⁄3 in city X and 1⁄3 in city Y. Widget B sold 6⁄7 in city X and 1⁄7 in city Z. Widget C sold 3⁄4 in city Y and 1⁄4 in city Z. Widget D is to be sold in cities X, Y and Z. What fraction of its sales should I expect to come from each city?
The answer here is 0.6 from X, 0.3 from Y and 0.1 from Z, but I’m looking for some way to generate these in the face of randomness. (My first thought was to take averages—e.g. city X got an average of (2/3 + 6⁄7)/2 = 16⁄21 of the sales—and then normalize those averages. But none of the AM, GM and HM gave the correct results on the toy version, so I don’t expect them to do well with high randomness. It might be that with more data they come closer to being correct, so that’s something I’ll look into if no one can point me to existing literature.)
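As a quick sanity check of that averaging idea on the toy numbers, a rough sketch:

# The toy example's true fractions are 0.6, 0.3, 0.1; normalizing the
# arithmetic means of the observed within-widget fractions gives something else.
shares = {
    "X": [2/3, 6/7],   # widget A, widget B
    "Y": [1/3, 3/4],   # widget A, widget C
    "Z": [1/7, 1/4],   # widget B, widget C
}
avg = {city: sum(v) / len(v) for city, v in shares.items()}
total = sum(avg.values())
print({city: round(a / total, 3) for city, a in avg.items()})
# prints roughly {'X': 0.508, 'Y': 0.361, 'Z': 0.131} rather than 0.6 / 0.3 / 0.1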
So, there’s some sort of function mapping from (cities,widgets)->sales, plus randomness. In general, I would say use some standard machine learning technique, but if you know the function is linear you can do it directly.
So:
sales=constant x cityvalue x widgetvalue + noise
d sales/d cityvalue = constant x widgetvalue
d sales/d widgetvalue = constant x cityvalue
(all vectors)
So then you pick random starting values of cityvalue, widgetvalue, calculate the error and do gradient descent.
Or just plug
Error = sum((constant x cityvalue x widgetvalue - sales)^2)
Into an optimisation function, which will be slower but quicker to code.
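As a rough sketch of that second route on the toy example above, using scipy.optimize.minimize; the parametrization, starting values and optimizer defaults here are purely illustrative.

# Fit fraction ≈ cityvalue[c] * widgetvalue[w] to the toy data by least squares;
# the overall constant is absorbed into the widget values here.
import numpy as np
from scipy.optimize import minimize

cities = ["X", "Y", "Z"]
widgets = ["A", "B", "C"]
# Observed within-widget sales fractions from the toy example.
obs = {
    ("A", "X"): 2/3, ("A", "Y"): 1/3,
    ("B", "X"): 6/7, ("B", "Z"): 1/7,
    ("C", "Y"): 3/4, ("C", "Z"): 1/4,
}

def error(params):
    cityvalue = dict(zip(cities, params[:3]))
    widgetvalue = dict(zip(widgets, params[3:]))
    return sum((cityvalue[c] * widgetvalue[w] - s) ** 2 for (w, c), s in obs.items())

result = minimize(error, x0=np.ones(6))
cityvalue = result.x[:3]
print(cityvalue / cityvalue.sum())  # should land near [0.6, 0.3, 0.1], up to optimizer tolerance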
Thank you! This seems like the conceptual shift I needed.
You need to specify what kind of randomness you are expecting. For example, standard ordinary least-squares regression expects no noise at all in the X values and the noise in Y to be additive, iid, and zero-mean Gaussian. If you relax some of these assumptions (e.g. your noise is autocorrelated), some properties of your regression estimates still hold and some no longer do.
In the frequentist paradigm I expect you to need something in the maximum-likelihood framework. In the Bayesian paradigm you’ll need to establish a prior and then update on your data in a fairly straightforward way.
In any case you need to be able to write down a model for the process that generates your data. Once you do, you will know the parameters you need to estimate and the form of the model will dictate how the estimation will proceed.
Sure, I’m aware that this is the sort of thing I need to think about. It’s just that right now, even if I do specify exactly how I think the generating process works, I still need to work out how to do the estimation. I somewhat suspect that’s outside of my weight class (I wouldn’t trust myself to be able to invent linear regression, for example). Even if it’s not, if someone else has already done the work, I’d prefer not to duplicate it.
It’s just that right now, even if I do specify exactly how I think the generating process works, I still need to work out how to do the estimation.
If you can implement a good simulation of the generating process, then you are already done—estimating is as simple as ABC. (Aside from the hilariously high computing demands of the naive/exact ABC, I’ve been pleased & impressed just how dang easy it is to use ABC. Complicated interval-censored data? No problem. Even more complicated mixture distribution / multilevel problem where data flips from garbage to highly accurate? Ne pas!)
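To gesture at what that looks like, here is a rough rejection-ABC sketch; the Dirichlet prior over city weights, the multinomial sales simulator, the toy counts and the tolerance are all illustrative assumptions rather than anything fixed by the problem.

# Rejection ABC: draw parameters from a prior, simulate data, keep the draws
# whose simulations land close to what was actually observed.
import numpy as np

rng = np.random.default_rng(0)
n_cities = 3

# Toy observations: (indices of the cities the widget sold in, sales in those cities).
observed = [
    (np.array([0, 1]), np.array([20, 10])),   # widget A: cities X, Y
    (np.array([0, 2]), np.array([60, 10])),   # widget B: cities X, Z
    (np.array([1, 2]), np.array([30, 10])),   # widget C: cities Y, Z
]

def simulate(weights):
    # Simulate each widget's sales split across its cities, given candidate city weights.
    sims = []
    for cities, sales in observed:
        p = weights[cities] / weights[cities].sum()
        sims.append(rng.multinomial(sales.sum(), p))
    return sims

def distance(sim):
    return sum(np.abs(s - sales).sum() for s, (_, sales) in zip(sim, observed))

accepted = []
for _ in range(40000):
    w = rng.dirichlet(np.ones(n_cities))   # draw candidate city weights from the prior
    if distance(simulate(w)) < 25:         # keep draws whose simulated sales look like the data
        accepted.append(w)

if accepted:
    print(len(accepted), np.mean(accepted, axis=0))  # rough posterior mean of the city weights
else:
    print("nothing accepted; loosen the tolerance")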
Even if you know only the generating process and not an estimation procedure, you might be able to get away with just feeding a parametrization of the generating process into an MCMC sampler, and seeing whether the sampler converges on sensible posterior distributions for the parameters.
I like Stan for this; you write a file telling Stan the data’s structure, the parameters of the generating process, and how the generating process produced the data, and Stan turns it into an MCMC sampling program you can run.
If the model isn’t fully identified you can get problems like the sampler bouncing around the parameter space indefinitely without ever converging on a decent posterior. This could be a problem here; to illustrate, suppose I write out my version of skeptical_lurker’s formulation of the model in the obvious naive way —
sales(city, widget) = α × β(city) × γ(widget) + noise(city, widget)
— where brackets capture city & widget-type indices, I have a β for every city and a γ for every widget type, and I assume there are no odd correlations between the different parameters.
This version of the model won’t have a single optimal solution! If the model finds a promising set of parameter values, it can always produce another equally good set of parameter values by halving all of the β values and doubling all of the γ values; or by halving α and the γ values while quadrupling the β values; or by...you get the idea. A sampler might end up pulling a Flying Dutchman, swooping back and forth along a hyper-hyperbola in parameter space.
I think this sort of under-identification isn’t necessarily a killer in Stan if your parameter priors are unimodal and not too diffuse, because the priors end up as a lodestar for the sampler, but I’m not an expert. To be safe, I could avoid the issue by picking a specific city and a specific widget type as references, with the other cities’ β and other widget types’ γ effectively defined as proportional to those:
if city == 1 and widget == 1: sales(city, widget) = α + noise(city, widget)
else, if city == 1: sales(city, widget) = α × γ(widget) + noise(city, widget)
else, if widget == 1: sales(city, widget) = α × β(city) + noise(city, widget)
else: sales(city, widget) = α × β(city) × γ(widget) + noise(city, widget)
Then run the sampler and back out estimates of the overall city-level sales fractions from the parameter estimates (1 / (1+sum(β)), β(2) / (1+sum(β)), β(3) / (1+sum(β)), etc.).
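In code, that back-out step might look something like this; the β draws and names are made up for illustration, assuming posterior draws for the non-reference cities stacked into an array.

import numpy as np

# beta_draws: posterior draws of the non-reference cities' β, shape (n_draws, Ncities - 1).
beta_draws = np.array([[2.0, 0.5],
                       [1.8, 0.6]])                      # made-up example draws
denom = 1.0 + beta_draws.sum(axis=1)                     # 1 + sum(β), per draw
fractions = np.column_stack([1.0 / denom,                # reference city's share
                             beta_draws / denom[:, None]])  # other cities' shares
print(fractions.mean(axis=0))                            # per-city sales fractions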
And I’d probably make the noise term multiplicative and non-negative, instead of additive, to prevent the sampler from landing on a negative sales figure, which is presumably nonsensical in this context.
Apologies if I’m rambling at you about something you already know about, or if I’ve focused so much on one specific version of the toy example that this is basically useless. Hopefully this is of some interest...
And I’d probably make the noise term multiplicative and non-negative, instead of additive, to prevent the sampler from landing on a negative sales figure, which is presumably nonsensical in this context.
I know JAGS lets you put interval limits onto terms, which lets you specify that some variable must be non-negative (looks something like dist(x,y)[0,∞]), so maybe Stan has something similar.
It does. However... I see now I could’ve described the model better. In Stan I don’t think you can literally write the observed data as the sum of the signal and the noise; I think the data always has to be incorporated into the model as something sampled from a probability distribution, so you’d actually translate the simplest additive model into Stan-speak as something like
data {
  int<lower=1> N;
  int<lower=1> Ncities;
  int<lower=1> Nwidgets;
  int<lower=1> city[N];
  int<lower=1> widget[N];
  real<lower=0> sales[N];
}
parameters {
  real<lower=0> alpha;
  real beta[Ncities];
  real gamma[Nwidgets];
  real<lower=0> sigma;
}
model {
  // put code here to define explicit prior distributions for the parameters
  for (n in 1:N) {
    // the tilde means the left side's sampled from the right side
    sales[n] ~ normal(alpha + beta[city[n]] + gamma[widget[n]], sigma);
  }
}
which could give you a headache because a normal distribution puts nonzero probability density on negative sales values, so the sampler might occasionally try to give sales[n] a negative value. When this happens, Stan notices that’s inconsistent with sales[n]’s zero lower bound, and generates a warning message. (The quality of the sampling probably gets hurt too, I’d guess.)
And I don’t know a way to tell Stan, “ah, the normal error has to be non-negative”, since the error isn’t explicitly broken out into a separate term on which one can set bounds; the error’s folded into the procedure of sampling from a normal distribution.
The way to avoid this that clicks most with me is to bake the non-negativity into the model’s heart by sampling sales[n] from a distribution with non-negative support:
for (n in 1:N) {
  sales[n] ~ lognormal(log(alpha * beta[city[n]] * gamma[widget[n]]), sigma);
}
Of course, bearing in mind the last time I indulged my lognormal fetish, this is likely to have trouble too, for the different reason that a lognormal excludes the possibility of exactly zero sales, and you’d want to either zero-inflate the model or add a fixed nonzero offset to sales before putting the data into Stan. But a lognormal does eliminate the problem of sampling negative values for sales[n], and aligns nicely with multiplicative city & widget effects.
Thanks to both you and gwern. It doesn’t look like this is the direction I’m going in for this problem, but it’s something I’m glad to know about.