Imagine we had a billion data points which were so noisy that they were the equivalent of a thousand pretty clear data points. Then a parametrized model should have about a thousand parameters, and a non-parametric approach should average about a million noisy neighbors. If we have the equivalent of about ten clear data points, we should have a model with ten free parameters or we should average a hundred million noisy neighbors. The parametrized approach then seems to have clear advantages in terms of how much you need to communicate and remember to apply the approach, and in how easy it is for others to verify that you are applying the approach well. Remember that much of human morality is about social norms: we check whether others are following them, and reward or punish them accordingly. I suspect that these communication advantages are why academics focus on parametrized models. On accuracy, are there good canonical sources showing that non-param tends to beat param holding other things constant? Also, it isn’t clear to me that non-param approaches don’t have just as much trouble with error-prone application as param approaches.
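The arithmetic in that comment can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the noise figure assumes independent, equal-variance noise, which the comment does not state.

```python
import math


def neighbors_per_clear_point(n_noisy: int, n_clear: int) -> int:
    """How many noisy points must be averaged to stand in for one clear point."""
    return n_noisy // n_clear


def noise_shrink_factor(k: int) -> float:
    """Assumption: with independent, equal-variance noise, averaging k points
    divides the noise standard deviation by sqrt(k)."""
    return math.sqrt(k)


for n_clear in (1000, 10):
    k = neighbors_per_clear_point(10**9, n_clear)
    print(f"{n_clear} clear points: average {k} neighbors, "
          f"noise sd shrinks by a factor of {noise_shrink_factor(k):.0f}")
```

Under that independence assumption, the thousand-parameter case corresponds to averaging a million neighbors each, and the ten-parameter case to averaging a hundred million.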
On accuracy, are there good canonical sources showing that non-param tends to beat param holding other things constant?
Definitely not. Non-param is something you do in a particular sort of situation. Lots of data, true generator hard to model, lots of neighborhood structure in the data, a la the Netflix Prize? Definitely time to try non-parametric. Twenty data points that look like mostly a straight line? I’d say use a straight line.
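To make the twenty-point case concrete, here is a rough sketch comparing a least-squares straight line with a nearest-neighbor average by leave-one-out error. The data are invented for illustration (a line plus a fixed alternating offset standing in for noise), and the choice of four neighbors is arbitrary; neither comes from the discussion above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def line_predict(train_xs, train_ys, x0):
    """Parametric prediction: fit two parameters, then evaluate the line."""
    a, b = fit_line(train_xs, train_ys)
    return a + b * x0


def knn_predict(train_xs, train_ys, x0, k):
    """Nonparametric prediction: average y over the k nearest training points."""
    nearest = sorted(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x0))[:k]
    return sum(train_ys[i] for i in nearest) / k


def loo_mse(xs, ys, predict):
    """Leave-one-out mean squared error of predict(train_xs, train_ys, x0)."""
    total = 0.0
    for i in range(len(xs)):
        train_xs = xs[:i] + xs[i + 1:]
        train_ys = ys[:i] + ys[i + 1:]
        total += (predict(train_xs, train_ys, xs[i]) - ys[i]) ** 2
    return total / len(xs)


# Hypothetical toy data: twenty points that look mostly like a straight line.
xs = [float(i) for i in range(20)]
ys = [1.0 + 2.0 * x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]

line_error = loo_mse(xs, ys, line_predict)
knn_error = loo_mse(xs, ys, lambda tx, ty, x0: knn_predict(tx, ty, x0, k=4))
print(f"straight line LOO MSE:      {line_error:.3f}")
print(f"4-nearest-neighbor LOO MSE: {knn_error:.3f}")
```

On this toy data the straight line has the lower held-out error, mostly because neighbor averaging is badly biased at the two ends of the range. That says nothing about data with a lot of local structure, where the comparison can easily go the other way.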
A parameterized model with a thousand generic parameters, one that isn’t supposed to represent the true underlying generator or a complicated causal model, but rather to fit an arbitrary curve directly to data that we think is regular yet complicated, would I think be considered “nonparametric statistics” as such things go. Splines with lots of joints (knots) are technically parameterized curves, but they are (I think) considered a nonparametric statistical method.
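A simpler stand-in for the many-jointed spline is a piecewise-constant fit with one mean per bin. It is technically parameterized, with one number per bin, yet it assumes almost nothing about the shape of the curve. The sketch below is an illustration only: the function names, the bin count, and the sine-wave data are all assumptions, not anything from the comment.

```python
import math


def bin_index(x, lo, hi, n_bins):
    """Index of the equal-width bin on [lo, hi] containing x (clamped at the top)."""
    width = (hi - lo) / n_bins
    return min(int((x - lo) / width), n_bins - 1)


def fit_binned_means(xs, ys, lo, hi, n_bins):
    """Return one generic parameter per bin: the mean of y in that bin.
    Bins with no data get None."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = bin_index(x, lo, hi, n_bins)
        sums[b] += y
        counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


def predict_binned(params, lo, hi, x):
    """Look up the fitted mean for the bin containing x."""
    return params[bin_index(x, lo, hi, len(params))]


# Hypothetical data: a regular yet non-linear curve sampled at 1000 points.
xs = [i / 1000 for i in range(1000)]
ys = [math.sin(2 * math.pi * x) for x in xs]

params = fit_binned_means(xs, ys, 0.0, 1.0, n_bins=50)
print(len(params), "parameters; prediction at x=0.25:",
      round(predict_binned(params, 0.0, 1.0, 0.25), 3))
```

None of the fifty parameters means anything on its own; together they just trace the curve, which is the sense in which such a model behaves nonparametrically despite having parameters.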
The most important canonical rule of modern machine learning is that we don’t have good, simple, highly accurate canonical rules for predicting in advance what sort of algorithm will work best on the data. And so instead you’ve got to try different things, and accumulate experience with complicated, error-prone, even non-verbalizable rules about which sorts of algorithms to try. In other words, a machine learning expert would have both parametric and nonparametric algorithms for determining whether to use parametric or nonparametric algorithms...
Actually, in social science problems in high-dimensional spaces it is rather common to have parametrized models with hundreds or more parameters, especially when one has hundreds of thousands or more data points. For example, one often uses “fixed effects” for individual times or spatial regions, and matrices of interaction terms between basic effects. Folks definitely use param stat methods for such things.
Sure, lots of locally parametric statistics isn’t the same thing as having so many global parameters as to make few assumptions about the shape of the curve. Still, I think this is where we both nod and agree that there’s no absolute border between “parametric” and “nonparametric”?
Well, there are clearly many ways to define that distinction. But regarding the costs of communicating and checking, the issue is whether one communicates the model or the data set plus a metric. Academics usually prefer to communicate a model, and I’m guessing that given their purposes this is probably usually best.
Sure. Though I note that if you’re already communicating a regional map with thousands of locally-fit parameters, you’re already sending a file, and at that point it’s pretty much as easy to send 10MB as 10KB, these days. But there are all sorts of other reasons why parametric models are more useful for things like rendering causal predictions, relating to other knowledge and other results, and so on. I’m not objecting to that, per se, although in some cases it provides a motive to oversimplify and draw lines through graphs that don’t look like lines...
...but I’m not sure that’s relevant to the original point. From my perspective, the key question is to what degree a statistical method assumes that the underlying generator is simple, versus not imposing much of its own assumptions about the shape of the curve.