To start with, there is some confusion—you say
which isn’t so. You are asking about inferring some property, and I’m asking about the meaning of the words you are using.
However, getting to the meat of the issue, I’d like to make two points.
Point one is the distinction between sample statistics and estimates of the parameters of the underlying process. In our case we have an underlying process (warming; let’s say we define it as the net energy balance of the planet integrated over a suitable interval) which we cannot observe directly, and some data (land and ocean temperatures) which we can.
The data that we have is, in statistical terminology, a sample, and we commonly try to figure out properties of the underlying process by looking at the sample that we have. The thing is, sample statistics are not random. If I have some data (e.g. a time series of temperatures) and I calculate its mean, that mean is not a random variable. Its probability is 1: we observed it, it happened. There is no inference involved in calculating sample means, just straight math. Now, if you want an estimate of the mean of the underlying process, that’s a different issue. It’s going to be an uncertain estimate, and we will have to specify some sort of a model to even produce such an estimate and to talk about how likely it is.
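To make the distinction concrete, here is a minimal sketch in Python. The numbers are made up for illustration, not real temperature data. The sample mean is plain arithmetic on the observed values; the interval estimate for the process mean only exists once we assume a model (here, the naive i.i.d. assumption, which is almost certainly wrong for temperatures and is used only to show that the interval, unlike the sample mean, depends on the model):

```python
import math

# Hypothetical toy series of annual temperature anomalies (illustrative only).
temps = [0.42, 0.54, 0.47, 0.61, 0.58, 0.55, 0.63, 0.60, 0.66, 0.59]

# Sample mean: pure arithmetic on the observed data, no inference involved.
n = len(temps)
sample_mean = sum(temps) / n

# Estimating the mean of the underlying process requires a model.
# Assuming i.i.d. observations (a strong and, for temperatures, dubious
# assumption) gives an approximate 95% confidence interval.
var = sum((x - sample_mean) ** 2 for x in temps) / (n - 1)
se = math.sqrt(var / n)
ci = (sample_mean - 1.96 * se, sample_mean + 1.96 * se)

print(sample_mean)  # the sample mean itself is certain: we observed it
print(ci)           # the interval is only as good as the i.i.d. assumption
```

Change the model (say, allow autocorrelation) and the interval changes; the sample mean does not.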
In this case, when I’m talking about the hiatus as a feature of the data, it’s not a probabilistic statement and there is nothing to infer. But if you want to know whether there is a hiatus in the underlying process of global warming, that’s a different question, and a much more complicated one, too.
Point two is more general and a bit more interesting. It’s common to think in terms of data and models: you have some data and you fit some models to it. You can describe your data without using any models, for example by calculating the sample mean. However, as your description of the data grows more complex, at some point you cross a (fuzzy) line and start to talk about the same data in terms of models, implied or explicit. Where that fuzzy line is located is subject to debate. For example, you put it almost at the end of the spectrum when you say that the only thing we can say about a time series without involving models or inference is that x = f(t), and that’s all. I don’t find that very useful, and my line lies further along. I’m not claiming any kind of precision here, but a full-blown ARIMA representation of a time series I would call a model, and something like an AR(1) coefficient would sit right on the boundary: is it just a straightforward calculation, or are you fitting an autoregressive model to the time series?
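That boundary case can be made concrete. The same formula that gives the sample lag-1 autocorrelation, a plain summary statistic, is also the Yule-Walker estimate of the AR(1) coefficient. A minimal sketch with made-up numbers:

```python
# Toy series, for illustration only.
x = [0.1, 0.3, 0.2, 0.5, 0.4, 0.6, 0.5, 0.7, 0.6, 0.8]

n = len(x)
mean = sum(x) / n

# Numerator: sample lag-1 autocovariance. Denominator: sample sum of squares.
num = sum((x[t] - mean) * (x[t - 1] - mean) for t in range(1, n))
den = sum((v - mean) ** 2 for v in x)

# "Just math" -- or the fitted coefficient of an AR(1) model?
# Both readings describe exactly this number.
phi = num / den

print(phi)
```

Nothing in the arithmetic forces either interpretation; whether `phi` is a descriptive statistic or a model fit is precisely where the fuzzy line sits.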