Good question.
The theorems here should apply directly to that situation. The summary will eventually be lower-dimensional than the data, and that’s all we need for these theorems to apply. At any data size n sufficiently large that the O(log n) summary dimension is smaller than the data dimension, the distribution of those n data points must be exponential family.
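To make the "sufficiently large n" condition concrete, here is a minimal sketch. It assumes a hypothetical summary dimension of the form k0 + c·log(n) and scalar data points; the constants `k0`, `c`, and `point_dim` are made up for illustration and are not from the theorems themselves.

```python
import math
from typing import Optional


def summary_dim(n: int, k0: float = 5.0, c: float = 3.0) -> float:
    """Hypothetical O(log n) summary dimension: k0 + c * ln(n)."""
    return k0 + c * math.log(n)


def first_n_with_smaller_summary(point_dim: int = 1,
                                 k0: float = 5.0,
                                 c: float = 3.0,
                                 n_max: int = 10**6) -> Optional[int]:
    """Smallest n at which the summary is lower-dimensional than the data.

    The data dimension is n * point_dim (n points, each of dimension point_dim).
    Returns None if no such n is found up to n_max.
    """
    for n in range(1, n_max + 1):
        if summary_dim(n, k0, c) < n * point_dim:
            return n
    return None


threshold = first_n_with_smaller_summary()
```

With these made-up constants the threshold is n = 13; past that point the gap only widens, since n grows faster than log(n).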
I see! To reiterate: for a fixed n, having a summary dimension smaller than the data dimension means that the distribution of those n points belongs to an exponential family.
It seems like there are three cases here:
1. The size of the summary remains constant as the number n of samples grows. The distribution of these samples remains constant. Once enough data is collected, you can estimate the summary statistics and infer the overall distribution.
2. The size of the summary grows as O(log(n)). The distribution of these samples changes for each n as the size of the summary grows. You converge on the correct distribution in the limit of infinite n. (This seems weird/incorrect; I might be missing something here. I am trying to think of a concrete example.)
3. Degenerate case: the size of the required “summary” grows as O(n) or faster, so you are better off just using the data itself as the summary.
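As a concrete instance of the first case, here is a minimal sketch assuming the samples are i.i.d. draws from a normal distribution (an assumption chosen for illustration, not something stated above). The summary is three numbers (count, sum, sum of squares) no matter how many points arrive. The class name `GaussianSummary` is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GaussianSummary:
    """Fixed-size summary for i.i.d. normal data: (n, sum x, sum x^2)."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def update(self, x: float) -> None:
        # Each new point only changes the three running totals.
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def mean(self) -> float:
        return self.total / self.n

    def variance(self) -> float:
        # Maximum-likelihood variance estimate (divides by n, not n - 1).
        m = self.mean()
        return self.total_sq / self.n - m * m


summary = GaussianSummary()
for x in [1.0, 2.0, 3.0, 4.0]:
    summary.update(x)
```

After these four points the summary gives a mean of 2.5 and a maximum-likelihood variance of 1.25, and it would still hold exactly three numbers after a million points.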
For the second case, each data point can measure something different; the measurements may be correlated with each other and related in different ways to the parameters we’re trying to estimate. For instance, maybe we’re trying to estimate some parameters of a car, so we measure the wheel sizes, axle length, number of gears, engine cylinder volume, etc. Every now and then we measure something which gives us a totally different “kind” of information from the other things we measured, something which forces a non-exponential-family update. When that happens, we have to add a new summary component. Eventually other data points may measure the same “kind” of information and also contribute to that component of the summary. But over time, it becomes more and more rare to measure something which no other data point has captured before, so we add summary dimensions more and more slowly.
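A toy simulation of this story, as a sketch under a loudly stated assumption: suppose the t-th measurement has probability 1/t of capturing a brand-new “kind” of information. That rule is hypothetical, picked only because it makes new kinds rarer over time; under it, the expected number of summary dimensions after n points is the harmonic number H_n, which grows like log n.

```python
import random
from typing import List


def simulate_summary_growth(n: int, seed: int = 0) -> List[int]:
    """Number of summary dimensions after each of n measurements.

    Assumption (hypothetical): measurement t introduces a new "kind" of
    information, and hence a new summary component, with probability 1/t.
    """
    rng = random.Random(seed)
    dims = 0
    history = []
    for t in range(1, n + 1):
        if rng.random() < 1.0 / t:
            dims += 1  # a new kind of information: add a summary component
        # Otherwise the point feeds an existing component; dims is unchanged.
        history.append(dims)
    return history


def expected_dims(n: int) -> float:
    """Expected summary dimension under the 1/t rule: H_n = sum of 1/t."""
    return sum(1.0 / t for t in range(1, n + 1))


history = simulate_summary_growth(10_000)
```

Comparing `history[-1]` against `expected_dims(10_000)` (roughly 9.8) for a few seeds shows the summary staying tiny relative to the 10,000 data points. Swapping 1/t for another decaying probability would give a different sub-linear growth rate with the same qualitative picture.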
(Side note: there’s no reason I know of why O(log n) growth would be special here; the qualitative story would be similar for any sub-linear summary growth.)