I don’t understand how you get more than two dimensions out of data points that are either 0 or 1 (unless perhaps the votes were accompanied by data on age, sex, politics?) and anyway what I usually think of as ‘dimension’ is just the number of entries in each data point, which is fixed. It seems to me that this is perhaps a term of art which your friend is using in a specific way without explaining that it’s jargon.
However, on further thought I think I can bridge the gap. If I understand your explanation correctly, your friend is looking for the minimum set of variables that explains the distribution. I think this has to mean that there is more data than yes-or-no; suppose there is also age and gender, and everyone over 30 votes yes while everyone 30 or under votes no. Then the dimensionality could have been two, with some combination of age and gender required to predict the vote; but in fact age predicts it perfectly and you can just throw out gender, so the actual dimensionality is one.
So what we are looking for is the number of parameters in the model that explains the data, as opposed to the number of observables in the data. In physics, however, we generally have a fairly specific model in mind before gathering the data. Let me first give a trivial example: Suppose you have some data that you believe is generated by a Gaussian distribution with mean 0, but you don’t know the sigma. Then you do the following: Assume some particular sigma, and for each event, calculate the probability of seeing that event. Multiply the probabilities. (In fact, for practical purposes we take the log-probability and add, avoiding some numerical issues on computers, but obviously this is isomorphic.) Now scan sigma and see which value maximises the probability of your observations; that’s your estimate for sigma, with errors given by the values at which the log-probability drops by 0.5 from its maximum. (It’s a bit involved to derive, but basically this corresponds to the frequentist 68%, one-sigma confidence limits, assuming the log-probability function is symmetric around the maximum.)
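For concreteness, here is a minimal sketch of that scan in Python. The data, grid range, and true sigma are all made-up toy values, not from any real analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
events = rng.normal(0.0, 1.3, size=1000)   # toy "measurements"; true sigma is 1.3

def log_likelihood(sigma, x):
    # Sum of per-event log-probabilities under a zero-mean Gaussian.
    return np.sum(-0.5 * (x / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi)))

sigmas = np.linspace(1.0, 1.6, 601)                      # scan the parameter
lnL = np.array([log_likelihood(s, events) for s in sigmas])

best = sigmas[np.argmax(lnL)]                            # maximum-likelihood estimate
band = sigmas[lnL >= lnL.max() - 0.5]                    # values where lnL has dropped by less than 0.5
print(f"sigma = {best:.3f}  (+{band.max() - best:.3f} / -{best - band.min():.3f})")
```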
Now, the LessWrong-trained eye can, presumably, immediately see the underlying Bayes-structure here. We are finding the set of parameters that maximises the likelihood of our data—equivalently, with flat priors, the posterior probability of the parameters. In my toy example you can just scan the parameter space, point by point. For realistic models with, say, forty parameters—as was the case in my thesis—you have to be a bit more clever and use some sort of search algorithm that doesn’t rely on brute force. (With forty parameters, even if you take only 10 points in each, you instantly have 10^40 points to evaluate—that is, at each point you calculate the probability for, say, half a million events with what may be quite a computationally expensive function. Not practical.)
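In practice that means handing the negative log-likelihood to a numerical minimiser instead of scanning a grid. A sketch of that idea, assuming scipy is available and using a two-parameter Gaussian as a stand-in for a realistic many-parameter model:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
events = rng.normal(0.2, 1.3, size=1000)   # toy data again

def negative_log_likelihood(params, x):
    mu, sigma = params
    if sigma <= 0:
        return np.inf                      # keep the minimiser out of unphysical territory
    return -np.sum(-0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi)))

fit = minimize(negative_log_likelihood, x0=[0.0, 1.0], args=(events,), method="Nelder-Mead")
print(fit.x)   # fitted (mu, sigma); the same approach scales to dozens of parameters
```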
The above is what I think of when I say “fitting a distribution”. Now let me try to bring it back into contact with the finding-the-dimensions problem. The difference is that your friend is dealing with a set of variables such that some of them may directly account for others, as in my age/vote toy example. But in the models we fit to physics distributions, not all the parameters are necessarily directly observed in the event. An obvious example is the time resolution of the detector; this is not a property of the event (at least not solely of the event—some events are better measured than others) and anyway you can’t really say that the resolution ‘explains’ the value of the time (and note that decay times are continuous, not multiple-choice as in most survey data). Rather, the observed distribution of the time is generated by the true distribution convolved with the resolution—you have to do a convolution integral. If you measure a high (and therefore unlikely, since we’re dealing with exponential decay) time, it may be that you really have an unusual event, or it may be that you have a common event with a bad resolution that happened to fluctuate up. The point, however, is that there’s no single discrete-valued resolution variable that accounts for a discrete-valued time variable; it’s all continuous distributions, derived quantities, and convolution integrals.
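A numerical sketch of that convolution integral, with a made-up lifetime and resolution (both values are purely illustrative):

```python
import numpy as np

tau, sigma = 1.5, 0.3   # hypothetical lifetime and time resolution, arbitrary units

def true_pdf(t):
    # Exponential decay in the true (unsmeared) time, defined for t >= 0.
    return np.where(t >= 0.0, np.exp(-t / tau) / tau, 0.0)

def resolution(dt):
    # Gaussian smearing of the measured time around the true time.
    return np.exp(-0.5 * (dt / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def observed_pdf(t_measured, n_steps=4000):
    # Convolution integral over the unobserved true decay time.
    s = np.linspace(0.0, 20.0 * tau, n_steps)
    return np.trapz(true_pdf(s) * resolution(t_measured - s), s)

# A "high" measured time could be a genuinely late decay or an early one smeared upwards.
print(observed_pdf(3.0), observed_pdf(0.2))
```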
So, we do not treat our data sets in the way you describe, looking for the true dimensionality. Instead we assume some physics model with a fixed number of parameters and seek the probability-maximising value of those parameters. Obviously this approach has its disadvantages compared to the more data-driven method you describe, but basically this is forced upon us by the shape of the problem. It is common to try several different models, and report the variance as a systematic error.
So, to get back to Lie groups, Weyl integration, and representation theory: None of the above. :)
I definitely agree that the type of analysis I originally had in mind is totally different than what you are describing.
Thinking about distributions without thinking about Lie groups makes my brain hurt, unless the distributions you’re discussing have no symmetries or continuous properties at all—my guess is that they’re there but for your purposes they’re swept under the rug?
But yeah, in essence the “fitting a distribution” I was thinking of is far less constrained, I think—you have no idea a priori what the distribution is, so you first attempt to isolate how many dimensions you need to explain it. In the case of votes, we might look at F_2^N, think of it as embedded into the 0s and 1s of [0,1]^N, and try to find what sort of embedded manifold would have a distribution that looks like that.
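One concrete stand-in for that dimension-finding step (the actual method being described is not specified, so this is only an illustrative choice) is to look at the covariance spectrum of the 0/1 ballots; the vote data below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)
n_voters, n_questions = 500, 10

# Synthetic ballots: two hidden traits drive the answers to ten yes/no questions.
traits = rng.normal(size=(n_voters, 2))
loadings = rng.normal(size=(2, n_questions))
votes = (traits @ loadings + 0.3 * rng.normal(size=(n_voters, n_questions)) > 0).astype(float)

# Embed the 0/1 ballots in [0,1]^N and look at the covariance spectrum.
eigenvalues = np.linalg.eigvalsh(np.cov(votes.T))[::-1]
explained = eigenvalues / eigenvalues.sum()
print(np.round(explained, 3))   # a sharp drop after the first couple of values suggests dimension ~2
```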
Whereas in your case you basically know what your manifold is and what your distribution is like, but you’re looking for the specifics of the map—i.e. the size (and presumably “direction”?) of sigma.
I don’t think “disadvantages” is the right word—these processes are essentially solving for totally unrelated unknowns.
That is entirely possible; all I can tell you is that I’ve never used any such tool for looking at physics data. And I might add that thinking about how to apply Lie groups to these measurements makes my brain hurt. :)
tl;dr: I like talking about math.
Fair enough :)
I just mean… any distribution is really a topological object. If there are symmetries to your space, it’s a group. So all distributions live on a Lie group naturally. I assume you do harmonic analysis at least—that process doesn’t make any sense unless it lives on a Lie group! I think of distributions as essentially being functionals on a Lie group, and finding a fitting distribution is essentially integrating against its top-level differentials (if not technically, then at least morally).
But if all your Lie groups are just vector spaces and the occasional torus (which they might very well be) then there might be no reason for you to even use the word Lie group because you don’t need the theory at all.
You can do harmonic analysis on any locally compact abelian group, see e.g. Pontryagin duality.
“locally compact” implies you have a topology—maybe I should be saying “topological group” rather than “Lie group,” though.
All Lie groups already have a topology. They’re manifolds, after all.
Yes. My original statement was that harmonic analysis is limited to Lie groups. jsteinhardt observed that any locally compact abelian group can have harmonic analysis done on it—some of these (say, p-adic groups) are not Lie groups, since they have no smooth structure, though they are still topological groups.
So I was trying to be less specific by changing my term from Lie group to topological group.
Oh. That makes more sense.
I find this interesting, but I like to apply things to a specific example so I’m sure I understand it. Suppose I give you the following distribution of measurements of two variables (units are GeV, not that I suppose this matters):
mD      deltaM
1.80707 0.148763
1.87494 0.151895
1.86805 0.140318
1.85676 0.143774
1.85299 0.150823
1.87689 0.151625
1.87127 0.14012
1.89415 0.145116
1.87558 0.141176
1.86508 0.14773
1.89724 0.149112
What sort of topological object is this, or how do you go about treating it as one? Presumably you can think of these points in mD-deltaM space as being two-dimensional vectors. N-vectors are a group under addition, and if I understand the definition correctly they are also a Lie group. But I confess I don’t understand how this is important; I’m never going to add together two events, the operation doesn’t make any sense. If a group lives in a forest and never actually uses its operator, does it still associate, close, identify, and invert? (I further observe that although 2-vectors are a group, the second variable in this case can’t go below 0.13957 for kinematic reasons; the subset of actual observations is not going to be closed or invertible.)
I’m not sure what harmonic analysis is; I might know it by another name, or do it all the time and not realise that’s what it’s called. Could you give an example?
My attempts at putting LaTeX notation here didn’t work out, so I hope this is at all readable.
I would not call the data you gave me a distribution. I think of a distribution as being something like a Gaussian: some function f where, if I keep collecting data and take the average sum of powers of that data, it looks like the integral of that function over some topological group.
so: lim_{n -> \infty} sum_{k=1}^n g(x_k, y_k) = \int_{R^2} f(x,y) g(x,y) dx ^ dy, for any function g on R^2
Usually, rather than integrating over R^2, I would be integrating over SU(2) or some other matrix group, meaning the group structure isn’t additive; I’d expect data to be something like traces of matrices. For example, on the appropriate subgroup of GL(2,R)+ those traces should never be below two; that sort of kinematic reason should translate into insight about what group you’re integrating over.
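A small illustration of that last point (the sampling recipe here is an illustrative choice, not from the comment above): traces of SU(2) elements can never leave [-2, 2], which is the group-theoretic version of a kinematic bound.

```python
import numpy as np

rng = np.random.default_rng(3)

def random_su2(n):
    # Haar-random SU(2) elements built from unit quaternions.
    q = rng.normal(size=(n, 4))
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    a, b, c, d = q.T
    m = np.array([[a + 1j * b, c + 1j * d],
                  [-c + 1j * d, a - 1j * b]])   # shape (2, 2, n)
    return m.transpose(2, 0, 1)                 # shape (n, 2, 2)

traces = np.trace(random_su2(100_000), axis1=1, axis2=2).real
print(traces.min(), traces.max())   # always inside [-2, 2]: the "kinematic" constraint of the group
```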
When you say “fitting distributions” I assume you’re looking for the appropriate f(x) (at least, after a fashion) in the above equality; minimizing a variable which should be the difference between the limits in some sense.
I may be a little out of my depth here, though.
Sorry I didn’t mean harmonic analysis, I meant Fourier analysis. I am under the impression that this is everywhere in physics and electrical engineering?
I was a little sloppy in my language; strictly speaking ‘distribution’ does refer to a generating function, not to the generated data.
And yes, exactly—when I say “fitting distributions” I am looking for the appropriate f in your equality, at least after a fashion.
As for Fourier analysis: we certainly do partial waves, but not on absolutely everything. Take a detector resolution with unknown parameters; it can usually be well modelled by a simple Gaussian, and then there are no partial waves, just the two parameters and the exponential.
As for your equation, maybe something got lost in the notation? In the limit of n going to infinity the sum should likewise go to infinity, while the integral may converge. Also it’s not clear to me what the function g is doing. I prefer to think in terms of probabilities: we seek some function f such that, in the limit of infinite data, the fraction of data falling within (x0, x0+epsilon) equals the integral of f over (x0, x0+epsilon), divided by the integral of f over all x. Generalise to multiple dimensions as required; taking the limit epsilon -> 0 is optional.
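A quick numerical check of that fraction-in-an-interval picture, with a unit Gaussian standing in for f (scipy’s normal CDF is used only to do the integral):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
data = rng.normal(0.0, 1.0, size=200_000)

x0, eps = 0.5, 0.2
fraction = np.mean((data > x0) & (data < x0 + eps))   # fraction of the data in (x0, x0 + eps)
integral = norm.cdf(x0 + eps) - norm.cdf(x0)          # integral of f over the window; f is already normalised
print(fraction, integral)                             # agree up to statistical fluctuations
```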
I’m not sure what an average sum of powers is; where do you do this in the formula you gave? Is it encapsulated in the function g? Does it reduce to “just count the events” (as in the fraction-of-events goal above) in some limit?
Yes, there was supposed to be a 1/n in the sum, sorry!
Essentially what the g is doing is taking the place of the interval probabilities; for example, if I think of g as being the characteristic function on an interval (one on that interval and zero elsewhere) then the sum and integral should both be equal to the probability of a point landing in that interval. Then one can approximate all measurable functions by characteristic functions or somesuch to make the equivalence.
In practice (for me) in Fourier analysis you prove this for a basis, such as integer powers of cosine on a closed interval, or simply integer powers on an open interval (these are the moments of a distribution).
Yes; after you add in the 1/n hopefully the “average” part makes sense, and then just take g for a single variable to be x^k and vary over integers k. And as I mentioned above, yes, I believe it does reduce to just “count the events”; it’s just that, if you want to prove things, you need to count using a countable basis of function space rather than looking at intervals.
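A sketch of that moment-matching version of the equality, again with a unit Gaussian standing in for f (the exact moments are the textbook values for the standard Gaussian, not computed from the data):

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(0.0, 1.0, size=200_000)

exact = {1: 0.0, 2: 1.0, 3: 0.0, 4: 3.0}    # known moments of the standard Gaussian
for k in range(1, 5):
    empirical = np.mean(data ** k)          # (1/n) * sum over the data with g(x) = x**k
    print(k, round(empirical, 3), exact[k]) # the averages converge to the integrals of x^k f(x)
```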
It looks to me like we’ve bridged the gap between the approaches. We are doing the same thing, but the physics case is much more specific: we have a generating function in mind and just want to know its parameters, and we look only at the linear average—we don’t vary the powers (*). So we don’t use the tools you mentioned in the comment that started this thread, because they’re adapted to the much more general case.
(*) Edit to add: Actually, on further thought, that’s not entirely true. There are cases where we take moments of distributions and whatnot; a friend of mine who was a PhD student at the same time as me worked on such an analysis. It’s just sufficiently rare (or maybe just rare in my experience!) that it didn’t come to my mind right away.
Okay, so my hypothesis was essentially right: basically all of the things I care about get swept under the rug, because you only care about what I would call the trivial cases.
And it definitely makes sense that, if you’ve already restricted to a specific function and just want its parameters, you really don’t need to deal with higher moments.