Repeating my post from the last open thread, for better visibility:
I want to study probability and statistics in a deeper way than the Probability and Statistics course I had to take at university. The problem is, my mathematical education isn’t very good (on the level of Calculus 101). I’m not afraid of math, but so far all the books I could find are either about pure application, with barely any explanations, or they start with a lot of assumptions about my knowledge and introduce reams of unfamiliar notation.
I want a deeper understanding of the basic concepts. Like, mean is an indicator of the central tendency of a sample. Intuitively, it makes sense. But why this particular formula of sum/n? You can apply all kinds of mathematical stuff to the sample. And it’s even worse with variance...
Any ideas how to proceed?
I too spent a few years with a similar desire to understand probability and statistics at a deeper level, but we might have been stuck on different things. Here’s an explanation:
Suppose you have 37 numbers. Purchase a massless ruler and 37 identical weights. For each of your numbers, find the number on the ruler and glue a weight there. You now have a massless ruler with 37 weights glued onto it.
Now try to balance the ruler sideways on a spike sticking out of the ground. The mean of your numbers will be the point on the ruler where it balances.
Now spin the ruler on the spike. It’s easy to speed up or slow down the spinning ruler if the weights are close together, but more force is required if the weights are far apart. The variance of your numbers is proportional to the amount the ruler resists changes to its angular velocity—how hard you have to twist the ruler to make it spin, or to make it stop spinning.
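If it helps, here is a minimal numerical sketch of the analogy (the numbers are made up): with unit weights, the balance point is the mean, and the moment of inertia about that point is n times the population variance, which is the sense in which the two are proportional.

```python
# Ruler analogy with unit weights glued at each value (illustrative numbers only).
xs = [2.0, 3.0, 5.0, 5.0, 9.0]
n = len(xs)

mean = sum(xs) / n                                    # the balance point
moment_of_inertia = sum((x - mean) ** 2 for x in xs)  # resistance to spinning, unit masses
variance = moment_of_inertia / n                      # population variance

print(mean, variance, moment_of_inertia)              # 4.8  5.76  28.8
```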
“I’d like to understand this more deeply” is a thought that occurs to people at many levels of study, so this explanation could be too high or low. Where did my comment hit?
Moments of mass in physics are a good intro to moments in stats for people who like to visualize or “feel out” concepts concretely. Good post!
A different level explanation, which may or may not be helpful:
Read up on affine space, convex combinations, and maybe this article about torsors.
If you are frustrated with hand waving in calculus, read a Real Analysis textbook. The magic words which explain how the heck you can have probability distributions over real numbers are “measure theory”.
How does that answer the question?
It’s true that the center of gravity is a mean, but the moment of inertia is not a variance. It’s one thing to say something is “proportional to a variance” to mean that the constant is 2 or pi, but when the constant is the number of points, I think it’s missing the statistical point.
But the bigger problem is that these are not statistical examples! Means and sums of squares occur in many places, but why are they a good choice for the central tendency and the tendency to be central? Are you suggesting that we think of a random variable as a physical rod? Why? Does trying to spin it have any probabilistic or statistical meaning?
I wasn’t aiming to answer Locaha’s question as much as figure out what question to answer. The range of math knowledge here is high, and I don’t know where Locaha stands. I mean,
But why [is the mean calculated as] sum/n?
That could be a basic question about the meaning of averages—the sort of knowledge I internalized so deeply that I have trouble forming it into words.
But maybe Locaha’s asking a question like:
Why is an unbiased estimator of population mean a sum/n, but an unbiased estimator of population variance a sum/(n-1)?
That’s a less philosophical question. So if Locaha says “means are like the centers of mass! I never understood that intuition until now!”, I’ll have a different follow up than if Locaha says “Yes, captain obvious, of course means are like centers of mass. I’m asking about XYZ”.
Mean and variance are closely related to center of mass and moment of inertia. This is good intuition to have, and it’s statistical. The only difference is that the first two are moments of a probability distribution, and the second two are moments of a mass distribution.
Using the word “distribution” doesn’t make it statistical.
Telegraph to a younger me:
If you are frustrated with explanations in calculus, read a Real Analysis textbook. And the magic words that explain how the heck you can have probability distributions over real numbers are “measure theory”.
When you have thousands of different pieces of data, to grasp them mentally you need to replace them with some simplification. For example, instead of a thousand different weights you could imagine a thousand identical weights, such that the new set is somehow the same as the original set; and then you would focus on the individual weight from the new set.
What precisely does “somehow the same as the original set” mean? Well, it depends on what the numbers from the original set do; how exactly they join together.
For example, if we speak about weights, the natural way of “joining together” is to add their weights. Thus the new set of identical weights is equivalent to the original set if the sum of the new set is the same as the sum of the old set. The sum of the new set = number of pieces × weight of one piece. Therefore the weight of a piece in the new set is the sum of the pieces in the original set divided by their number: the “sum/n”.
Specifically, if addition is the natural thing to do, the set 3, 4, 8 is equivalent to 5, 5, 5, because 3 + 4 + 8 = 5 + 5 + 5. Saying that “5 is the mean of the original set” means “the original set behaves (with regard to the natural thing to do, i.e. addition) as if it were composed of 5s”.
There are situations where some other operation is the natural thing to do. Sometimes it is multiplication. For example, if you multiply some original value by 2, and then you multiply it by 8, the result of these two operations is the same as if you had multiplied it by 4 twice. The corresponding average is called the geometric mean, and it is the n-th root of the product.
It can be even more complicated, so it doesn’t necessarily have a name, but the idea is always to replace the original set with a set of identical values such that, in the original context, they would behave the same way. For example, the example above could be described as 100% growth (multiplication by 2) followed by 700% growth (multiplication by 8), and the equivalent constant growth is 300% (multiplication by 4); in that case the formula would be “root of (product of (Xi + 100%)) − 100%” (see the sketch below).
If there is no meaningful operation on the original set, but the set can be ordered, we can pick the median. If the set can’t even be ordered, because the values are discrete categories, we can pick the most frequent value (the mode) as the best approximation of the original set.
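A small Python sketch of the growth-rate example above (using only the numbers from the example): the geometric mean of the growth factors 2 and 8 is 4, i.e. an equivalent 300% growth per period.

```python
# Geometric mean of growth factors (illustrative numbers from the example above).
import math

factors = [2.0, 8.0]                                   # multiply by 2, then by 8
geo_mean = math.prod(factors) ** (1 / len(factors))    # n-th root of the product
print(geo_mean)                                        # 4.0 -> two periods of x4 are equivalent

rates = [1.0, 7.0]                                     # the same thing stated as growth rates
equiv = math.prod(1 + r for r in rates) ** (1 / len(rates)) - 1
print(equiv)                                           # 3.0 -> "root of (product of (Xi + 100%)) - 100%"
```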
I don’t think that’s really what means are. That intuition might fit the median better. One reason means are nice is that they have really nice properties, e.g. they’re linear under addition of random variables. That makes them particularly easy to compute with and/or prove theorems about. Another reason means are nice is related to betting and the interpretation of a mean as an expected value; the theorem justifying this interpretation is the law of large numbers.
Nevertheless in many situations the mean of a random variable is a very bad description of it (e.g. mean income is a terrible description of the income distribution and median would be much more appropriate).
Edit: On the other hand, here’s one very undesirable property of means: they’re not “covariant under increasing changes of coordinates,” which medians are. What I mean is the following: suppose you decide to compute the mean population of all cities in the US, but later decide this is a bad idea because there are some really big cities. If you suspect that city populations grow multiplicatively rather than additively (e.g. the presence of good thing X causes a city to be 1.2x bigger than it otherwise would be, as opposed to 200 people bigger), you might decide that instead of looking at population you should look at log population. But the mean of log population is not the log of mean population!
On the other hand, because log is an increasing function, the median of log population is still the log of median population. So taking medians is in some sense insensitive to these sorts of decisions, which is nice.
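A quick numerical check of this point, with made-up “city populations” (the values are invented purely for illustration):

```python
import math
import statistics

pops = [1_000, 5_000, 20_000, 80_000, 4_000_000]        # made-up populations

print(math.log(statistics.mean(pops)))                  # log of mean   ~ 13.6
print(statistics.mean(math.log(p) for p in pops))       # mean of log   ~ 10.4  (different!)

print(math.log(statistics.median(pops)))                # log of median ~ 9.9
print(statistics.median(math.log(p) for p in pops))     # median of log ~ 9.9   (the same)
```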
I asked a similar question a while back, and I was directed to this book, which I found to be incredibly useful. It is written at an elementary level, with very little maths, yet is still technical, and brings across so many central ideas in very clear, Bayesian terms. It is also on Lukeprog’s CSA book recommendations for ‘Become Smart Quickly’.
Note: this is the only probability textbook I have read. I’ve glanced through the openings of others, and they’ve tended to be above my level. I am sixteen.
As a first step, I suggest Dennis Lindley’s Understanding Uncertainty. It’s written for the layperson, so there’s not much in the way of mathematical detail, but it is very good for clarifying the basic concepts, and covers some surprisingly sophisticated topics.
ETA: Ah, I didn’t notice that Benito had already recommended this book. Well, consider this a second opinion then.
Read Edwin Jaynes.
The problem with most Probability and Statistics courses is the axiomatic approach. Purely formalism. Here are the rules—you can play by them if you want to.
Jaynes was such a revelation for me, because he starts with something you want, not arbitrary rules and conventions. He builds probability theory on basic desiredata of reason that make sense to you. He had reasons for my “whys?”.
Also, standard statistics classes always seemed a bit perverse to me—logically backward. They always just felt wrong. Jaynes’s approach replaced that tortured backward thinking with clear, straight lines going forward. You’re always asking the same basic question: “What is the probability of A given that I know B?”
And he also had the best notation. Even if I’m not going to do any math, I’ll often formulate a problem using his notation to clarify my thinking.
I think this is a most awesome mistype of desiderata.
Here, have a book!
http://www-biba.inrialpes.fr/Jaynes/prob.html
Actually, I started reading that one and found it too hard.
Is this a good book to start with? I know it’s the standard “Bayes” intro around here, but is it good for someone with, let’s say, zero formal probability/statistics training?
I was under the impression that “this is definitely not a book for beginners” was the standard consensus here: I seem to recall seeing some heavily-upvoted comments saying that you should be approximately at the level of a math/stats graduate student before reading it. I couldn’t find them with a quick search, but here’s one comment that explicitly recommends another book over it.
I think it’s even better if you’re not familiar with frequentist statistics because you won’t have to unlearn it first, but I know many people here disagree.
I suppose it’s better to never have suffered through frequentist statistics at all, but I think you appreciate the right way a lot more after you’ve had to suffer through the wrong way for a while.
Well, Jaynes does point out how bad frequentism is as often as he can get away with. I guess the main thing you’re missing out on, if you weren’t previously familiar with it, is knowing whether he’s attacking a strawman.
I agree, that’s why I’m glad I learned Bayes first. Makes you appreciate the good stuff more.
Did you misread the comment you’re replying to, are you sarcastic, or am I missing something?
The mean of the sum of two random variables is the sum of the means (ditto with the variances); there’s no similarly simple formula for the median. (See ChristianKl’s comment for why you’d care about the sum.)
The mean is the value of x that minimizes SUM_i (x - x_i)^2; if you have to approximate all elements in your sample with the same value and the cost of an imperfect approximation is the squared distance from the exact value (and any smooth function looks like the square when you’re sufficiently close to the minimum), then you should use the mean.
The mean and variance are jointly sufficient statistics for the normal distribution.
Possibly something else which doesn’t come to my mind at the moment.
(Of course, all this means that if you’re more likely to multiply things together than add them, the badness of an approximation depends on the ratio between it and the true value rather than the difference, and things are distributed log-normally, you should use the geometric mean instead. Or just take the log of everything.)
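A toy numerical check of the two properties above (made-up samples; the scan over candidate values is just a brute-force way to find the minimizer):

```python
import random
import statistics

xs = [1.0, 2.0, 2.0, 3.0, 10.0]

# The mean minimizes SUM_i (x - x_i)^2: brute-force scan over candidate values of x.
cost = lambda x: sum((x - xi) ** 2 for xi in xs)
best = min((i / 100 for i in range(1100)), key=cost)
print(best, statistics.mean(xs))                        # both 3.6

# Means are linear under addition of random variables; medians in general are not.
a = [random.gauss(0, 1) for _ in range(10_000)]
b = [random.expovariate(1) for _ in range(10_000)]
sums = [x + y for x, y in zip(a, b)]
print(statistics.mean(sums), statistics.mean(a) + statistics.mean(b))        # nearly equal
print(statistics.median(sums), statistics.median(a) + statistics.median(b))  # generally not equal
```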
This isn’t at introductory level, but try exploring the ideas around Fisher information—it basically ties together information theory and some important statistical concepts.
Fisher Information is hugely important in that it lets you go from just treating a family of distributions as a collection of things to treating them as a space with its own meaningful geometry. The wikipedia page doesn’t really convey it but this write-up by Roger Grosse does. This has been known for decades but the inferential distance to what folks like Amari and Barndorff-Nielsen write is vast.
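This goes beyond what the comment claims, but as a toy illustration of what Fisher information buys you: for a Bernoulli(p) coin, I(p) = 1/(p(1-p)), and the variance of the sample proportion over n flips equals 1/(n * I(p)) (the Cramér–Rao bound, attained here). A quick simulation, with assumed values of p and n:

```python
# Toy check: Fisher information for a Bernoulli(p) coin (assumed p and n).
import random
import statistics

p, n, trials = 0.3, 1_000, 2_000
fisher_info = 1 / (p * (1 - p))                         # analytic I(p)

estimates = [
    sum(random.random() < p for _ in range(n)) / n      # sample proportion = MLE of p
    for _ in range(trials)
]
print(statistics.variance(estimates))                   # ~ 0.00021 (simulated)
print(1 / (n * fisher_info))                            # 0.00021 = p(1-p)/n
```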
Attending a CFAR workshop and session on Bayes (the ‘advanced’ session) helped me understand a lot of things in an intuitive way. Reading some online stuff to get intuitions about how Bayes’ theorem and probability mass work was helpful too. I took an advanced stats course right after doing these things, and ended up learning all the math correctly, and it solidified my intuitions in a really nice way. (Other students didn’t seem to have as good a time without those intuitions.) So that might be a good order to do things in.
Some multidimensional calc might be helpful, but other than that, I think you don’t need too much other math to support learning more probability and stats.
Not really—but I do agree that it’s absolutely vital to understand the basic concepts or terms. I think that’s a major reason why people fail to learn—they just don’t really grasp the most vital concepts. That’s especially true of fields with lots of technical terms. If you don’t understand the terms you’ll struggle to follow even basic lines of reasoning.
For this reason I sometimes provide students with a list of central terms, together with comprehensive explanations of what they mean, when I teach.
I don’t have a good resource for you—I’ve had too much math education to pin down exactly where I picked up this kind of logic. I’d recommend set theory in general for getting an understanding of how math works and how to talk and read precisely in mathematics.
For your specific question about the mean, it’s the only number such that the sum of all (sample - mean) deviations equals zero. Go ahead and play with the algebra to show it to yourself. What this means is that if you measure from anything other than the mean, the deviations won’t balance: you’ll be more positive of the numbers in the sample than negative, or more negative than positive.
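A two-line check of that claim on arbitrary numbers:

```python
xs = [3.0, 4.0, 8.0, 1.0]
mean = sum(xs) / len(xs)           # 4.0

print(sum(x - mean for x in xs))   # 0.0: deviations from the mean cancel exactly
print(sum(x - 3.5 for x in xs))    # 2.0: deviations from any other value do not
```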
Can you recommend a place to start learning about set theory?
http://intelligence.org/courses/ has information on set theory. I also enjoyed reading Bertrand Russell’s “Principia Mathematica”, but haven’t evaluated it as a source for learning set theory.