This post has a lot of misconceptions in it.

Let’s start with the application of the central limit theorem to champagne drinkers. First, there’s the distinction between “liver weights are normally distributed” and “mean of a sample of liver weights is normally distributed”. The latter is much better-justified, since we compute the mean by adding a bunch of (presumably independent) random variables together. And the latter is usually what we actually use in basic analysis of experimental data—e.g. to decide whether there’s a significant difference between the champagne-drinking group and the non-champagne-drinking group. That does not require that liver weights themselves be normally distributed.
That said, the CLT does provide reason to believe that something like liver weight would be normally distributed, but the OP omits a key piece of that argument: linear approximation. You do mention this briefly:
Most processes don’t “accumulate” but rather contribute in weird ways to yield their result. Growth rate of strawberries is f(lumens, water) but if you assume you can approximate f as lumens*a + water*b you’ll get some really weird situation where your strawberries die in a very damp cellar or wither away in a desert.
… but that’s not quite the whole argument, so let’s go through it properly. The argument for normality is that f is approximately linear over the range of typical variation of its inputs. So, if (in some strange units) lumens vary between 2 and 4, and water varies between 0.3 and 0.5, then we’re interested in whether f is approximately linear within that range. Extend this argument to more variables, apply the CLT (subject to conditions), and we get a normal distribution. What happens in a damp cellar or a desert is not relevant unless those situations are within the typical range of variation of our inputs (e.g. within some particular dataset).
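To make that concrete, here is a minimal simulation sketch (the growth function, input ranges, and all numbers are invented for illustration, not taken from the post): many inputs, each varying only a little around its baseline, feed a globally nonlinear function, and the output still comes out looking approximately normal.

```python
# Toy sketch: a globally nonlinear "growth" function whose inputs each vary
# over a narrow range. Over that range the function is close to linear in each
# input, so the CLT-style argument applies and the output is roughly normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_samples, n_inputs = 100_000, 30

# Each input (light, water, nutrients, ...) varies by about ±10% around 1.
x = rng.uniform(0.9, 1.1, size=(n_samples, n_inputs))

# Saturating per-input response, combined multiplicatively (very non-linear
# far from x = 1, nearly linear on the narrow typical range).
per_input = 1.0 + 0.2 * np.tanh(3.0 * (x - 1.0))
growth = per_input.prod(axis=1)

# Standardize and compare to a normal: skew and excess kurtosis near zero,
# central 95% interval near [-1.96, 1.96].
z = (growth - growth.mean()) / growth.std()
print("skew:", stats.skew(z), "excess kurtosis:", stats.kurtosis(z))
print("2.5%/97.5% quantiles:", np.quantile(z, [0.025, 0.975]))
```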
(The OP also complains that “We can’t determine the interval for which most processes will yield values”. This is not necessarily a problem; there’s like a gazillion versions of the CLT, and not all of them depend on bounding possible values. The generalized CLT even covers heavy-tailed cases like the Cauchy distribution, where the variance is infinite; the limit there is a stable distribution rather than a normal one.)
Now, a better argument against the CLT is this one:
Most processes in the real world, especially processes that contribute to the same outcome, are interconnected.
Even here, we can apply a linearity → normality argument as long as the errors are small relative to curvature. We model something like a metabolic network as a steady-state x with some noise ϵ: x_i = f_i(x, ϵ_i). For small ϵ, we linearize and find that Δx ≈ (I − ∂f/∂x)⁻¹ (∂f/∂ϵ) ϵ, where I is an identity matrix and the partials are matrices of partial derivatives. Note that this whole thing is linear in ϵ, so just like before, we can apply the CLT (subject to conditions), and find that the distributions of each x_i are roughly normal.
Takeaway: in practice, the normal approximation via CLT is really about noise being small relative to function curvature. It’s mainly a linear approximation over the typical range of the noise.
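Here’s a quick simulation sketch of that argument (the network, noise scales, and all numbers are my own toy choices): each node’s noise is itself a sum of many small, skewed micro-shocks, the nodes are coupled through a nonlinear steady-state equation, and the solved components still come out close to normal.

```python
# Toy interconnected system: solve the steady state x = tanh(A x) + eps by
# fixed-point iteration for many draws of small noise, then check that each
# component of x is distributed roughly normally despite the coupling.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d, n_draws, n_micro = 6, 20_000, 30

# Weak random coupling between nodes, so the fixed point exists and is stable.
A = 0.4 * rng.standard_normal((d, d)) / np.sqrt(d)

# Each node's noise is a sum of many small, skewed micro-shocks (centered
# exponentials) -- the "apply the CLT as before" part of the argument.
eps = 0.02 * (rng.exponential(size=(n_micro, d, n_draws)) - 1.0).sum(axis=0)

# Fixed-point iteration, vectorized over all noise draws (columns of x).
x = np.zeros((d, n_draws))
for _ in range(200):
    x = np.tanh(A @ x) + eps

# Skew and excess kurtosis should be small for every component (not exactly
# zero: the micro-shock sums are themselves only approximately normal).
for i in range(d):
    z = (x[i] - x[i].mean()) / x[i].std()
    print(f"x_{i}: skew={stats.skew(z):+.3f}, excess kurtosis={stats.kurtosis(z):+.3f}")
```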
Next up, triangles.
“The left-hand side is Pythagoras’ formula, the right-hand side is this artifact which is kind of useful, but there’s no property of our mathematics that exactly defines what a slightly-off right-angled triangle is or tells us it should fit this rule.”
There absolutely is a property of mathematics that tells us what a slightly-off right-angled triangle is: it’s a triangle which satisfies Pythagoras’ formula, to within some uncertainty. This is not tautological; it makes falsifiable predictions about the real world when two triangles share the same right-ish corner. For instance, I could grab a piece of printer paper and draw two different diagonal lines between the left edge and the bottom edge, defining two almost-right triangles which share their corner (the corner of the paper). Now I measure the sides of one of those two triangles very precisely, and find that they satisfy Pythagoras’ rule to within high precision—therefore the corner is very close to a right angle. Based on that, I predict that lower-precision measurements of the sides of the other triangle will also be within uncertainty of satisfying Pythagoras’ rule.
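To put numbers on that (all measurements below are invented for illustration): infer the shared corner’s angle from the precisely-measured triangle, then predict the third side of the second triangle and check a coarser measurement against it.

```python
# Worked example of the two-triangle prediction, with made-up measurements.
import math

def corner_angle_deg(a, b, c):
    """Angle (in degrees) between sides a and b, via the law of cosines."""
    return math.degrees(math.acos((a**2 + b**2 - c**2) / (2 * a * b)))

# Triangle 1: sides measured very precisely (values in mm, invented).
a1, b1, c1 = 150.0, 200.0, 250.02
print("inferred corner angle:", corner_angle_deg(a1, b1, c1))  # ~90.01 degrees

# Triangle 2 shares that corner. Predict its third side assuming a right angle.
a2, b2 = 210.0, 120.0
c2_pred = math.hypot(a2, b2)
print("predicted third side:", c2_pred)                        # ~241.87 mm

# A coarser measurement (say ±1 mm) should land within uncertainty of the
# prediction; if it doesn't, the "slightly-off right angle" model is falsified.
c2_measured = 242.3
print("within 1 mm of prediction:", abs(c2_measured - c2_pred) <= 1.0)
```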
On to the next section...
I think [Named Distributions] can also be dangerous because of their simplicity, that is, the lack of parameters that can be tuned when fitting them.
Some people seem to think it’s inherently bad to use complex models instead of simple ones when avoidable; I can’t help but think that the people saying this are the same as those that say you shouldn’t quote wikipedia.
I fully support quoting Wikipedia, and it is inherently bad to use complex models instead of simple ones when avoidable. The relevant ideas are in chapter 20 of Jaynes’ Probability Theory: The Logic of Science, or you can read about Bayesian model comparison.

Intuitively, it’s the same idea as conservation of expected evidence: if one model predicts “it will definitely be sunny tomorrow” and another model predicts “it might be sunny or it might rain”, and it turns out to be sunny, then we must update in favor of the first model. In general, when a complex model is consistent with more possible datasets than a simple model, if we see a dataset which is consistent with the simple model, then we must update in favor of the simple model. It’s that simple. Bayesian model comparison quantifies that idea, and gives a more precise tradeoff between quality-of-fit and model complexity.
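As a tiny worked example of that tradeoff (my own toy numbers, not from Jaynes): compare a “simple” model that pins a coin’s bias at 1/2 against a “complex” model that lets the bias be anything in [0, 1]. The complex model spreads its probability over many more possible datasets, so when the data look like what the simple model predicts, the marginal likelihood favors the simple model.

```python
# Toy Bayesian model comparison: fixed-bias coin vs. uniform-prior-bias coin.
from math import comb, log
from scipy.special import betaln

def log_evidence_simple(k, n):
    # P(data | p = 1/2): plain binomial likelihood with the bias fixed.
    return log(comb(n, k)) + n * log(0.5)

def log_evidence_complex(k, n):
    # P(data | p ~ Uniform(0,1)) = C(n, k) * Beta(k + 1, n - k + 1),
    # i.e. the binomial likelihood integrated over the prior on p.
    return log(comb(n, k)) + betaln(k + 1, n - k + 1)

n = 100
for k in (50, 60, 80):  # observed number of heads
    log_bf = log_evidence_simple(k, n) - log_evidence_complex(k, n)
    print(f"{k}/{n} heads: log Bayes factor (simple vs complex) = {log_bf:+.2f}")
# Near 50/100 heads the simple model wins; far from it, the flexibility pays off.
```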
And the latter is usually what we actually use in basic analysis of experimental data—e.g. to decide whether there’s a significant difference between the champagne-drinking group and the non-champagne-drinking group
I never brought up null-hypothesis testing in the liver weight example, and it was not meant to illustrate that… which is why I never brought up the idea of significance.
Mind you, I disagree that significance testing is done correctly, but this is not the argument against it, nor is it related to it.
(The OP also complains that “We can’t determine the interval for which most processes will yield values”. This is not necessarily a problem; there’s like a gazillion versions of the CLT, and not all of them depend on bounding possible values. The generalized CLT even covers heavy-tailed cases like the Cauchy distribution, where the variance is infinite; the limit there is a stable distribution rather than a normal one.)
My argument is not that you can’t come up with a distribution for every little edge case imaginable; my argument is exactly that you CAN and you SHOULD, but this process should be done automatically, because every single problem is different and we have the means to dynamically find the model that best suits each problem rather than stick to choosing between e.g. 60 named distributions.
Even here, we can apply a linearity → normality argument as long as the errors are small relative to curvature.
I fail to see your argument here: I fail to see how it deals with the interconnected bit of my argument, and I fail to see how noise being small is something that ever happens in a real system, in the sense you use it here, i.e. with noise being everything that isn’t the inference we are looking for.
There absolutely is a property of mathematics that tells us what a slightly-off right-angled triangle is: it’s a triangle which satisfies Pythagoras’ formula, to within some uncertainty.
But, by this definition that you use here, any arbitrary thing I want to define mathematically, even if it contains within it some amount of hand-waviness or uncertainty, can be a property of mathematics?
Your article seems to have some assumption that increased complexity == proneness to overfitting.
Which in itself is true if you aren’t validating the model, but if you aren’t validating the model it seems to me that you’re not even in the correct game.
If you are validating the model, I don’t see how the argument holds (will look into the book tomorrow if I have time).
Intuitively, it’s the same idea as conservation of expected evidence: if one model predicts “it will definitely be sunny tomorrow” and another model predicts “it might be sunny or it might rain”, and it turns out to be sunny, then we must update in favor of the first model. In general, when a complex model is consistent with more possible datasets than a simple model, if we see a dataset which is consistent with the simple model, then we must update in favor of the simple model. It’s that simple. Bayesian model comparison quantifies that idea, and gives a more precise tradeoff between quality-of-fit and model complexity.
I fail to understand this argument, and I did previously read the article mentioned here, but maybe it’s just a function of it being 1 AM here; I will try again tomorrow.
Let’s start with the application of the central limit theorem to champagne drinkers. First, there’s the distinction between “liver weights are normally distributed” and “mean of a sample of liver weights is normally distributed”. The latter is much better-justified, since we compute the mean by adding a bunch of (presumably independent) random variables together. And the latter is usually what we actually use in basic analysis of experimental data—e.g. to decide whether there’s a significant difference between the champagne-drinking group and the non-champagne-drinking group. That does not require that liver weights themselves be normally distributed.
I think your statement in bold font is wrong. I think in cases such as champagne drinkers vs non-champagne-drinkers, people are likely to use Student’s two-sample t-test or Welch’s two-sample unequal-variances t-test. These assume that, in both groups, each sample is distributed normally, not that the means are distributed normally.
No, Student’s two-sample t-test does not require that individual samples be distributed normally. You certainly could derive it that way, but it’s not a necessary assumption. All it actually needs is normality of the group means via the CLT—see e.g. here.
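A quick simulation sketch of that claim (sample sizes and distributions are my own choices for illustration): draw both groups from a clearly non-normal distribution with equal means, and Welch’s test still produces roughly its nominal 5% false-positive rate, because the group means are approximately normal by the CLT.

```python
# Welch's t-test on strongly non-normal data: the false-positive rate should
# still come out near the nominal 5%, since only the group means need to be
# approximately normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_trials, n_per_group = 20_000, 50
false_positives = 0

for _ in range(n_trials):
    a = rng.exponential(scale=1.0, size=n_per_group)  # "champagne" group
    b = rng.exponential(scale=1.0, size=n_per_group)  # control group, same mean
    _, p = stats.ttest_ind(a, b, equal_var=False)     # Welch's two-sample t-test
    if p < 0.05:
        false_positives += 1

print("empirical type-I error rate:", false_positives / n_trials)  # ~0.05
```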