So in broad strokes: the smaller a correlation is, the more samples you’re going to need to detect it, so the more samples you take, the more correlations you can detect. For five different human variables, this graph shows the number of samples against the number of correlations detected with them on a log/log scale; from that we infer that a similar slope is likely for intelligence, and so we can use it to guess at how many samples we’ll need to find some number of SNPs for intelligence. Am I handwaving in the right direction?
so the more samples you take, the more correlations you can detect.
Yes, although I’d phrase this more as ‘the more samples you take, the bigger your “budget”, which you can then spend on better estimates of a single variable or, if you prefer, on acceptable-quality estimates of several variables’.
Which one you want depends on what you’re doing: sometimes you want one variable, other times more than one. In my self-experiments, I tend to spend my entire budget on getting good power for detecting changes in a single variable (though I could have spent my data budget in several ways: on smaller alphas, or smaller effect sizes, or detecting changes to multiple variables). Genomics studies like these, however, aren’t interested so much in singling out any particular gene and studying it in close detail as in finding ‘any relevant gene at all, and as many as possible’.
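As a rough sketch of the ‘smaller correlation, more samples’ point, here is what the standard Fisher z approximation implies about the sample size needed to detect a correlation r at two-sided alpha = 0.05 with 80% power (the helper n.for.r is purely illustrative):

# Approximate n needed to detect a correlation r at two-sided alpha = 0.05
# with 80% power, via the Fisher z transformation.
n.for.r <- function(r, alpha = 0.05, power = 0.80) {
    ceiling(((qnorm(1 - alpha/2) + qnorm(power)) / atanh(r))^2 + 3)
}
sapply(c(0.5, 0.1, 0.05, 0.01), n.for.r)
# roughly 30, 783, 3138, 78488: for small r, halving r roughly quadruples n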
And there’s a “budget” because if you “double-spend”, you end up with the XKCD green acne jelly beans?
Eh, I’m not sure the idea of ‘double-spending’ really applies here. In the multiple-comparisons case, you’re spending all your budget on detecting the observed effect size and getting high power / reducing Type II errors (if there’s an effect lurking there, you’ll find it!), but you then can’t buy as much Type I error reduction as you want.
This could be fine in some applications. For example, when I’m A/B testing visual changes to gwern.net, I don’t care if I commit a Type I error, because if I replace one doohickey with another doohickey and they work equally well (the null hypothesis), all I’ve lost is a little time. What I am worried about is a Type II error: coming up with an improvement, testing it, and mistakenly concluding it isn’t an improvement when it actually is.
The problem with multiple comparisons comes when people don’t realize they’ve used up their budget and they believe they really have controlled alpha errors at 5% or whatever. When they think they’ve had their cake & eaten it too.
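A minimal simulation of that overspent budget, assuming 20 pure-noise variables each tested at a nominal 5%:

# With 20 independent null variables each tested at alpha = 0.05, the chance
# of at least one spurious 'discovery' is 1 - 0.95^20, about 64%.
set.seed(2023)
k <- 20
false.alarm <- replicate(2000, {
    p <- replicate(k, t.test(rnorm(40), rnorm(40))$p.value)
    any(p < 0.05)
})
mean(false.alarm)   # ~0.64, not the 5% each individual test advertises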
I guess a better financial analogy would be more like “you spend all your money on the new laptop you need for work, but not having checked your bank account balance, promise to take your friends out for dinner tomorrow”?
I am a bit confused—is the framework for this thread observation (where the number of samples is pretty much the only thing you can affect pre-analysis) or experiment design (where you can greatly affect which data you collect)?
I ask because I’m intrigued by the idea of trading off Type I errors against Type II errors, but I’m not sure it’s possible in the observation context without introducing bias.
I’m not sure about this observation vs experiment design dichotomy you’re thinking of. I think of power analysis as something which can be done both before an experiment, to design it and understand what the data could tell one, and post hoc, to understand why you did or did not get a result and to estimate things for designing the next experiment.
Well, I think of statistical power as the ability to distinguish signal from noise. If you expect signal of a particular strength you need to find ways to reduce the noise floor to below that strength (typically through increasing sample size).
However my standard way of thinking about this is: we have data, we build a model, we evaluate how good the model output is. Building a model, say, via some sort of maximum likelihood, gives you “the” fitted model with specific chances of committing a Type I or a Type II error. But can you trade off chances of Type I errors against chances of Type II errors other than through crudely adding bias to the model output?
But can you trade off chances of Type I errors against chances of Type II errors other than through crudely adding bias to the model output?
Model-building seems like a separate topic. Power analysis is for particular approaches, and there I certainly can trade off Type I against Type II. Here’s a simple example for a two-group t-test, where I accept a higher Type I error rate and immediately see my Type II rate go down (i.e. my power go up):
R> power.t.test(n=40, delta=0.5, sig.level=0.05)

     Two-sample t test power calculation

              n = 40
          delta = 0.5
             sd = 1
      sig.level = 0.05
          power = 0.5981
    alternative = two.sided

NOTE: n is number in *each* group

R> power.t.test(n=40, delta=0.5, sig.level=0.10)

     Two-sample t test power calculation

              n = 40
          delta = 0.5
             sd = 1
      sig.level = 0.1
          power = 0.7163
    alternative = two.sided

NOTE: n is number in *each* group
In exchange for accepting 10% Type I rather than 5%, I see my Type II fall from 1-0.60=40% to 1-0.72=28%. Tada, I have traded off errors and as far as I know, the t-test remains exactly as unbiased as it ever was.
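The same numbers can be checked by brute force with a quick simulation of the identical design:

# Simulate the two-group design (n = 40 per group, true delta = 0.5, sd = 1)
# and count rejections at each alpha.
set.seed(1)
p.values <- replicate(4000, t.test(rnorm(40, 0), rnorm(40, 0.5))$p.value)
mean(p.values < 0.05)   # empirical power ~0.60 at alpha = 0.05
mean(p.values < 0.10)   # empirical power ~0.72 at alpha = 0.10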
I am not explaining myself well. Let me try again.
To even talk about Type I / II errors you need two things—a hypothesis or a prediction (generally, the output of a model, possibly an implicit one) and reality (unobserved at prediction time). Let’s keep things very simple and deal with binary variables: say we have an object foo and we want to know whether it belongs to class bar (or does not). We have a model, maybe a simple and even trivial one, which, when fed the object foo, outputs the probability of it belonging to class bar. Let’s say this probability is 92%.
Now, at this point we are still in probability land. Saying that “foo belongs to class bar with a probability of 92%” does not subject us to Type I / II errors. It’s only when we commit to the binary outcome and say “foo belongs to class bar, full stop” that they appear.
The point is that in probability land you can’t trade off Type I error against Type II—you just have the probability (or a full distribution in the more general case). It’s the commitment to a certain outcome on the basis of an arbitrarily picked threshold that gives rise to them. And if so, it is that threshold (e.g. traditionally 5%) that determines the trade-off between errors. Changing the threshold changes the trade-off, but this doesn’t affect the model or its output; it’s all post-prediction interpretation.
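A toy sketch of that threshold point, with made-up continuous scores standing in for the model’s output on foo-like objects:

# The model's output is fixed; only the cutoff we impose on it moves the errors.
set.seed(7)
not.bar <- rnorm(10000, mean = 0)   # scores for objects that are not 'bar'
bar     <- rnorm(10000, mean = 1)   # scores for objects that are 'bar'
for (cut in c(0.2, 0.5, 0.8)) {
    type1 <- mean(not.bar > cut)    # false positives: called 'bar', isn't
    type2 <- mean(bar <= cut)       # false negatives: is 'bar', not called
    cat(sprintf("cutoff %.1f: Type I %.2f, Type II %.2f\n", cut, type1, type2))
}
# Raising the cutoff buys fewer Type I errors at the price of more Type II.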
So you’re trying to talk about overall probability distributions in a Bayesian framework? I haven’t ever done power analysis with that approach, so I don’t know what would be analogous to Type I and II errors and whether one can trade them off; in fact, the only paper I can recall discussing how one does it is Kruschke’s paper (starting on pg11) - maybe he will be helpful?
Not necessarily in the Bayesian framework, though it’s kinda natural there. You can think in terms of complete distributions within the frequentist framework perfectly well, too.
The issue we started with was that of statistical power, right? While it’s technically defined in terms of the usual significance (=rejecting the null hypothesis), you can think about it in broader terms. Essentially it’s the capability to detect a signal (of certain effect size) in the presence of noise (in certain amounts) with a given level of confidence.
Thanks for the paper, I’ve seen it before but didn’t have a handy link to it.
You can think in terms of complete distributions within the frequentist framework perfectly well, too.
Does anyone do that, though?
Essentially it’s the capability to detect a signal (of certain effect size) in the presence of noise (in certain amounts) with a given level of confidence.
Well, if you want to think of it like that, you could probably formulate all of this in information-theoretic terms and speak of needing a certain number of bits; then the sample size & effect size interact to say how many bits each n contains. So a binary variable contains a lot less than a continuous variable, a shift in a rare observation like 90⁄10 is going to be harder to detect than a shift in a 50⁄50 split, etc. That’s not stuff I know a lot about.
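To make the ‘bits per observation’ hand-waving slightly more concrete, a small sketch: the Shannon entropy of a 90/10 binary variable is well under half that of a 50/50 one.

# Bits carried by one observation of a binary variable.
H <- function(p) -sum(p * log2(p))
H(c(0.5, 0.5))   # 1 bit
H(c(0.9, 0.1))   # ~0.47 bits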
Well, sure. The frequentist approach, aka mainstream statistics, deals with distributions all the time and the arguments about particular tests or predictions being optimal, or unbiased, or asymptotically true, etc. are all explicitly conditional on characteristics of underlying distributions.
Well, if you want to think of it like that, you could probably formulate all of this in information-theoretic terms and speak of needing a certain number of bits;
Yes, something like that. Take a look at Fisher information, e.g. “The Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ upon which the probability of X depends.”
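A minimal numerical illustration of that definition, using the simplest case of a Normal mean with known sd = 1, where the Fisher information per observation is 1 and so the variance of the sample mean should be about 1/n:

# For X ~ Normal(mu, sd = 1), I(mu) = 1 per observation, so the Cramer-Rao
# bound says Var(sample mean) should be about 1/(n * I(mu)) = 1/n.
set.seed(42)
n <- 25
means <- replicate(10000, mean(rnorm(n, mean = 0, sd = 1)))
var(means)   # ~0.04
1 / n        # the information bound, 0.04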
Many thanks for this!