This graph displays the number of GWAS hits versus sample size for height, BMI, etc. Once the minimal sample size to discover the alleles of largest impact (large MAF, large effect size) is exceeded, one generally expects a steady accumulation of new hits at lower MAF / effect size. I expect the same sort of progress for g. (MAF = Minor Allele Frequency. Variants that are common in the population are easier to detect than rare variants.)
We can’t predict the sample size required to obtain most of the additive variance for g (this depends on the details of the distribution of alleles), but I would guess that about a million genotypes together with associated g scores will suffice. When, exactly, we will reach this sample size is unclear, but I think most of the difficulty is in obtaining the phenotype data. Within a few years, over a million people will have been genotyped, but probably we will only have g scores for a small fraction of the individuals.
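To put a number on the MAF point, here is a minimal sketch in R; it assumes Hardy-Weinberg genotype frequencies, and the per-allele effect size used is an arbitrary illustrative value, not one from any study:

    # Variance in the trait explained by a single SNP with minor-allele frequency
    # `maf` and per-allele effect `beta` (in SD units), assuming Hardy-Weinberg:
    var_explained <- function(maf, beta) 2 * maf * (1 - maf) * beta^2

    # Same effect size, common vs rare variant:
    var_explained(0.5, 0.02) / var_explained(0.1, 0.02)
    # ~2.8: the association test's noncentrality scales with n times variance
    # explained, so the rarer variant needs roughly 2.8x the sample for equal power.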
I’ll try to explain it in different terms. What you are looking at is a graph of ‘results vs effort’. How much work do you have to do to get out some useful results? The importance of this is that it’s showing you a visual version of statistical power analysis (introduction).
Ordinary power analysis is about examining the inherent zero-sum trade-offs of power vs sample size vs effect size vs statistical-significance, where you try to optimize each for your particular purpose; so for example, you can choose to have a small (=cheap) sample size and a small Type I (false positives) error rate in detecting a small effect size—as long as you don’t mind a huge Type II error rate (low power, false negatives, failure to detect real effects).
If you look at my nootropics or sleep experiments, you’ll see I do power analysis all the time as a way of understanding how big my experiments need to be before they are not worthlessly uninformative; if your sample size is too small, you simply won’t observe anything, even if there really is an effect (eg. you might conclude, ‘with such a small n as 23, at the predicted effect size and the usual alpha of 0.05, our power will be very low, like 10%, so the experiment would be a waste of time’).
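For a rough sketch of that hypothetical calculation: the d=0.2 effect size here is my assumption, picked to reproduce the ‘like 10%’ figure, and a two-sample design is assumed for simplicity:

    power.t.test(n = 23, delta = 0.2, sig.level = 0.05)
    # power comes out around 0.10: with 23 per group, a d = 0.2 effect
    # goes undetected roughly 90% of the time.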
Even though we know intelligence is very influenced by genes, you can’t find ‘the genes for intelligence’ by looking at just 10 people—but how many do you need to look at?
In the case of the graph, the statistical-significance is hardwired & the effect sizes are all known to be small, and we ignore power, so that leaves two variables: sample size and number of null-rejection/findings. The graph shows us simply that as we get a larger sample, we can successfully find more associations (because we have more power to get a subtle genetic effect to pass our significance cutoffs). Simple enough. It’s not news to anyone that the more data you collect, the more results you get.
What’s useful here is that the slope of the points is encoding the joint relationship of power & significance & effect size for genetic findings, so we can simply vary sample size and spit out an estimated number of findings. The intercept remains uncertain, though. What Hsu finds so important about this graph is that it lets us predict for intelligence how many hits we will get at any sample size, once we have a datapoint which nails down a unique line. What’s the datapoint? Well, he mentions the very interesting recent findings of ~3 associations—which happened at n=126k. So let’s plot this IQ datapoint, guessing at roughly where it would go (please pardon my Paint usage):
OK, but how does that let Hsu predict anything? Well, the slope ought to be the same for future IQ findings, since the procedures are basically the same. So all we have to do is guess at the line, and anchor it on this new finding:
So if you want to know what we’ll find at 200000 samples, you extend the line and it looks like we’ll have ~10 SNPs at that point. Or, if you wanted to know when we’ll have found 100 SNPs for intelligence, you simply continue extending the line until it reaches 100 on the y-axis, which apparently Hsu thinks will happen somewhere around 1000000 on the x-axis (which extends off the screen because no one has collected that big a sample yet for anything else, much less intelligence).
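To make the extrapolation mechanical, here is a minimal sketch in R; the slope is an assumption (something you would read off the height/BMI points of the graph), and only the anchor (~3 hits at n=126k) comes from the discussion above:

    slope <- 1.7                          # assumed log-log slope, picked for illustration
    n0 <- 126e3; hits0 <- 3               # anchor: the ~3 IQ associations at n = 126k
    hits_at <- function(n) hits0 * (n / n0)^slope
    hits_at(2e5)                          # high single digits with this slope (the ~10 above is eyeballed off the graph)
    n_needed <- function(hits) n0 * (hits / hits0)^(1 / slope)
    n_needed(100)                         # roughly 1 million samples for ~100 hits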
I hope that helps; if you don’t understand power, it might help to look at my own little analyses where the problem is usually much simpler.
Many thanks for this! So in broad strokes: the smaller a correlation is, the more samples you’re going to need to detect it, so the more samples you take, the more correlations you can detect. For five different human variables, this graph shows number of samples against number of correlations detected with them on a log/log scale; from that we infer that a similar slope is likely for intelligence, and so we can use it to take a guess at how many samples we’ll need to find some number of SNPs for intelligence. Am I handwaving in the right direction?
so the more samples you take, the more correlations you can detect.
Yes, although I’d phrase this more as ‘the more samples you take, the bigger your “budget”, which you can then spend on better estimates of a single variable or if you prefer, acceptable-quality estimates of several variables’.
Which one you want depends on what you’re doing. Sometimes you want one variable, other times you want more than one variable. In my self-experiments, I tend to spend my entire budget on getting good power on detecting changes in a single variable (but I could have spent my data budget in several ways: on smaller alphas or smaller effect sizes or detecting changes to multiple variables). Genomics studies like these, however, aren’t interested so much in singling out any particular gene and studying it in close detail, but finding ‘any relevant gene at all and as many as possible’.
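As a rough illustration of the ‘budget’ (the n=100 and the 10-variable Bonferroni correction are arbitrary choices for the example): with a fixed sample, demanding a stricter alpha means only larger effects remain detectable at a given power:

    power.t.test(n = 100, power = 0.8, sig.level = 0.05)$delta     # ~0.40
    power.t.test(n = 100, power = 0.8, sig.level = 0.05 / 10)$delta # ~0.52: same data, stricter alpha,
                                                                    # so only bigger effects are detectable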
And there’s a “budget” because if you “double-spend”, you end up with the XKCD green acne jelly beans?
Eh, I’m not sure the idea of ‘double-spending’ really applies here. In the multiple comparisons case, you’re spending all your budget on detecting the observed effect size and getting high-power/reducing-Type-II-errors (if there’s an effect lurking there, you’ll find it!), but you then can’t buy as much Type I error reduction as you want.
This could be fine in some applications. For example, when I’m A/B testing visual changes to gwern.net, I don’t care if I commit a Type I error, because if I replace one doohickey with another doohickey and they work equally well (the null hypothesis), all I’ve lost is a little time. I’m worried about coming up with an improvement, testing the improvement, and mistakenly believing it isn’t an improvement when actually it is.
The problem with multiple comparisons comes when people don’t realize they’ve used up their budget and they believe they really have controlled alpha errors at 5% or whatever. When they think they’ve had their cake & eaten it too.
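To put numbers on that (assuming independent tests; the 20 is just the jelly-bean example’s count):

    alpha <- 0.05
    m <- 20                       # e.g. the 20 jelly-bean colours
    1 - (1 - alpha)^m             # ~0.64: probability of at least one spurious 'finding'
    alpha / m                     # 0.0025: a Bonferroni-corrected threshold that restores ~5% overall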
I guess a better financial analogy would be more like “you spend all your money on the new laptop you need for work, but not having checked your bank account balance, promise to take your friends out for dinner tomorrow”?
I am a bit confused—is the framework for this thread observation (where the number of samples is pretty much the only thing you can affect pre-analysis) or experiment design (where you can greatly affect which data you collect)?
I ask because I’m intrigued by the idea of trading off Type I errors against Type II errors, but I’m not sure it’s possible in the observation context without introducing bias.
I’m not sure about this observation vs experiment design dichotomy you’re thinking of. I think of power analysis as something which can be done both before an experiment to design it and understand what the data could tell one, and post hoc, to understand why you did or did not get a result and to estimate things for designing the next experiment.
Well, I think of statistical power as the ability to distinguish signal from noise. If you expect signal of a particular strength you need to find ways to reduce the noise floor to below that strength (typically through increasing sample size).
However my standard way of thinking about this is: we have data, we build a model, we evaluate how good the model output is. Building a model, say, via some sort of maximum likelihood, gives you “the” fitted model with specific chances to commit a Type I or a Type II error. But can you trade off chances of Type I errors against chances of Type II errors other than through crudely adding bias to the model output?
But can you trade off chances of Type I errors against chances of Type II errors other than through crudely adding bias to the model output?
Model-building seems like a separate topic. Power analysis is for particular approaches, where I certainly can trade off Type I against Type II. Here’s a simple example for a two-group t-test, where I accept a higher Type I error rate and immediately see my Type II go down (power go up):
R> power.t.test(n=40, delta=0.5, sig.level=0.05)

     Two-sample t test power calculation

              n = 40
          delta = 0.5
             sd = 1
      sig.level = 0.05
          power = 0.5981
    alternative = two.sided

NOTE: n is number in *each* group

R> power.t.test(n=40, delta=0.5, sig.level=0.10)

     Two-sample t test power calculation

              n = 40
          delta = 0.5
             sd = 1
      sig.level = 0.1
          power = 0.7163
    alternative = two.sided

NOTE: n is number in *each* group
In exchange for accepting 10% Type I rather than 5%, I see my Type II fall from 1-0.60=40% to 1-0.72=28%. Tada, I have traded off errors and as far as I know, the t-test remains exactly as unbiased as it ever was.
I am not explaining myself well. Let me try again.
To even talk about Type I / II errors you need two things—a hypothesis or a prediction (generally, the output of a model, possibly implicit) and reality (unobserved at prediction time). Let’s keep things very simple and deal with binary variables: say we have an object foo and we want to know whether it belongs to class bar (or does not). We have a model, maybe simple and even trivial, which, when fed the object foo, outputs the probability of it belonging to class bar. Let’s say this probability is 92%.
Now, at this point we are still in probability land. Saying that “foo belongs to class bar with a probability of 92%” does not subject us to Type I / II errors. It’s only when we commit to the binary outcome and say “foo belongs to class bar, full stop” that they appear.
The point is that in probability land you can’t trade off Type I error against Type II—you just have the probability (or a full distribution in the more general case). It’s the commitment to a certain outcome on the basis of an arbitrarily picked threshold that gives rise to them. And if so, it is that threshold (e.g. the traditional 5%) that determines the trade-off between errors. Changing the threshold changes the trade-off, but this doesn’t affect the model and its output; it’s all post-prediction interpretation.
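A tiny simulation of that point (the scores are made-up data standing in for a model’s output; the model never changes, only the cutoff does):

    set.seed(1)
    is_bar <- rbinom(10000, 1, 0.5)                            # truth: does the object belong to class bar?
    score  <- rnorm(10000, mean = ifelse(is_bar == 1, 1, 0))   # the model's (noisy) evidence
    error_rates <- function(cutoff) {
      say_bar <- score > cutoff
      c(type_I  = mean(say_bar[is_bar == 0]),                  # false positives among non-bar objects
        type_II = mean(!say_bar[is_bar == 1]))                 # false negatives among bar objects
    }
    sapply(c(0.2, 0.5, 0.8), error_rates)                      # raising the cutoff trades Type I down, Type II up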
So you’re trying to talk about overall probability distributions in a Bayesian framework? I haven’t ever done power analysis with that approach, so I don’t know what would be analogous to Type I and II errors and whether one can trade them off; in fact, the only paper I can recall discussing how one does it is Kruschke’s paper (starting on pg11) - maybe he will be helpful?
Not necessarily in the Bayesian framework, though it’s kinda natural there. You can think in terms of complete distributions within the frequentist framework perfectly well, too.
The issue that we started with was of statistical power, right? While it’s technically defined in terms of the usual significance (=rejecting the null hypothesis), you can think about it in broader terms. Essentially it’s the capability to detect a signal (of certain effect size) in the presence of noise (in certain amounts) with a given level of confidence.
Thanks for the paper, I’ve seen it before but didn’t have a handy link to it.
You can think in terms of complete distributions within the frequentist framework perfectly well, too.
Does anyone do that, though?
Essentially it’s the capability to detect a signal (of certain effect size) in the presence of noise (in certain amounts) with a given level of confidence.
Well, if you want to think of it like that, you could probably formulate all of this in information-theoretic terms and speak of needing a certain number of bits; then the sample size & effect size interact to say how many bits each n contains. So a binary variable contains a lot less than a continuous variable, a shift in a rare observation like 90⁄10 is going to be harder to detect than a shift in a 50⁄50 split, etc. That’s not stuff I know a lot about.
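One loose way to put numbers on the ‘bits per observation’ intuition (this is just the entropy of the binary variable itself, not a full power calculation):

    entropy_bits <- function(p) -(p * log2(p) + (1 - p) * log2(1 - p))
    entropy_bits(0.5)    # 1 bit per observation
    entropy_bits(0.9)    # ~0.47 bits: a 90/10 variable carries less information per observation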
Well, sure. The frequentist approach, aka mainstream statistics, deals with distributions all the time and the arguments about particular tests or predictions being optimal, or unbiased, or asymptotically true, etc. are all explicitly conditional on characteristics of underlying distributions.
Well, if you want to think of it like that, you could probably formulate all of this in information-theoretic terms and speak of needing a certain number of bits;
Yes, something like that. Take a look at Fisher information, e.g. “The Fisher information is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ upon which the probability of X depends.”
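For instance (a standard textbook case, not specific to the GWAS setting): for n iid Normal(mu, sigma^2) observations, the Fisher information about mu is n/sigma^2, and its inverse is the familiar sigma^2/n variance of the sample mean, so each extra observation buys a fixed amount of information:

    fisher_info_mu <- function(n, sigma) n / sigma^2
    1 / fisher_info_mu(n = 100, sigma = 1)   # 0.01: the Cramer-Rao lower bound, = variance of the sample mean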