The paper you called the largest-ever GWAS gave a direct h^2 estimate of 0.05 for cognitive performance. How are these papers getting 0.2? I don’t understand what they’re doing. Some type of meta-analysis?
The test-retest reliability you linked has different reliabilities for different subtests. The correct adjustment depends on which subtests are being used. If cognitive performance is some kind of sum score of the subtests, its reliability would be higher than that of the individual subtests.
Also, I don’t think the calculation 0.2*(0.9/0.6)^2 is the correct adjustment. A test-retest correlation is already essentially the square of the correlation of the test with an underlying latent factor (both the test AND the retest have error). E.g. if a test T can be written as
T = a*X + sqrt(1 - a^2)*E
where X is ability and E is error (all with standard deviation 1 and the error independent of the ability), then the correlation of T with a resample of T (with new independent error but the same ability) would be a^2. But h^2 is attenuated by that same factor a^2, so the adjustment should be proportional to the test-retest correlation, not to the square of the test-retest correlation. Am I getting this wrong?
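To make this concrete, here is a minimal simulation sketch; the values a = 0.77 and a true h^2 of 0.3 are purely illustrative, not numbers from the post:

```python
# Check that the test-retest correlation comes out as a^2 (the reliability),
# and that observed h^2 is attenuated by that same factor, so the correction
# divides by the test-retest correlation rather than by its square.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
a = 0.77  # correlation of the test with latent ability (illustrative)

ability = rng.standard_normal(n)
test1 = a * ability + np.sqrt(1 - a**2) * rng.standard_normal(n)
test2 = a * ability + np.sqrt(1 - a**2) * rng.standard_normal(n)
print(np.corrcoef(test1, test2)[0, 1])  # ~ a^2 = 0.59 (the test-retest reliability)

# Attenuation of h^2: give ability a genetic component with true h^2 = 0.3.
h2_true = 0.3
genetic = rng.standard_normal(n)
ability2 = np.sqrt(h2_true) * genetic + np.sqrt(1 - h2_true) * rng.standard_normal(n)
test = a * ability2 + np.sqrt(1 - a**2) * rng.standard_normal(n)
print(np.corrcoef(genetic, test)[0, 1] ** 2)  # ~ h2_true * a^2 = 0.18
```

The observed h^2 comes out as the true h^2 times the test-retest correlation, which is the point above.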
You’re mixing up h^2 estimates with predictor R^2 performance. It’s possible to get an estimate of h^2 with much less statistical power than it takes to build a predictor that good.
“Fluid IQ” was the only subtest used.
Good catch, we’ll fix this when we revise the post.
Thanks. I understand now. But isn’t the R^2 the relevant measure? You don’t know which genes to edit to get the h^2 number (nor do you know what to select on). You’re doing the calculation 0.2*(0.9/0.6)^2 when the relevant calculation is something like 0.05*(0.9/0.6). That’s off by a factor of 6 for the power of selection, or sqrt(6) ≈ 2.45 for the power of editing.
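Spelling out that arithmetic:

```python
original = 0.2 * (0.9 / 0.6) ** 2   # the post's adjustment: h^2 estimate, reliability ratio squared
proposed = 0.05 * (0.9 / 0.6)       # suggested alternative: predictor R^2, ratio unsquared
print(original / proposed)          # 6.0  -> factor of 6 on the variance scale (selection)
print((original / proposed) ** 0.5) # ~2.45 -> factor on the SD scale (editing)
```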
Not for this purpose! The simulation pipeline is as follows: the assumed h^2 and number of causal variants are used to generate the genetic effects → generate simulated GWASes for a range of sample sizes → infer causal effects from the observed GWASes → select the top expected-effect variants for up to N (expected) edits.
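A minimal sketch of that pipeline, with toy parameter values and a crude significance threshold standing in for the actual fine-mapping step (none of these numbers are the ones used in the post):

```python
# assumed h^2 + number of causal variants -> simulated effects -> simulated GWAS
# at a given sample size -> inferred effects -> top-N (expected) edits.
import numpy as np

rng = np.random.default_rng(1)
n_snps, n_causal, h2, n_samples, n_edits = 500, 20, 0.19, 20_000, 10

# 1. Sparse causal effects consistent with the assumed h^2.
beta = np.zeros(n_snps)
causal = rng.choice(n_snps, n_causal, replace=False)
beta[causal] = rng.normal(0, np.sqrt(h2 / n_causal), n_causal)

# 2. Simulated genotypes/phenotypes and a marginal GWAS at this sample size.
geno = rng.standard_normal((n_samples, n_snps))   # standardized, unlinked SNPs
pheno = geno @ beta + rng.normal(0, np.sqrt(1 - h2), n_samples)
beta_hat = geno.T @ pheno / n_samples             # marginal effect estimates
se = 1 / np.sqrt(n_samples)

# 3. "Infer" effects (crude: zero out estimates within noise; the real pipeline
#    uses a Bayesian fine-mapper here).
inferred = np.where(np.abs(beta_hat) > 3 * se, beta_hat, 0.0)

# 4. Pick the top-N variants by inferred effect magnitude as the edit targets.
top = np.argsort(-np.abs(inferred))[:n_edits]
print("selected variants:", top)
print("sum of inferred effect magnitudes (toy units):", np.abs(inferred[top]).sum())
```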
I’m talking about this graph:
What are the calculations used for this graph? The text says to see the appendix, but the appendix does not actually explain how you got this graph.
This is based on inferring causal effects conditional on this GWAS. The assumed heritability affects the prior over SNP effect sizes.
I don’t understand. Can you explain how you’re inferring the SNP effect sizes?
With a method similar to this. You can easily compute the exact likelihood function P(GWAS results | SNP effects), which, when combined with a prior over SNP effects (informed by what we know about the genetic architecture of the trait), gives you a posterior probability of each SNP being causal (having a nonzero effect) and its expected effect size conditional on being causal. (You can’t actually calculate the full posterior, since there are 2^|SNPs| possible combinations of SNPs with nonzero effects, so you need to do some sort of MCMC or stochastic search.) We may make a post going into more detail on our methods at some point.
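For intuition, here is a stripped-down version of that idea for the special case of unlinked SNPs, where the posterior factorizes per SNP and can be written in closed form (with LD it doesn’t, which is where the MCMC/stochastic search comes in). The spike-and-slab prior and the example numbers are illustrative assumptions on my part, not the actual settings:

```python
# Per-SNP posterior under a spike-and-slab prior with a Gaussian GWAS likelihood:
# gives a posterior inclusion probability (PIP) and an expected effect size.
import numpy as np
from scipy.stats import norm

def finemap_unlinked(beta_hat, se, pi_causal, slab_var):
    """beta_hat, se: per-SNP GWAS effect estimates and standard errors.
    pi_causal: prior probability that a SNP is causal.
    slab_var: prior variance of a causal SNP's effect (e.g. h^2 / n_causal)."""
    like_null = norm.pdf(beta_hat, 0, se)                         # P(data | effect = 0)
    like_slab = norm.pdf(beta_hat, 0, np.sqrt(se**2 + slab_var))  # P(data | causal)
    pip = pi_causal * like_slab / (pi_causal * like_slab + (1 - pi_causal) * like_null)
    effect_if_causal = beta_hat * slab_var / (slab_var + se**2)   # posterior shrinkage
    return pip, pip * effect_if_causal                            # PIP, expected effect

# Illustrative numbers: 20k causal SNPs and SNP h^2 = 0.19 (the choices stated
# later in the thread); the 1M total SNP count is a placeholder.
pi = 20_000 / 1_000_000
slab_var = 0.19 / 20_000
beta_hat = np.array([0.0005, 0.004, -0.011])
se = np.full(3, 0.002)
print(finemap_unlinked(beta_hat, se, pi, slab_var))
```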
You should show your calculation or your code, including all the data and parameter choices. Otherwise I can’t evaluate this.
I assume you’re picking parameters to exaggerate the effects, because the corrections you’ve already conceded (0.9/0.6 shouldn’t be squared, and the attenuation to get direct effects should be 0.824) already imply the results were overstated by a factor of sqrt(0.9/0.6)/0.824 for editing, which is around a 50% overestimate.
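Worked out, that combined factor is:

```python
combined = (0.9 / 0.6) ** 0.5 / 0.824  # sqrt of the reliability ratio, divided by the direct-effect attenuation
print(combined)                        # ~1.49, i.e. roughly a 50% overestimate on the editing (SD) scale
```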
I don’t think that was deliberate on your part, but I think wishful thinking and the desire to paint a compelling story (and get funding) are causing you to be biased in what you adjust for and in which mistakes you catch. It’s natural in your position to scrutinize low estimates but not high ones. So to trust your numbers I’d need to understand how you got them.
There is one saving grace for us, which is that the predictor we used is significantly less powerful than ones we know to exist.
I think when you account for the squaring issue, the indirect-effect issue, and the more powerful predictors, they’re going to roughly cancel out.
Granted, the more powerful predictor itself isn’t published, so we can’t rigorously evaluate it either, which isn’t ideal. I think the way to deal with this is to show a few lines: one for the “current publicly available GWAS”, one showing a rough estimate of the gain using the privately developed predictor (which with enough work we could probably replicate), and then one or two more for different amounts of data.
All of this together WILL still reduce the “best case scenario” from editing relative to what we originally published (because with the better predictor we’re closer to “perfect knowledge” than we were with the previous predictor).
At some point we’re going to re-run the calculations and publish an actual proper writeup on our methodology (likely with our code).
Also I just want to say thank you for taking the time to dive deep into this with us. One of the main reasons I post on LessWrong is because there is such high quality feedback relative to other sites.
The code is pretty complicated and not something I’d expect a non-expert (even a very smart one) to be able to quickly check over; it’s not just a 100-line Python script. (The same goes for a very smart expert, for that matter; really it’s anyone who isn’t already familiar with our particular codebase.) We’ll likely open source it at some point in the future, possibly soon, but that’s not decided yet. Our finemapping (inferring causal effects) procedure produces ~identical results to the software from the paper I linked above when run on the same test data (though we handle some additional things that finemapper doesn’t, like variable per-SNP sample sizes and missing SNPs, which is why we didn’t just use it).
The parameter choices which determine the prior over SNP effects are the number of causal SNPs (which we set to 20,000) and the SNP heritability of the phenotype (which we set to 0.19, as per the GWAS we used). The erroneous effect size adjustment was done at the end to convert from the effect sizes of the GWAS phenotype (low reliability IQ test) to the effect sizes corresponding to the phenotype we care about (high reliability IQ test).
We want to publish a more detailed write-up of our methods soon(ish), but it’s going to be a fair bit of work, so don’t expect it overnight.
Yep, fair enough. I’ve noticed myself doing this sometimes and I want to cut it out. That said, I don’t think small-ish predictable overestimates of the effect sizes are going to change the qualitative picture, since with good enough data and a few hundred to a thousand edits we can boost predicted IQ by >6 SD even with much more pessimistic assumptions, which probably isn’t even safe to do (I’m not sure I expect additivity to hold that far). I’m much more worried about basic problems with our modelling assumptions, e.g. the assumption of sparse causal SNPs with additive effects and no interactions (what if rare haplotypes are deleterious due to interactions that don’t show up in GWAS because those combinations are rare?).