I don’t understand. Can you explain how you’re inferring the SNP effect sizes?
I’m talking about this graph:
What are the calculations used for this graph? The text says to see the appendix, but the appendix does not actually explain how you got this graph.
You’re mixing up h^2 estimates with predictor R^2 performance. It’s possible to get an estimate of h^2 with much less statistical power than it takes to build a predictor that good.
Thanks. I understand now. But isn’t the R^2 the relevant measure? You don’t know which genes to edit to get the h^2 number (nor do you know what to select on). You’re doing the calculation 0.2*(0.9/0.6)^2 when the relevant calculation is something like 0.05*(0.9/0.6). Off by a factor of 6 for the power of selection, or sqrt(6)=2.45 for the power of editing
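For concreteness, here is the arithmetic behind the factor-of-6 claim, using only the numbers above (the labels for what each number represents are my reading of the discussion):

```python
# 0.2 -> the h^2 estimate used in the OP; 0.05 -> the predictor/direct R^2; 0.9/0.6 -> reliability adjustment
used     = 0.2  * (0.9 / 0.6) ** 2   # the calculation in the OP (squared adjustment)
relevant = 0.05 * (0.9 / 0.6)        # the calculation I think is relevant (linear adjustment)

print(used, relevant)                # about 0.45 and 0.075
print(used / relevant)               # about 6    -> factor of 6 for the power of selection
print((used / relevant) ** 0.5)      # about 2.45 -> sqrt(6) for the power of editing
```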
The paper you called the largest-ever GWAS gave a direct h^2 estimate of 0.05 for cognitive performance. How are these papers getting 0.2? I don’t understand what they’re doing. Some type of meta-analysis?
The test-retest reliability you linked has different reliabilities for different subtests. The correct adjustment depends on which subtests are being used. If cognitive performance is some kind of sum score of the subtests, its reliability would be higher than that of the individual subtests.
Also, I don’t think the calculation 0.2*(0.9/0.6)^2 is the correct adjustment. A test-retest correlation is already essentially the square of a correlation of the test with an underlying latent factor (both the test AND the retest have error). E.g. if a test T can be written as
T = aX + sqrt(1-a^2)E
where X is ability and E is error (all with standard deviation 1 and the error independent of the ability), then a correlation of T with a resample of T (with new independent error but same ability) would be a^2. But the adjustment to h^2 should be proportional to a^2, so it should be proportional to the test-retest correlation, not the square of the test-retest correlation. Am I getting this wrong?
Thanks! I understand their numbers a bit better, then. Still, direct effects of cognitive performance explain 5% of variance. Can’t multiply the variance explained of EA by the attenuation of cognitive performance!
Do you have evidence for direct effects of either one of them being higher than 5% of variance?
I don’t quite understand your numbers in the OP but it feels like you’re inflating them substantially. Is the full calculation somewhere?
You should decide whether you’re using a GWAS on cognitive performance or on educational attainment (EA). This paper you linked is using a GWAS for EA, and finding that very little of the predictive power was direct effects. Exactly the opposite of your claim:
For predicting EA, the ratio of direct to population effect estimates is 0.556 (s.e. = 0.020), implying that 100% × 0.556^2 = 30.9% of the PGI’s R^2 is due to its direct effect.
Then they compare this to cognitive performance. For cognitive performance, the ratio was better, but it’s not 0.824. And even the better ratio they report is possibly too high: the table in figure 4 has a ratio that looks much smaller than this, and refers to supplementary table 10 for the numbers. I checked supplementary table 10, and it says that the “direct-population ratio” is 0.656, not 0.824. So quite possibly the right value is 0.656 even for cognitive performance.
Why is the cognitive performance number bigger? Well, it’s possibly because there’s less data on cognitive performance, so the estimates are based on more obvious or easy-to-find effects. The final predictive power of the direct effects for EA and for cognitive performance is similar, around 3% of the variance, if I’m reading it correctly (not sure about this). So the ratios are somewhat different, but the population GWAS predictive power is also somewhat different in the opposite direction, and these mostly cancel out.
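Spelling out what those ratios imply for the direct share of the PGI’s R^2 (just squaring the numbers quoted above; the PGI R^2 values themselves are in the paper, not here):

```python
for label, ratio in [("EA", 0.556),
                     ("cognitive performance (paper text)", 0.824),
                     ("cognitive performance (supp. table 10)", 0.656)]:
    # share of the PGI's R^2 attributable to direct effects = (direct/population ratio)^2
    print(f"{label}: {ratio**2:.1%} of the PGI R^2 is direct")
# EA: 30.9%, cognitive performance (paper text): 67.9%, cognitive performance (supp. table 10): 43.0%
```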
Your OP is completely misleading if you’re using plain GWAS!
GWAS is an association—that’s what the A stands for. Association is not causation. Anything that correlates with IQ (e.g. melanin) can show up in a GWAS for IQ. You’re gonna end up editing embryos to have lower melanin and claiming their IQ is 150.
Are your IQ gain estimates based on plain GWAS or on family fixed-effects GWAS? You don’t clarify. The latter would give much lower estimates than the former.
And these changes in chickens are mostly NOT the result of new mutations, but rather the result of getting all the big chicken genes into a single chicken.
Is there a citation for this? Or is that just a guess?
Calculating these probabilities is fairly straightforward if you know some theory of generating functions. Here’s how it works.
Let x be a variable representing the probability of a single 6, and let y represent the probability of “even but not 6”. A single string consisting of even numbers can be written like, say, yxyyx, and this expression (which simplifies to x^2*y^3) is the same as the probability of the string. Now let’s find the generating function for all strings you can get in (A). These strings are generated by the following unambiguous regular expression: (y|xy)*xx.
The magical property of generating functions is that if you have an unambiguous regular expression, the corresponding generating function is easy to calculate: concatenation becomes product, union becomes sum, and star becomes the function f -> 1/(1-f). Using this, the generating function for the strings in (A) is

F_A(x, y) = x^2 / (1 - y - xy).
Similarly, the strings possible in (B) have unambiguous regular expression y*xy*x and generating function F_B(x, y) = x^2 / (1 - y)^2.
If you plug in the probabilities x = 1/6 and y = 1/3, the above functions will give you the probability of a string in (A) occurring and of a string in (B) occurring, respectively. But that’s not what we want; we want conditional expectations. To get that, we need the probability of each string to be weighted by its length (then to divide by the overall probability). The length of a string is the number of x and y variables in it—its degree. So we can get the sum of lengths-times-probabilities by scaling x and y by t, taking a derivative with respect to t, and plugging in t = 1. Then we divide by the overall probabilities. So the conditional expectations are

E[length | A] = d/dt F_A(tx, ty) at t = 1, divided by F_A(x, y), and
E[length | B] = d/dt F_B(tx, ty) at t = 1, divided by F_B(x, y).
Now just plug in x = 1/6 and y = 1/3 to get the conditional expectations.
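Here is a short sympy sketch of the whole computation, under my reading of the setup (that (A) means rolling until two 6s in a row and (B) means rolling until the second 6, in both cases conditioning on every roll being even):

```python
import sympy as sp

x, y, t = sp.symbols('x y t', positive=True)

# Generating functions as derived above (my reading of (A) and (B); see the regular expressions)
F_A = x**2 / (1 - y - x*y)     # (A): (y|xy)*xx
F_B = x**2 / (1 - y)**2        # (B): y*xy*x

vals = {x: sp.Rational(1, 6), y: sp.Rational(1, 3)}   # P(roll a 6) = 1/6, P(even but not 6) = 1/3

for name, F in [("A", F_A), ("B", F_B)]:
    weighted = F.subs({x: t * x, y: t * y})            # scale both symbols by t to track string length
    num = sp.diff(weighted, t).subs(t, 1).subs(vals)   # sum over strings of (length * probability)
    den = F.subs(vals)                                  # total probability of the conditioning event
    print(name, sp.simplify(num / den))                 # conditional expected number of rolls
# prints: A 30/11 (about 2.73), B 3
```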
There’s still my original question of where the feedback comes from. You say keep the transcripts where the final answer is correct, but how do you know the final answer? And how do you come up with the question?
What seems to be going on is that these models are actually quite supervised, despite everyone’s insistence on calling them unsupervised RL. The questions and answers appear to be high-quality human annotation instead of being machine generated. Let me know if I’m wrong about this.
If I’m right, it has implications for scaling. You need human annotators to scale, and you need to annotate increasingly hard problems. You don’t get to RL your way to infinite skill like AlphaZero; if, say, the Riemann hypothesis turns out to be like 3 OOMs of difficulty beyond what humans can currently annotate, then this type of training will never solve Riemann no matter how you scale.
I have no opinion about whether formalizing proofs will be a hard problem in 2025, but I think you’re underestimating the difficulty of the task (“math proofs are math proofs” is very much a false statement for today’s LLMs, for example).
In any event, my issue is that formalizing proofs is very clearly not involved in the o1/o3 pipeline, since those models make so many formally incorrect arguments. The people behind FrontierMath have said that o3 solved many of the problems using heuristic algorithms with wrong reasoning behind them; that’s not something a model trained on formally verified proofs would do. I see the same thing with o1, which was evaluated on the Putnam and got the right answer with a wrong proof on nearly every question.
Well, the final answer is easy to evaluate. And like in rStar-Math, you can have a reward model that checks if each step is likely to be critical to a correct answer; then it assigns an implied value to the step.
Why is the final answer easy to evaluate? Let’s say we generate the problem “number of distinct solutions to x^3+y^3+xyz=0 modulo 17^17” or something. How do you know what the right answer is?
I agree that you can do this in a supervised way (a human puts in the right answer). Is that what you mean?
What about if the task is “prove that every integer can be written as the sum of at most 1000 different 11-th powers”? You can check such a proof in Lean, but how do you check it in English?
And like in rStar-Math, you can have a reward model that checks if each step is likely to be critical to a correct answer; then it assigns an implied value to the step.
My question is where the external feedback comes from. “Likely to be critical to a correct answer” according to whom? A model? Because then you don’t get the recursive self-improvement past what that model knows. You need an external source of feedback somewhere in the training loop.
Do you have a sense of where the feedback comes from? For chess or Go, at the end of the day, a game is won or lost. I don’t see how to do this elsewhere except for limited domains like simple programming which can quickly be run to test, or formal math proofs, or essentially tasks in NP (by which I mean that a correct solution can be efficiently verified).
For other tasks, like summarizing a book or even giving an English-language math proof, it is not clear how to detect correctness, and hence not clear how to ensure that a model like o5 doesn’t give a worse output after thinking/searching a long time than the output it would give in its first guess. When doing RL, it is usually very important to have non-gameable reward mechanisms, and I don’t see that in this paradigm.
I don’t even understand how they got from o1 to o3. Maybe a lot of supervised data, i.e. OpenAI internally created some FrontierMath-style problems to train on? Would that be enough? Do you have any thoughts about this?
The value extractable is rent on both the land and the improvement. LVT taxes only the former. E.g. if land can earn $10k/month after an improvement of $1mm, and if interest is 4.5%, and if that improvement is optimal, a 100% LVT is not $10k/month but $10k/month minus $1mm*0.045/12 = $3,750/month. So a 100% LVT would be merely $6,250/month.
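A minimal worked version of that arithmetic (same numbers as above; treating the 4.5% as a simple annual rate divided by 12):

```python
monthly_rent = 10_000            # total monthly rent the improved land can earn
improvement_cost = 1_000_000     # cost of the improvement
annual_interest = 0.045          # cost of capital on the improvement

improvement_return = improvement_cost * annual_interest / 12   # $3,750/month owed to the improvement
land_rent = monthly_rent - improvement_return                  # $6,250/month attributable to the land

print(improvement_return, land_rent)   # 3750.0 6250.0
```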
If your improvement can’t extract $6.3k from the land, preventing you from investing in that improvement is a feature, not a bug.
If you fail to pay the LVT you can presumably sell the improvements. I don’t think there’s an inefficiency here—you shouldn’t invest in improving land if you’re not going to extract enough value from it to pay the LVT, and this is a feature, not a bug (that investment would be inefficient).
LVT applies to all land, but not to the improvements on the land.
We do not care about disincentivizing an investment in land (by which I mean, just buying land). We do care about disincentivizing investments in improvements on the land (by which I include buying the improvement on the land, as well as building such improvements). A signal of LVT intent will not have negative consequences unless it is interpreted as a signal of broader confiscation.
More accurately, it applies to signalling an intent to confiscate other investments; we don’t actually care if people panic about land being confiscated, because buying land (rather than improving it) isn’t productive in any way. (We may also want to partially redistribute resources towards the losers of the land confiscation to compensate for the lost investment—that is, we may want the government to buy the land rather than confiscate it, though it would be bought at lower than market prices.)
It is weird to claim that the perceived consequence of planned incrementalism is “near-future governments want the money now, and will accelerate it”. The actual problem is almost certainly the opposite: near-future governments will want to cut taxes, since cutting taxes is incredibly popular, and will therefore stop or reverse the planned incremental LVT.
Thanks for this post. A few comments:
The concern about new uses of land is real, but very limited compared to the inefficiencies of most other taxes. It is of course true that if the government essentially owns the land to rent it out, the government should pay for the exploration for untapped oil reserves! The government would hire the oil companies to explore. It is also true that the government would do so less efficiently than the private market. But this is small potatoes compared to the inefficiency of nearly every other tax.
It is true that a developer owning multiple parcels of land would have lower incentives to improve any one of them, but this sounds like a very small effect to me, because most developers own a very (very) small part of the city’s land! In any case, the natural remedy here is for the government to subsidize all improvements on land, since improvements have positive externalities. Note that this is the opposite of the current property tax regime in most places (where improving the land makes you pay tax). In fact, replacing property taxes with land value taxes would almost surely incentivize developers to develop, even if they own multiple parcels of land. In other words, your objection already applies to the current world (with property taxes) and arguably applies less to the hypothetical world with land value taxes.
Estimates for the land value proportion of US GDP run significantly higher than the World Bank estimate, from what I understand. Land is a really big deal in the US economy.
“The government has incentives to inflate their estimates of the value of unimproved land” sure, the government always has incentives towards some inefficiencies; this objection applies to all government action. We have to try it and see how bad this is in practice.
The disruption and confidence-in-property-rights effects are potentially real, but mostly apply to sudden, high LVT. Most people’s investments already account for some amount of “regulatory risk”, the risk that the government changes the rules (e.g. with regards to capital gains taxes or property taxes). A move like “replace all property taxes with LVT” would be well within the expected risk. I agree that a sudden near-100% LVT would be too confidence-shaking; but even then, the question is whether people would view this as “government changes rules arbitrarily” or “government is run by competent economists now and changes rules suddenly but in accordance with economic theory”. A bipartisan shift towards economic literacy would lead people towards the latter conclusion, which means less panic about confiscated investments and more preemptive panic about (e.g.) expected Pigouvian taxes (this is a good thing). But a partisan change enacted when one party has majority and undone by the other party would lead people towards the former conclusion (with terrible consequences). Anyway, I am a big supporter of incrementalism and avoiding sudden change.
“The purported effect of an LVT on unproductive land speculation seems exaggerated” yes, I agree, and this always bothered me about LVT proponents.
You should show your calculation or your code, including all the data and parameter choices. Otherwise I can’t evaluate this.
I assume you’re picking parameters to exaggerate the effects, because just from the corrections you’ve already conceded (that 0.9/0.6 shouldn’t be squared and that the attenuation to get direct effects should be 0.824), you’ve already exaggerated the results by a factor of sqrt(0.9/0.6)/0.824 for editing, which is around a 50% overestimate.
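Just to make the 50% figure explicit (nothing beyond the numbers already mentioned):

```python
overestimate = (0.9 / 0.6) ** 0.5 / 0.824   # sqrt of the un-squared reliability ratio, divided by the direct-effect attenuation
print(overestimate)                         # about 1.49, i.e. roughly a 50% overestimate for editing
```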
I don’t think that was deliberate on your part, but I think wishful thinking and the desire to paint a compelling story (and get funding) is causing you to be biased in what you adjust for and in which mistakes you catch. It’s natural in your position to scrutinize low estimates but not high ones. So to trust your numbers I’d need to understand how you got them.