gwern
Who right now is standing on the sidelines with a killer AI app that could rip up the market if only tokens were a bit cheaper?
OpenAI’s Deep Research is looking like something that could be big and they were standing on the sidelines in part because the tokens weren’t cheap.
Most people do not read many books or spend time in spaces where SAT vocab words would be used at all. If that was the sole determinant, you would then expect any vocab test to fail catastrophically and not predict/discriminate in most of the population (which would have downstream consequences like making SATs weirdly unreliable outside the elite colleges or much less predictive validity for low-performing demographics, the former of which I am unaware of being true and the latter of which I know is false); this would further have the surprising consequence that if a vocab test is, say, r = 0.5 with g while failing catastrophically on most of the population, it would have to be essentially perfectly correlated r = 1 in the remainder to even be arithmetically possible, which just punts the question: how did two book-readers come away from that book with non-overlapping vocabs...?
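To make the arithmetic vivid, here is a toy simulation (mine, not anything from the thread; it assumes both groups share the same g distribution and unit-variance scores) of how much overall validity a vocab test could retain if it really were pure noise for most people:

```python
# Toy check of the arithmetic: if a vocab test were pure noise for a fraction
# (1 - p) of the population, its overall correlation with g tops out at
# roughly p * r_reader (assuming both groups share the same g distribution).
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
g = rng.standard_normal(n)

for p in (0.2, 0.5):                 # fraction of people the test 'works' for
    works = rng.random(n) < p
    for r_reader in (0.5, 1.0):      # within-group correlation with g
        noise = rng.standard_normal(n)
        vocab = np.where(works,
                         r_reader * g + np.sqrt(1 - r_reader**2) * noise,
                         noise)      # everyone else: pure noise
        print(p, r_reader, round(np.corrcoef(vocab, g)[0, 1], 3))
# Even if the test were *perfect* for half the population, the overall r would
# only just reach 0.5; for any larger 'noise' fraction, r = 0.5 is
# arithmetically out of reach.
```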
I have good vocabulary, e.g. 800 on GRE verbal, but feel like I have a pretty bad memory for words and terms that I’ve only seen a few times.
How could you possibly know something like that?
One benefit of his ‘no-nut January’ is that by cutting out peanuts entirely, he’s also avoiding problems from oxalates. I would expect powdered peanut butter to be as dangerous in that regard.
And yet, despite the SAT being so studied for, it remains a pretty good IQ test overall, and the SAT-V or GRE verbal sections decent ones. I think that’s because there are so many words (500k+ in English, and the GRE-V has no compunction about mining the obscurest just to f— with you), and you would have to study so many in order to meaningfully inflate your scores (because after all, while there may be only a hundred ‘vocab words’ on any given SAT test, you don’t know which hundred). Let’s see… Here’s an interesting-looking reference: “How Many Words Do We Know? Practical Estimates of Vocabulary Size Dependent on Word Definition, the Degree of Language Input and the Participant’s Age”, Brysbaert et al 2016
an average 20-year-old native speaker of American English knows 42,000 lemmas and 4,200 non-transparent multiword expressions, derived from 11,100 word families. The numbers range from 27,000 lemmas for the lowest 5% to 52,000 for the highest 5%. Between the ages of 20 and 60, the average person learns 6,000 extra lemmas or about one new lemma every 2 days.
So, if you wanted to boost your score from the mean to the 95th percentile, that seems to imply that you’d have to memorize 10,000 ‘lemmas’ (“Uninflected word from which all inflected words are derived”). That’s a big number, and then you have to ask how much work that would be.
If you did this in the optimal way with spaced repetition (ignoring the time it takes to figure out the 10k you want to memorize in the first place, or the time to construct the flashcards, or any penalty from needing to inefficiently cram them for an upcoming SAT instead of life-long efficient review), which of course still few students do, as spaced repetition systems remain a niche outside of medical school & foreign language study, the SuperMemo rough estimate is a long-term investment of 5 minutes per flashcard, and we’ll assume 1 lemma = 1 flashcard. That means you have to invest 10,000 * 5 = 50,000 minutes, or 833 hours of studying! Meanwhile, hardly anyone is doing more than 8 hours of studying for the SAT as a whole (among the kids I knew at a prep high school, many didn’t even do a weekend course, which would entail about 8 hours of classwork & study). 833 hours for vocab alone would be insane.
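Spelled out, that back-of-the-envelope is just the arithmetic already in the text (the 5 minutes/card being the SuperMemo rule of thumb):

```python
# Cost of brute-forcing a 95th-percentile vocabulary via flashcards:
lemmas_needed    = 52_000 - 42_000   # 95th percentile vs. average 20-year-old (Brysbaert et al 2016)
minutes_per_card = 5                 # SuperMemo long-term rule of thumb; 1 lemma = 1 card
total_minutes    = lemmas_needed * minutes_per_card
print(total_minutes, "minutes =", round(total_minutes / 60), "hours")
# -> 50000 minutes = ~833 hours, versus <10 hours of typical whole-SAT prep.
```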
That’s why people generally learn vocab from passive exposure rather than targeted study. Because no one, not even the most teacher’s-pet student, wants to do that. And so vocab measures keep working.
then I think it is also very questionable whether the AI that wins wars is the most “advanced” AI. / People like Dario, whose bread-and-butter is model performance, invariably over-index on model performance, especially on benchmarks. But practical value comes from things besides the model: what tasks you use it for and how effective you are at deploying it.
Dario is about the last AI CEO you should be making this criticism of. Claude has been notable for a while for the model which somehow winds up being the most useful and having the best ‘vibes’, even when the benchmarks indicate it’s #2 or #3; and meanwhile, it is the Chinese models which historically regress the most from their benchmarks when applied (and DeepSeek models, while not as bad as the rest, still do this and r1 is already looking shakier as people try out heldout problems or benchmarks).
Only if you ignore that yesterday was also when the Trump GPU tariffs were leaking and would, pace event-studies, be expected to be changing prices too.
It’s not RL, but what is RL any more? It’s becoming blurry. They don’t reward or punish it for anything in the thought token. So it learns thoughts that are helpful in outputting the correct answer.
That’s definitely RL (and what I was explaining was simply the obvious basic approach anyone in DRL would think of in this context, and so of course there is research trying things like it). It’s being rewarded for a non-differentiable global loss where the correct alternative or answer or label is not provided (not even information of the existence of a better decision), and so standard supervised learning is impossible, requiring exploration. Conceptually, this is little different from, say, training a humanoid robot NN to reach a distant point in fewer actions: it can be a hard exploration problem (most sequences of joint torques or actions simply result in a robot having a seizure while lying on the ground going nowhere), where you want to eventually reach the minimal sequence (to minimize energy / wear-and-tear / time), and you start by solving the problem in any way possible, rewarding solely on the final success, and then reward-shape into a desirable answer, which in effect breaks up the hard original problem into two more feasible problems in a curriculum: ‘reach the target ever’, followed by ‘improve a target-reaching sequence of actions to be shorter’.
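As a minimal sketch of that two-stage setup (purely illustrative; the function and schedule names are made up, not anyone’s actual training code):

```python
# Sketch of 'solve it at all first, then solve it well' for a sparse-reward task.
def reward(num_actions: int, reached_target: bool, shaping_weight: float) -> float:
    """Terminal-only reward: nothing until the agent succeeds at all; once
    successes exist, a small per-action penalty nudges it toward shorter
    (cheaper, faster) solutions without making exploration unprofitable."""
    if not reached_target:
        return 0.0                              # phase 1: 'reach the target ever'
    return 1.0 - shaping_weight * num_actions   # phase 2: prefer shorter successes

def shaping_schedule(epoch: int, start: int = 100, step: float = 1e-4, cap: float = 0.01) -> float:
    """Ramp the shaping penalty up only after the bootstrap phase has converged."""
    return min(cap, max(0, epoch - start) * step)
```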
While we’re at it, one example I learned afterwards was that the ‘caribou randomization’ story is probably bogus (excerpts):
We will show that hunters do not randomize their behavior, that caribou populations do not fluctuate according to human predation, and that scapulimancy apparently is not selected because it is ecologically advantageous. We shall also show that there is no cross-cultural evidence of divinatory random devices producing randomized subsistence behavior, but rather that people manipulate divination with the explicit or implicit intervention of personal choice.
What is particularly interesting to me is that the apparent beautiful match of this traditional hunting practice with contemporary game theory may be ‘too good to be true’ because it was actually the opposite: I suspect that the story was made up to launder (secret) game-theoretic work from WWII into academic writing; the original author’s career & funder are exactly where that sort of submarine-warfare operations-research idea would come from… (There were many cases post-WWII of civilians carefully laundering war or classified work into publishable form, which means that any history-of-ideas has to be cautious about taking at face value anything published 1940–1960 which looks even a little bit like cryptography, chemistry, physics, statistics, computer science, game theory, or operations research.)
Outputs of o1 don’t include reasoning traces, so not particularly useful compared to outputs of chatbot models, and very expensive, so only a modest amount can be collected.
It would be more precise to say outputs of o1 aren’t supposed to include the reasoning traces. But in addition to the reasoning traces OA voluntarily released, people have been observing what seem to be leaks, and given that the history of LLM robustness to jailbreaks can be summarized as ‘nil’, it is at least conceivable that someone used a jailbreak+API to exfiltrate a bunch of traces. (Remember that Chinese companies like ByteDance have definitely been willfully abusing the OA API for the purposes of knowledge distillation/cloning and evading bans etc, in addition to a history of extremely cutthroat tactics that FANG would blanch at, so it’s a priori entirely plausible that they would do such things.)
I don’t believe DeepSeek has done so, but it is technically possible. (Regardless of whether anyone has done so, it is now partially moot given that r1 traces in the DS paper, and based on third party reports thus far, work so well for distillation so everyone can kickstart their own r1-clone with r1 reasoning traces and work from there. There may be more reason to try to exfiltrate o3+ traces, but OA may also decide to not bother, as users are claiming to value and/or enjoy reading the raw traces, and since the secret & capability is out, maybe there’s not much point in hiding them any longer.)
There is also GreaterWrong, which I believe caches everything rather than passing through live, so it would be able to restore almost all publicly-visible content, in theory.
Right now, it seems to be important to not restrict the transcripts at all. This is a hard exploration problem, where most of the answers are useless, and it takes a lot of time for correct answers to finally emerge. Given that, you need to keep the criteria as relaxed as possible, as they are already on the verge of impossibility.
The r1 paper, the other guys, and OAers on Twitter too now seem to emphasize that the obvious appealing approach of rewarding tokens for predicted correctness, or doing search on tokens, just doesn’t work (right now). You need to ‘let the LLMs yap’ until they reach the final correct answer. This appears to be the reason for the bizarre non sequiturs or multi-lingual diversions in transcripts—that’s just the cost of rolling out solution attempts which can go anywhere and keeping the winners. They will do all sorts of things which are unnecessary (and conversely, omit tokens which are ‘necessary’). Think of it as the equivalent of how DRL agents will ‘jitter’ and take many unnecessary actions, because those actions don’t change the final reward more than epsilon, and the RL feedback just isn’t rich enough to say ‘you don’t need to bounce up and down randomly while waiting for the ball to bounce back, that doesn’t actually help or hurt you’ (and if you try to reward-shape away those wasteful movements, you may discover your DRL agent converges to a local optimum where it doesn’t do anything, ever, because the jitters served to explore the environment and find new tricks, and you made it too expensive to try useless-seeming tricks, so it never found any payoffs or laddered its way up in capabilities).
So you wouldn’t want to impose constraints like ‘must be 100% correct valid Lean proof’. Because it is hard enough to find a ‘correct’ transcript even when you don’t penalize it for spending a while yapping in Japanese or pseudo-skipping easy steps by not writing them down. If you imposed constraints like that, instead of rolling out 1000 episodes and getting 1 useful transcript and the bootstrap working, you’d get 0 useful transcripts and it’d go nowhere.
What you might do is impose a curriculum: solve it any way you can at first, then solve it the right way. Once you have your o1 bootstrap working and have seen large capability gains, you can go back and retrain on the easiest problems with stricter criteria, and work your way back up through the capability levels, but now in some superior way. (In the DRL agent context, you might train to convergence and only then impose a very, very small penalty on each movement, and gradually ramp it up until the performance degrades a little bit but it’s no longer jittering.) The same way you might be taught something informally, and then only much later, after you’ve worked with it a lot, do you go back and learn or prove it rigorously. You might impose a progressive shrinking constraint, for example, where the transcript has to be fewer tokens each time, in order to distill the knowledge into the forward passes to make it vastly cheaper to run (even cheaper, for hard problems, than simply training a small dumb model on the transcripts). You might try to iron out the irrelevancies and digressions by having a judge/critic LLM delete irrelevant parts. You might try to eliminate steganography by rewriting the entire transcript using a different model. Or you might simply prompt it to write a proof in Lean, and score it by whether the final answer validates.
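For instance, the ‘progressive shrinking’ constraint could be as simple as a correctness-plus-budget filter on which rollouts get kept for the next round of training (a hypothetical sketch; the numbers are arbitrary):

```python
# Hypothetical filter for a 'progressively shrinking transcript' curriculum:
# keep a rollout for further training only if it is correct *and* fits within
# a token budget that shrinks geometrically each round, forcing the reasoning
# to be distilled into fewer and fewer forward passes.
def keep_for_next_round(transcript_tokens: int, correct: bool, round_idx: int,
                        start_budget: int = 8_192, shrink: float = 0.9) -> bool:
    budget = int(start_budget * shrink ** round_idx)
    return correct and transcript_tokens <= budget
```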
Fernando Borretti has a good 2022 post, “Unbundling Tools for Thought”, which I don’t think I saw before, but which makes some of these points at greater length and which I largely agree with.
Holden was previously Open Philanthropy’s CEO and is now settling into his new role at Anthropic.
Wait, what? When did Holden Karnofsky go to Anthropic? Even his website doesn’t mention that and still says he’s at Carnegie.
The shape of your face, and much else besides, will be affected by random chance and environmental influences during the process of development and growth.
The shape of your face will not be affected much by random chance and environmental influences. See: identical twins (including adopted apart).
There are other, more interesting and important ways to use that compute capacity. Nobody sane, human or alien, is going to waste it on running a crapton of simulations.
Counterpoint: speedrunning and things like ‘Twitch plays’, which are some of the most popular streaming genres in existence, and exist largely because they are unimportant. A TAS speedrunner may well run millions or billions of simulations simply to try to shave off 1s from the record. (An example I like to cite uses 6 CPU-years to bruteforce NES Arkanoid to achieve nearly optimal play. Unfortunately, he doesn’t provide the wallclock equivalent, but I strongly suspect that this project alone simulates more minutes of NES Arkanoid than it was ever played by humans. If not, then I’m quite sure at this point that NES Mario has been played in silico OOMs more than by humans. Plenty of projects like ‘My First NEAT project’ will do a few years or centuries of NES Mario.)
Why do you think that? Softbank, MS, Oracle, OpenAI etc are not governments, and the press release is not claiming to take any government money. Not to mention, this was to a considerable extent announced a year ago.
An important update: “Stargate” (blog) is now officially public, confirming earlier $100b numbers and some loose talk about ‘up to $500b’ being spent. Noam Brown commentary:
@OpenAI excels at placing big bets on ambitious research directions driven by strong conviction.
This is on the scale of the Apollo Program and Manhattan Project when measured as a fraction of GDP. This kind of investment only happens when the science is carefully vetted and people believe it will succeed and be completely transformative. I agree it’s the right time.
...I don’t think that’s the correct interpretation. DeepSeek shows you can get very powerful AI models with relatively little compute. But I have no doubt that with even more compute it would be an even more powerful model.
If r1 being comparable to o1 surprised you, your mistake was forgetting the 1 part. This is the early stage of a new paradigm, and SOTA is the cheapest it will ever be.
That does NOT mean compute doesn’t matter. (I’ve said roughly this before, but it bears repeating)
...Don’t get me wrong, DeepSeek is nothing to sneeze at.
They will almost certainly get much more compute than they have now. But so will OpenAI...
And if DeepSeek keeps up via compute, that does not invalidate the original point re: compute being key.
(This is an example of why I don’t expect DeepSeek to leapfrog OA/A/G/FB/xAI/SSI/et al: DS does great work, but $500b is a lot of money, and their capital disadvantage may be, if anything, bigger when you move from a raw parameter/data-scaling regime to an inference/search scaling regime. 6 million dollar training budgets aren’t cool. You know what’s cool? 6 million GPU training budgets...)
EDIT: the lead author, Daya Guo, of the r1 paper reportedly tweeted (before deleting):
The last work in 2024, nothing can stop us on the path to AGI, except for computational resources.
I’m sure it would be less flattering to me than my version, because people never remember these sorts of conversations the same way. If you think that it might not have happened like that, then just treat it as a hypothetical discussion that could have happened and ponder how contemporary Western lower-education systems can make truly transformative, rather than minor tinkering around the edges, use of AGI which preserves all existing compensation/status/prestige/job/political arrangements and which the teachers’ unions and pension plans would not be implacably opposed to.
It’s a good thing to think about if you are trying to gauge what sort of economic or societal changes might happen over the next decade, especially if you are trying to use that as a proxy for ‘is AGI real’, as so many people are. Personally, my conclusion has long been that the economy & society are so rigid that most such arrangements will remain largely intact even if they are dead men walking, and the pace of AI progress is so rapid that you should basically ignore any argument of the form ‘but we still have human teachers, therefore, AGI can’t be real’.
My point there is that he was talking to the reasoning team pre-hiring (forget ‘onboarding’, who knows what that means), so they would be unable to tell him most things—including if they have a better reason than ‘faith in divine benevolence’ to think that ‘more RL does fix it’.
But does that necessarily matter? Many of those models can’t use tools; and since much of the point of the end-to-end RL training of Deep Research is to teach tool use, showing DR results without tool use would be either irrelevant or misleading (eg. it might do worse than the original o3 model it is trained from, when deprived of the tools it is supposed to use).