DeepSeek is Chinese. I’m not really familiar with the company.
DeepSeek is the best Chinese DL research group now and has been for at least a year. If you are interested in the topic, you ought to learn more about them.
I thought Chinese companies were at least a year behind the frontier.
This seems roughly consistent with what you would expect. People usually say half a year to a year behind. Q* was invented somewhere in summer 2023, according to the OA coup reporting; ballpark June–July 2023, I got the impression, since it seemed to already be a topic of discussion with the Board c. August 2023, pre-coup. Thus, we are now (~20 Nov 2024) almost at December 2024, about a year and a half later. o1-preview was announced 12 September 2024, roughly 70 days ago, and o1-preview’s benchmarks were much worse than those of the true o1, which was still training then (and of course, OA has kept improving it ever since, even if we don’t know how; remember, time is always passing†, and what you read in a blog post may already be ancient history). Open-source and competitor models (not just Chinese or DeepSeek specifically) have a long history of disappointing in practice, turning out to be much narrower, overfit to benchmarks, or otherwise lacking the quality & polish of the GPT-4s or Claude-3s.
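To make the gap concrete, here is a quick sanity check of the date arithmetic (a minimal Python sketch; ‘now’ is the ~20 Nov 2024 above, and the Q* date is an assumed mid-point of the June–July 2023 ballpark):

```python
from datetime import date

# Assumed reference date: the ~20 Nov 2024 mentioned above.
now = date(2024, 11, 20)

# o1-preview announcement (12 September 2024):
print((now - date(2024, 9, 12)).days)          # 69 days, i.e. roughly 70

# Q* invention, taking 1 July 2023 as the assumed mid-point of the June-July ballpark:
print((now - date(2023, 7, 1)).days / 365.25)  # ~1.39 years, 'about a year and a half' by December
```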
So, if a competing model claims to match o1-preview from almost 3 months ago (which itself is far behind o1), with additional penalties to compensate for the hype and the apples-to-oranges comparisons, and where we still don’t know if they are actually the same algorithm at core (inasmuch as neither OA nor DeepSeek, AFAIK, has yet published any kind of detailed description of what Q*/Strawberry/r1 is), and, worst-case, possibly as much as >1.5 years behind if DS has gone down a dead end & has to restart… This point about time applies to any other Chinese replication as well, modulo details (like possibly suggesting that DeepSeek is not so good, etc.).
Overall, this still seems roughly what you would expect now: ‘half a year to a year behind’. It’s always a lot easier to catch up with an idea after someone else has proven it works and given you an awful lot of hints about how it probably works, like the raw sample transcripts. (I particularly note the linguistic tics in this DS version too, which I take as evidence for my inner-monologue splicing guess of how the Q* algorithm works.)
† I feel very silly pointing this out: time keeps passing, and if you think that some new result is startling evidence against the stylized fact “Chinese DL is 6–12 months behind”, then you should probably start by, well, comparing the new result to the best Western DL result of 6–12 months ago! Every time you hear about a new frontier-pushing Western DL result, you should mentally expect a Chinese partial replication in 6–12 months, and around then, start looking for it.

This should be too obvious to even mention. And yet, I constantly get the feeling that people have been losing their sort of… “temporal numeracy”, for lack of a better phrase. That they live in a ‘Big Now’ where everything that has happened is squashed together. In the same way that in politics/economics, people will talk about the 1980s or 1990s as if all of that was just a decade ago instead of almost half a century ago (yes, really: 2024 − 1980 = 44), many people discussing AI seem to have strangely skewed mental timelines.

They talk like GPT-4 came out, like, a few months after GPT-3 did, maybe? (It was almost 3 years later.) So GPT-5 is wildly overdue! That if a Chinese video model matches OA Sora tomorrow, well, Sora was announced like, a month ago, something like that? So they’ve practically caught up! OA only just announced o1, and DeepSeek has already matched it! Or like 2027 is almost already here and they’re buying plane tickets for after Christmas. There’s been a few months without big news? The DL scaling story is over for good and it hit the wall!
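To spell out the footnote’s heuristic as arithmetic (a toy sketch; `replication_window` is a hypothetical helper, the announcement dates are the public ones, and the 6–12 month lag is just the stylized fact above):

```python
from datetime import date, timedelta

def replication_window(announced: date, lag_months=(6, 12)):
    """Given a Western frontier result's announcement date, return the
    window in which the stylized fact says to start looking for a
    Chinese partial replication (approximating a month as 30 days)."""
    return tuple(announced + timedelta(days=30 * m) for m in lag_months)

# OA Sora, announced 15 February 2024:
print(replication_window(date(2024, 2, 15)))  # ~mid-Aug 2024 to ~early-Feb 2025
# o1-preview, announced 12 September 2024:
print(replication_window(date(2024, 9, 12)))  # ~mid-Mar 2025 to ~early-Sep 2025
```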