DeepSeek beats o1-preview on math, ties on coding; will release weights
DeepSeek-R1-Lite-Preview was announced today. It’s available via chatbot. (Post; translation of Chinese post.)
DeepSeek says it will release the weights and publish a report.
The model appears to be stronger than o1-preview on math, similar on coding, and weaker on other tasks.
DeepSeek is Chinese. I’m not really familiar with the company. I thought Chinese companies were at least a year behind the frontier; now I don’t know what to think and hope people do more evals and play with this model. Chinese companies tend to game benchmarks more than the frontier Western companies, but I think DeepSeek hasn’t gamed benchmarks much historically.
The post also shows inference-time scaling, like o1:
Note that o1 is substantially stronger than o1-preview; see the o1 post:
(Parts of this post and some of my comments are stolen from various people-who-are-not-me.)
I experimented a bunch with DeepSeek today; in my experiments it seems to be at exactly the same level as o1-preview in high school competition math. So I don’t think it’s benchmark-gaming, at least in math. On the other hand, it’s noticeably worse than even the original GPT-4 at understanding a short story I always test models on.
I think it’s also very noteworthy that DeepSeek gives everyone 50 free messages a day (!) with their CoT model, while OpenAI only gives subscribers 30 o1-preview messages a week (50 a day is 350 a week, roughly a 12x difference in quota). I assume they figured out how to run it much more cheaply, but I’m confused in general.
A positive part of the news is that, unlike o1, they show the model’s actual chain of thought, and they promise to make their model open-source soon. I think this is really great for the science of studying faithful chain of thought.
From the experiments I have run, it looks like it is doing clear, interpretable English chain of thought (though with an occasional Chinese character), and it doesn’t yet seem to have started evolving into optimized alien gibberish. I think this part of the news is a positive update.
Yeah, I really hope they do actually release the weights, because the science of faithful CoT would benefit greatly.
The Chinese characters sound potentially worrying. Do they make sense in context? I tried a few questions but didn’t see any myself.
I saw them in 10-20% of the reasoning chains. I mostly played around with situational-awareness-flavored questions, so I don’t know whether the Chinese characters are more or less frequent in the longer reasoning chains produced for difficult reasoning problems. (A quick way to quantify this is sketched after the examples below.) Here are some examples:
The translation of the Chinese words here (according to GPT) is “admitting to being an AI.”
This is the longest string in Chinese that I got. The English translation is “It’s like when you see a realistic AI robot that looks very much like a human, but you understand that it’s just a machine controlled by a program.”
The translation here is “mistakenly think.”
Here, the translation is “functional scope.”
So it seems like all of them are pretty direct translations of the English words that should be in their place, which is good news. It’s also reassuring to me that none of the reasoning chains contained sentences or paragraphs that looked out of place or completely unrelated to the rest of the response.
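For what it’s worth, here is a minimal sketch of how one could compute that 10-20% figure instead of eyeballing it, assuming the reasoning chains have been saved as plain strings (the regex and function name are illustrative, not anything DeepSeek provides). It flags chains containing at least one character from the CJK Unified Ideographs block:

```python
import re

# CJK Unified Ideographs block; a rough proxy for "contains Chinese text".
CJK = re.compile(r"[\u4e00-\u9fff]")

def fraction_with_chinese(chains: list[str]) -> float:
    """Fraction of reasoning chains containing at least one CJK character."""
    if not chains:
        return 0.0
    return sum(1 for chain in chains if CJK.search(chain)) / len(chains)

# Example: two of these three chains contain Chinese characters.
print(fraction_with_chinese(["pure English", "承认是 AI", "功能范围 scope"]))  # ~0.667
```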
I think it only came up once for a friend. I translated it and it makes sense; it just replaces the appropriate English verb with a Chinese one in the middle of a sentence. (I note that this often happens to me too when I talk with my friends in Hungarian: I’m sometimes more used to the English phrase for something, and say one word in English in the middle of the sentence.)
Claim by SemiAnalysis guy: DeepSeek has over 50,000 H100s. (And presumably they don’t spend much compute on inference.)
It’s predictably censored on CCP-sensitive topics.
(In a different chat.) After the second question, it typed two lines (something like “There have been several attempts to compare Winnie the Pooh to a public individual...”) and then overwrote them with “Sorry...”.
Breakable with some light obfuscation (the misspelling is essential here, as otherwise a circuit breaker will kick in):
One weird detail I noticed is that in DeepSeek’s results, they claim GPT-4o’s pass@1 accuracy on MATH is 76.6%, while OpenAI’s o1 blog post claims it’s 60.3%. This is quite confusing, as it’s a large difference that seems hard to explain by different training checkpoints of 4o.
It seems that 76.6% originally came from the GPT-4o announcement blog post. I’m not sure why it dropped to 60.3% by the time of o1’s blog post.
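For context on what’s being compared: pass@1 is typically estimated by sampling n completions per problem, counting the c that are correct, and applying the unbiased estimator from the HumanEval paper (Chen et al., 2021); at k=1 it reduces to c/n, so the reported figure is sensitive to sampling temperature and prompting, which could account for part of the gap. I don’t know which setup either lab used; the sketch below is the standard estimator, not either lab’s exact harness.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled per problem
    c: number of those completions that were correct
    k: evaluation budget
    """
    if n - c < k:
        return 1.0  # every possible draw of k samples contains a correct one
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# pass@1 reduces to c/n: e.g. 12 correct out of 20 samples -> 0.6
print(pass_at_k(n=20, c=12, k=1))  # 0.6
```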
Some context: interview translation; June FT article on DeepSeek; parent hedge fund Wikipedia page.
FYI, if you want o1-like reasoning, you need to check off “Deep Think”.