Nikola Jurkovic comments on DeepSeek beats o1-preview on math, ties on coding; will release weights

Nikola Jurkovic 21 Nov 2024 0:29 UTC
8 points
2
One weird detail I noticed is that in DeepSeek’s results, they claim GPT-4o’s pass@1 accuracy on MATH is 76.6%, but OpenAI claims it’s 60.3% in their o1 blog post. This is quite confusing as it’s a large difference that seems hard to explain with different training checkpoints of 4o.
- Lech Mazur 21 Nov 2024 5:14 UTC
  11 points
  1
  It seems that 76.6% originally came from the GPT-4o announcement blog post. I’m not sure why it dropped to 60.3% by the time of o1’s blog post.
- RogerDearnaley 24 Nov 2024 21:43 UTC
  4 points
  0
  There have been a number of papers published over the last year on how to do this kind of training, and for roughly a year now there have been rumors that OpenAI were working on it. If converting that into a working version is possible for a Chinese company like DeepSeek, as it appears to be, then why haven’t Anthropic and Google released versions yet? There doesn’t seem to be any realistic possibility that DeepSeek actually have more compute or better researchers than both Anthropic and Google.
  One possible interpretation would be that this has significant safety implications, and Anthropic and Google are both still working through these before releasing.
  Another possibility would be that Anthropic has in fact released one, in the sense that their Claude models’ recent advances in agentic behavior (while not using inference-time scaling) are distilled from reasoning traces generated by an internal-only model of this type that does use inference-time scaling.