One weird detail I noticed is that in DeepSeek’s results, they claim GPT-4o’s pass@1 accuracy on MATH is 76.6%, but OpenAI claims it’s 60.3% in their o1 blog post. This is quite confusing as it’s a large difference that seems hard to explain with different training checkpoints of 4o.
There had been a number of papers published over the last year on how to do this kind of training, and for roughly a year now there have been rumors that OpenAI were working on it. If converting that into a working version is possible for a Chinese company like DeepSeek, as it appears, then why haven’t Anthropic and Google released versions yet? There doesn’t seem to be any realistic possibility that DeepSeek actually have more compute or better researchers than both Anthropic and Google.
One possible interpretation would be that this has significant safety implications, and Anthropic and Google are both still working through these before releasing.
Another possibility would be that Anthropic has in fact released, in the sense that their Claude models’ recent advances in agentic behavior (while not using inference-time scaling) are distilled from reasoning traces generated by an internal-only model of this type that is using inference-time scaling.
One weird detail I noticed is that in DeepSeek’s results, they claim GPT-4o’s pass@1 accuracy on MATH is 76.6%, but OpenAI claims it’s 60.3% in their o1 blog post. This is quite confusing as it’s a large difference that seems hard to explain with different training checkpoints of 4o.
It seems that 76.6% originally came from the GPT-4o announcement blog post. I’m not sure why it dropped to 60.3% by the time of o1′s blog post.
There had been a number of papers published over the last year on how to do this kind of training, and for roughly a year now there have been rumors that OpenAI were working on it. If converting that into a working version is possible for a Chinese company like DeepSeek, as it appears, then why haven’t Anthropic and Google released versions yet? There doesn’t seem to be any realistic possibility that DeepSeek actually have more compute or better researchers than both Anthropic and Google.
One possible interpretation would be that this has significant safety implications, and Anthropic and Google are both still working through these before releasing.
Another possibility would be that Anthropic has in fact released, in the sense that their Claude models’ recent advances in agentic behavior (while not using inference-time scaling) are distilled from reasoning traces generated by an internal-only model of this type that is using inference-time scaling.