I experimented a bunch with DeepSeek today, and in my tests it seems to be exactly on the same level as o1-preview in high school competition math. So I don’t think it’s benchmark-gaming, at least in math. On the other hand, it’s noticeably worse than even the original GPT-4 at understanding a short story that I always test models on.
I think it’s also very noteworthy that DeepSeek gives everyone 50 free messages a day (!) with their CoT model, while OpenAI only gives 30 o1-preview messages a week to subscribers. I assume they figured out how to run it much cheaper, but I’m confused in general.
A positive part of the news is that unlike o1, they show their actual chain of thought, and they promise to make their model open-source soon. I think this is really great for the science of studying faithful chain of thought.
From the experiments I have run, it looks like it is doing clear, interpretable English chain of thought (though with the occasional Chinese character), and I don’t think it has started evolving into optimized alien gibberish yet. I think this part of the news is a positive update.
Yeah, I really hope they do actually open-weights it because the science of faithful CoT would benefit greatly.