QwQ-32B-Preview was released open-weights and seems comparable to o1-preview. Unless they’re gaming the benchmarks, I find it both pretty impressive and quite shocking that a 32B model can achieve this level of performance. Seems like great news vs. opaque (e.g., in-one-forward-pass) reasoning. Less good with respect to proliferation (there don’t seem to be any [deep] algorithmic secrets), misuse, and short timelines.
From a proliferation perspective, it reduces overhang and makes it more likely that Llama 4 gets long-reasoning-trace post-training in-house rather than later, so that initial capability evaluations give more relevant results. But if Llama 4 is already training, there might not be enough time for the technique to mature, and the Llamas have been quite conservative in their techniques so far.
There have been comments from OAI staff that o1 is “GPT-2 level”, so I wonder if it’s a similar size?
I think they meant that as an analogy to how developed/sophisticated it was (i.e., they’re saying that it’s still early days for reasoning models and to expect rapid improvement), not that the underlying model size is similar.
IIRC OAers also said somewhere (doesn’t seem to be in the blog post, so maybe this was on Twitter?) that o1 or o1-preview was initialized from a GPT-4 (a GPT-4o?), so that would also rule out a literal parameter-size interpretation (unless OA has really brewed up some small models).
There was an article about it before the release.
https://archive.is/IwKSP

At the same meeting, company leadership gave a demonstration of a research project involving its GPT-4 AI model that OpenAI thinks shows some new skills that rise to human-like reasoning, according to a person familiar with the discussion who asked not to be identified because they were not authorized to speak to press.
(Relevant, although “involving its GPT-4 AI model” is a considerably weaker statement than ‘initialized from a GPT-4 checkpoint’.)