I've been wondering what O3 and O4-mini actually are. Here's my guess at the test-time-scaling loop and how OpenAI names their models:
O0 (Base model)
↓
D1 (Outputs/labels generated with extended compute: search/reasoning/verification)
↓
O1 (Model trained on higher-quality D1 outputs)
↓
O1-mini (Distilled version - smaller, faster)
↓
D2 (Outputs/labels generated with extended compute: search/reasoning/verification)
↓
O2 (Model trained on higher-quality D2 outputs)
↓
O2-mini (Distilled version - smaller, faster)
↓
...
The point is to consistently apply additional compute at generation time to create better training data for each subsequent iteration, so the models cycle from large -(distill)-> small -(search/verification)-> large. A rough sketch of that loop is below.
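To make the guess concrete, here's a minimal toy sketch of that loop in Python. Everything in it is a hypothetical stand-in: `generate`, `verify`, `finetune`, and `distill` are stubs for an LLM, a verifier/search procedure, and real training jobs, not OpenAI's actual pipeline.

```python
import random

# Toy stand-ins; in practice these would be an LLM, a verifier / search procedure,
# and real fine-tuning and distillation jobs. All names here are hypothetical.

def generate(model, prompt):
    """Sample one candidate answer from the current model (stubbed)."""
    return f"{model['name']} answer to {prompt!r} (r={random.random():.3f})"

def verify(prompt, answer):
    """Score a candidate with extra compute (search / verification); stubbed as random."""
    return random.random()

def best_of_n(model, prompt, n=8):
    """Spend extra inference compute: sample n candidates, keep the best-scoring one."""
    candidates = [generate(model, prompt) for _ in range(n)]
    return max(candidates, key=lambda a: verify(prompt, a))

def finetune(base_model, dataset, name):
    """Stub for training the next large model on the curated dataset D_i."""
    return {"name": name, "data_size": len(dataset)}

def distill(teacher, name):
    """Stub for distilling the large model into a smaller, faster one."""
    return {"name": name, "teacher": teacher["name"]}

prompts = ["prompt A", "prompt B", "prompt C"]
model = {"name": "O0-base"}

# Each iteration: extra compute at generation time -> better labels -> better model.
for i in range(1, 4):
    d_i = [(p, best_of_n(model, p, n=8)) for p in prompts]   # D_i
    model = finetune(model, d_i, name=f"O{i}")                # O_i
    mini = distill(model, name=f"O{i}-mini")                  # O_i-mini
    print(model, mini)
```

The interesting design choice (if this guess is right) is that the expensive search/verification only happens offline, to build D_i; the deployed "-mini" model just imitates those higher-quality outputs.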
Has anyone managed to replicate COCONUT? I've been trying to experiment with adding explainability through sparse linear bottlenecks, but as far as I can tell, no one has replicated it.
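For anyone who hasn't read the paper, here's a tiny sketch of the core COCONUT idea (reasoning in a continuous latent space) plus the kind of bottleneck I mean. This is not a replication: the GRU stands in for a transformer, the sizes are arbitrary, and the "sparse" layer is just an overcomplete ReLU bottleneck, all my own assumptions.

```python
import torch
import torch.nn as nn

vocab_size, d_model, n_latent_steps = 100, 64, 4

embed = nn.Embedding(vocab_size, d_model)
rnn = nn.GRU(d_model, d_model, batch_first=True)   # stand-in for a transformer LM
to_vocab = nn.Linear(d_model, vocab_size)

# Hypothetical sparse-ish linear bottleneck on the latent "thought" (ReLU over an
# overcomplete code, SAE-style), the part I'd add for explainability.
bottleneck = nn.Sequential(nn.Linear(d_model, 256), nn.ReLU(), nn.Linear(256, d_model))

tokens = torch.randint(0, vocab_size, (1, 10))      # toy "prompt"
out, h = rnn(embed(tokens))                         # encode the prompt
thought = out[:, -1:, :]                            # last hidden state

# COCONUT-style latent reasoning: feed the hidden state back as the next input
# embedding for a few steps instead of decoding a token at each step.
for _ in range(n_latent_steps):
    thought = bottleneck(thought)                   # inspectable sparse code in the loop
    out, h = rnn(thought, h)
    thought = out[:, -1:, :]

logits = to_vocab(thought)                          # only now decode a token
print(logits.shape)                                 # torch.Size([1, 1, vocab_size])
```

If anyone has gotten a faithful reproduction of the paper's training recipe (curriculum of replacing CoT steps with latent steps), I'd love to compare notes.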