I doubt very much that something along the lines of random seeds has made the difference in quality between the various Sonnet 3.x runs. I expect it's much more a matter of them experimenting with different datasets (including different sorts of synthetic data).
As for Opus 3.5, Dario keeps saying that it's in the works and will come out eventually. The way he says this does suggest that they've either hit some unexpected snag (as you imply) or deprioritized it because they are short on resources (engineers and compute) and decided it was better to focus on improving their smaller models. The recent release of a new Sonnet 3.5 and Haiku 3.5 nudges me toward thinking they've chosen to prioritize smaller models. The reasons for this choice are unclear. Has work been put into Opus 3.5, only for it to turn out disappointing so far? Or does the inference cost (including the opportunity cost of devoting compute to inefficient inference) make the economics look unfavorable, even though Opus 3.5 is actually working pretty well? (Probably not overwhelmingly well, or they'd likely find some way to show it off even without opening it up for public API access.)
Are they rushing now to scale inference-time compute in o1/DeepSeek style? Almost certainly; they'd be crazy not to. They've probably already done some amount of this internally to generate higher-quality synthetic reasoning-trace data. I don't know how soon we should expect to see a public-facing version of their inference-time-compute-scaled experiments, though. Maybe they'll decide to keep it internal for a while and use it to help train better versions of Sonnet and Haiku? (Maybe also a private internal version of Opus, which in turn helps generate better synthetic data?)
It's all so hard to guess at; I feel quite uncertain.