Anthropic’s … mediocre Sonnet 3.7. Well, perhaps they have even scarier models that they’re still not releasing? I mean, sure, maybe. But that’s a fairly extraordinary claim.
The base model for Sonnet 3.7 was pretrained in very early 2024, and a recent announcement confirmed that a bigger model is coming soon (which was to be expected). So the best reasoning model they have internally is better than Sonnet 3.7, though we don’t know whether it’s significantly better. They might have had it since late 2024, but without Blackwell they can’t deploy it at scale, and they are also Anthropic, so plausibly capable of withholding a model out of an abundance of caution.
The rumors about the quality of Anthropic’s reasoning models didn’t specify which model they concerned. So observing Sonnet 3.7’s reasoning is not counter-evidence to the claim that verifiable-task RL results scale well with pretraining, and only slight evidence that they don’t scale well with pure RL on an unchanged base model.
Fair.