Reading the Semianalysis post, it kind of sounds like it’s just their opinion that that’s what Anthropic did.
They say “Anthropic finished training Claude 3.5 Opus and it performed well, with it scaling appropriately (ignore the scaling deniers who claim otherwise – this is FUD)”—if they have a source for this, why don’t they mention it somewhere in the piece instead of implying that people who disagree are acting in bad faith? That reads to me like they’re trying to convince people by force of rhetoric, which typically indicates a lack of evidence.
That’s the biggest driver of my concern here, but the next paragraph also leaves me unconvinced. They go on to say “Yet Anthropic didn’t release it. This is because instead of releasing publicly, Anthropic used Claude 3.5 Opus to generate synthetic data and for reward modeling to improve Claude 3.5 Sonnet significantly, alongside user data. Inference costs did not change drastically, but the model’s performance did. Why release 3.5 Opus when, on a cost basis, it does not make economic sense to do so, relative to releasing a 3.5 Sonnet with further post-training from said 3.5 Opus?”
This does not make sense to me as a line of reasoning. I’m not aware of any reason that generating synthetic data would preclude releasing the model, and it seems obvious to me that Anthropic could adjust their pricing (or impose stricter message limits) if they would lose money by releasing at current prices. This seems to be meant as an explanation of why Anthropic delayed release of the purportedly-complete Opus model, but it doesn’t really ring true to me.
Is there some reason to believe them that I’m missing? (On a quick google it looks like none of the authors work directly for Anthropic, so it can’t be that they directly observed it as employees).