It’s interesting that 3.5 Sonnet does not seem to match, let alone beat, GPT-4o on the leaderboard (https://chat.lmsys.org/?leaderboard). Currently it shows GPT-4o at an Elo of 1287 and Claude 3.5 Sonnet at 1271.
Yeah, there’s a decent amount of debate going on about how good 3.5 Sonnet is vs 4o, or if 4o was badly underperforming its benchmarks + LMsys to begin with. Has 4o been crippled by something post-deployment?* Is this something about long-form interaction with Claude, which is missed by benchmarks and short low-effort LMsys prompts? Are Claude users especially tilting into coding now given the artifact/project features, which seems to be the main strength of Claude-3.5-Sonnet?
Every year, it seems like benchmarking powerful generalist AI systems gets substantially harder, and this may be the latest iteration of that difficulty.
(Given the level of truesight and the increasing persistence of account history, we may be approaching the point where different models give different people intrinsically different experiences—eg. something like: Claude genuinely works better for you than for me, while I genuinely find ChatGPT-4o more useful, because you happen to be politer and ask more sensible questions, treating Claude as a co-worker, which works better with Claude’s RLAIF, while RLHF has crushed GPT-4o into submission, so although it’s a worse model, it’s more robust to my roughshod treatment of it as a slave. Think of it as Heisenbugs on steroids, or operant conditioning into tacit knowledge: some people just have more mana and mechanical sympathy, and they can’t explain how or why.)
* I’ve noticed what seem like regressions in GPT-4o since launch in my Gwern.net scripts: it has gotten oddly worse at some simple tasks like guessing URLs or picking keywords to bold in abstracts, and is still failing to clean some URL titles despite ~40 few-shot examples collected from previous errors.
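For concreteness, the few-shot title-cleaning setup looks roughly like the sketch below (illustrative Python, not the actual Gwern.net script; the model name, system prompt, and example pairs are all placeholders). Each previously observed mistake gets appended to the prompt as a user/assistant example pair, so the prompt grows as errors accumulate.

```python
# Illustrative sketch of few-shot URL-title cleaning (not the actual script);
# the model name, system prompt, and example pairs are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical few-shot pairs accumulated from previously observed errors:
# (raw scraped title, desired cleaned title).
FEW_SHOT_EXAMPLES = [
    ("Attention Is All You Need - arXiv.org e-Print archive",
     "Attention Is All You Need"),
    ("GitHub - karpathy/nanoGPT: The simplest, fastest repository for "
     "training/finetuning medium-sized GPTs",
     "nanoGPT: The simplest, fastest repository for training/finetuning "
     "medium-sized GPTs"),
]

def clean_title(raw_title: str) -> str:
    """Ask the model to strip site boilerplate from a scraped page title."""
    messages = [{
        "role": "system",
        "content": ("Clean the given web page title: remove site names, "
                    "separators, and other boilerplate. "
                    "Return only the cleaned title."),
    }]
    # Every past mistake becomes a user/assistant example pair in the prompt.
    for raw, cleaned in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": raw})
        messages.append({"role": "assistant", "content": cleaned})
    messages.append({"role": "user", "content": raw_title})

    response = client.chat.completions.create(
        model="gpt-4o",    # placeholder model name
        temperature=0,     # deterministic cleanup, not creative rewriting
        messages=messages,
    )
    return response.choices[0].message.content.strip()

print(clean_title("Deep Reinforcement Learning Doesn't Work Yet | Hacker News"))
```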