Also worth noting that Claude 3 does not substantially advance the LLM capabilities frontier! [..]
I’d like to caveat the comment you quoted above:
I wrote that before I had the chance to try replacing GPT-4 with Claude 3 in my daily workflow, based on its LLM benchmark scores compared to the gpt-4-turbo variants. After having used it for a full day, I do feel like Claude 3 has noticeable advantages over GPT-4 in ways that aren’t captured by said benchmarks. So while I stand behind my claim that it “does not substantially advance the LLM capabilities frontier”, I do think that Claude 3 Opus is advancing the frontier at least a little.
In my experience, it seems to be noticeably better on coding and mathematical reasoning tasks, which was surprising to me given that it does worse on HumanEval and MATH. I guess they focused on delivering practically useful intelligence as opposed to optimizing for the benchmarks? (Or even optimized against the benchmarks?)
(EDIT: it’s also much better at convincing me that its made-up math is real, lol)