I found this interesting tweet by Yao Fu, who notes:
Benchmarks that really differentiating the model performance are MATH and GPQA. These super hard problems are what AI should be aiming next. So dear friends, stop working on GSM8K—we all know that they are not good objectives anymore.
I agree that GSM8K has been pretty saturated (for the best frontier models) since ~GPT-4, and GPQA is designed to be a hard-to-saturate benchmark (though given the pace of progress...).
But why are HumanEval and MMLU also considered saturated? E.g. Opus and 4-Turbo are both significantly better than all other publicly known models on both benchmarks. And at least for HumanEval, I don’t see why >95% accuracy isn’t feasible.
It seems plausible that MMLU/HumanEval could be saturated after GPT-4.5 or Gemini 1.5 Ultra, at least for the best frontier models, and fairly likely that we’ll see them saturated within 2-3 years. But calling them saturated right now seems like a stretch.
Is the reasoning here that Opus scores only 0.4 percentage points better on MMLU than the March GPT-4? That seems like pretty invalid reasoning, akin to deducing that because two runners achieve the same time, that time must be the best humanly achievable time. And it doesn’t apply to HumanEval, where Opus scores ~18 points better than the March GPT-4 and the November 4-Turbo scores 2.9 points better than Opus.
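For concreteness, here is a minimal sketch of the arithmetic behind those deltas, assuming the publicly reported scores from the GPT-4 technical report and the Claude 3 model card; the November 4-Turbo HumanEval figure is backed out from the 2.9-point gap above, so treat it as an assumption rather than an official number:

```python
# Benchmark scores in percent. MMLU and the March GPT-4 / Opus HumanEval
# figures are as publicly reported (GPT-4 technical report, Claude 3 model
# card); the November 4-Turbo HumanEval number is inferred from the
# 2.9-point gap cited above, so it is an assumption.
mmlu = {"gpt4_march": 86.4, "opus": 86.8}
humaneval = {"gpt4_march": 67.0, "opus": 84.9, "gpt4_turbo_nov": 84.9 + 2.9}

# Percentage-point deltas referenced in the paragraph above.
print(f"MMLU: Opus vs March GPT-4:        {mmlu['opus'] - mmlu['gpt4_march']:+.1f} pts")
print(f"HumanEval: Opus vs March GPT-4:   {humaneval['opus'] - humaneval['gpt4_march']:+.1f} pts")
print(f"HumanEval: Nov 4-Turbo vs Opus:   {humaneval['gpt4_turbo_nov'] - humaneval['opus']:+.1f} pts")
# -> +0.4, +17.9 (~18), +2.9
```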