93% in 2025 FEELS high, but … Meta was already low for 2023 (median 83%) given that GPT-4 scores 86.4%. If you plot 100% - MMLU SOTA against training compute in FLOPs (e.g. RoBERTa at 1.32*10^21 FLOPs scores 27.9%, a 72.1% gap; GPT-3.5 at 3.14*10^23, a 30% gap; GPT-4 at ~1.742*10^25, a 13.6% gap), it should take roughly 41x GPT-4's training compute to reach 93.1% … so it totally checks out.
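For concreteness, here's a minimal sketch of that extrapolation in Python. The three data points are the ones above; the log-log power-law fit is my assumption about how you'd do the plot, and the exact multiple you get depends on the fit details (two points vs. three, which variable you regress on), but it lands in the same tens-of-x ballpark:

```python
# Sketch: fit a power law gap ≈ a * C^(-b) to (training FLOPs, 100% - MMLU)
# and solve for the compute needed to shrink the gap to 100% - 93.1% = 6.9%.
# Data points are from the comment above; the fit itself is illustrative.
import numpy as np

flops = np.array([1.32e21, 3.14e23, 1.742e25])  # RoBERTa, GPT-3.5, GPT-4 (estimated)
gap   = np.array([72.1, 30.0, 13.6])            # 100% - MMLU score, percentage points

# Linear fit in log-log space: log(gap) = log_a - b * log(flops)
slope, log_a = np.polyfit(np.log(flops), np.log(gap), 1)
b = -slope  # gap shrinks as compute grows, so the slope is negative

# Compute needed for a 6.9% gap (93.1% MMLU), as a multiple of GPT-4's compute
target_gap = 100.0 - 93.1
needed_flops = np.exp((log_a - np.log(target_gap)) / b)
print(f"b ≈ {b:.2f}, needed ≈ {needed_flops / 1.742e25:.0f}x GPT-4's training compute")
```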
(My estimate of GPT-4's compute is based on the 1 trillion parameter leak, the approximate number of V100 GPUs they had (they didn't have A100s, let alone H100s, in hand during the GPT-4 training window), the possible range of training intervals, scaling laws and the training time = 8TP/nX law, etc., and I ran some squiggles over those inputs. It should be taken with a grain of salt, but the final number doesn't change in any meaningful way under reasonable assumptions, so, e.g., it might take 20x or 80x, but it's not going to take 500x GPT-4's training compute to get to 93%.)
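A toy Monte Carlo version of those squiggles might look like the following. Every range here is a placeholder assumption of mine, not a known figure; only the 1-trillion-parameter figure and the training time ≈ 8TP/(nX) heuristic come from the comment above:

```python
# Illustrative Monte Carlo estimate of GPT-4 training compute.
# All ranges are placeholder assumptions for the sketch, NOT confirmed figures.
import numpy as np

rng = np.random.default_rng(0)
samples = 100_000

P = 1e12                                          # parameters (per the leak cited above)
n_gpus  = rng.uniform(10_000, 25_000, samples)    # assumed V100 count range
x_flops = rng.uniform(30e12, 60e12, samples)      # assumed effective FLOP/s per V100
days    = rng.uniform(90, 180, samples)           # assumed training interval range

# Hardware-side estimate: total FLOPs = (# GPUs) * (FLOP/s per GPU) * (seconds)
total_flops = n_gpus * x_flops * days * 86_400

lo, med, hi = np.percentile(total_flops, [10, 50, 90])
print(f"GPT-4 training compute ~ {lo:.2e} / {med:.2e} / {hi:.2e} FLOPs (10/50/90th pct)")

# The 8TP/nX law equates delivered FLOPs with 8*T*P, so it also implies a token count.
print(f"implied training tokens ~ {np.percentile(total_flops / (8 * P), 50):.2e}")
```

The point of the spread is the one made above: reasonable choices move the median around by a factor of a few, which shifts the "how many x of GPT-4" answer but not its order of magnitude.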